katholt / sonneityping

Other
8 stars 2 forks source link

vcf_parsnp mode not working #1

Open RJBeng opened 4 years ago

RJBeng commented 4 years ago

Dear authors,

I am trying to use the sonnei_genotype code on assembled S. sonnei genomes with the vcf_parsnp mode but I am not getting any output.

I first run parsnp to align genomes to the 53G reference.

parsnp -r NC_016822.1.fna -d ./input -c -x -p 20 -o ./output

Then use harvestools to convert parsnp output to vcf file

harvesttools -i parsnp.ggr -V sonnei_parsnp.vcf

Finally, running sonnei_genotype code

python sonnei_genotype.py --mode vcf_parsnp --vcf sonnei_parsnp.vcf --allele_table alleles.txt --output genotypes.txt

This produces an error

Traceback (most recent call last): File "/SOFTWARE/sonneityping/sonnei_genotype.py", line 601, in main() File "/SOFTWARE/sonneityping/sonnei_genotype.py", line 591, in main output_file.write(strains[strain] + '\tNo SNPs encountered against expected reference. Wrong reference or no SNP calls?\n') UnboundLocalError: local variable 'strain' referenced before assignment

I don't know what I'm doing wrong, can someone please point me in the right direction.

Many thanks!

jhawkey commented 4 years ago

@RJBeng Are you able to provide us with the vcf file (or a small set of vcf files if you're trying more than one genome), so we can take a look and do some testing? You should be able to attach it to the issue.

jhawkey commented 4 years ago

Okay I've solved the issue! The problem is the reference ID that parsnp was putting in the VCF file. The code was expecting the reference ID to be 1, when in fact if you look at the VCF file, the reference ID (the value in the #CHROM column) is NC_016822.1.fasta.ref.

Adding --ref_id NC_016822.1.fasta.ref to your command allows the script to run correctly.

So your command should be

python sonnei_genotype.py --mode vcf_parsnp --vcf sonnei_parsnp.vcf --allele_table alleles.txt --output genotypes.txt --ref_id NC_016822.1.fasta.ref

I'm going to close this issue, but I will make a note that we need to do some checks for this in the future so the user gets a sensible error message.

jhawkey commented 4 years ago

Thanks @RJBeng, I can confirm that I can see this happening for a number of your strains when I run your vcf file through the code on my computer. I'm going to re-open this issue and investigate. Hopefully get back to you soon!

jhawkey commented 4 years ago

Okay @RJBeng I've gone through and updated the code logic to fix the software from erroneously reporting subclades that don't match the clade or lineage calls. I'm attaching the table that I got from the fixed code below, so you can check if your output matches.

You will notice that there are still instances where there isn't a full lineage.clade.subclade call for some of your strains. This is because these strains do not carry any more marker snps to further define their genotype, so we're just reporting the lowest level of genotype that we have the best support for. In some cases this means we're only able to get to the lineage level!

Reading the support column, and 'A' in the column indicates that the calls were made from assemblies, rather than reads. If you ran the code with bam files or a vcf file generated with reads, this column would report the proportion of reads that support the genotype. Because we just have assemblies in this case, we aren't able to calculate the support value. I've made a note that we need to include this explanation in the documentation.

Please note that the changes I've made to the code are on the development branch, so you'll need to pull from that branch in order to test. Once we've confirmed that our results match up with yours and there are no more issues, I'll merge the changes into the master branch for everyone to use.

RJBeng commented 4 years ago

Thank you for your help @jhawkey! I have looked through the output and everything seems to match.

Thought I could report some of my findings here. I also ran the code on bam and vcf files just to compare the results, I have noticed that using the bam and vcf mode was not able to assign the correct genotype to one subclade (it was all assigned to 2.8.2 which looking at the phylogeny does not seem to be correct). However, running this using the parsnp_vcf mode the subclade was assigned to the correct genotype. It appears to me that the parsnp_vcf mode works better, but was unable to provide prediction for QRDR mutations.

Also, would it be possible for you to remove the genotypes.txt file from this thread as it contains some novel data :)

jhawkey commented 4 years ago

@RJBeng No problem, I've removed the results file :)

For the samples that are being assigned the wrong subclade, are you able to provide the vcf files for just those samples? Also, what subclade do you think they should be assigned to based on the phylogeny?

And yes, the parsnp mode does not supply the QRDR mutations. I will have a chat to the rest of the group and find out if that's something we can implement or not!

katholt commented 4 years ago

Hi guys,

RJBeng commented 4 years ago

I looked through the vcf and bam files for the few isolates from that subclade and the SNP that determines Global III (3.7) is present in both files, I don't see heterogeneity at that position either.

jhawkey commented 4 years ago

@RJBeng Are these the vcf files being produced from bams by sonneityping, or are these vcf files you're just passing to sonneityping?

When I took those four vcf files and ran them through the sonneityping code (the code on the development branch), they are all given non 2.8.2 genotypes. The support for their genotypes was >0.98.

RJBeng commented 4 years ago

These are vcf files passed to sonneityping. I have not tired to run them on the development branch, will run them again and let you know what I get.

jhawkey commented 4 years ago

Okay, this may be the source of your issue with the bam files - the fixes I made regarding the 2.8.2 genotype reporting incorrectly will apply to all modes, not just the parsnp_vcf mode. These fixes are currently only in the development branch, not the master branch.

RJBeng commented 4 years ago

Hi @jhawkey, I have just ran the development branch code on the subclade and I am still getting 2.8.2 genotype on the bam and vcf mode.

jhawkey commented 4 years ago

Hi @RJBeng, I'm not sure what's happening here as I can't replicate your issue. Attached are the results for those 4 VCF files you've sent me, using the most up to date version of the development branch. As you can seen, none of these 4 strains are being called as 2.8.2 with the version of the code that I have. 2019.11.04_11.12.19_genotypes.txt

RJBeng commented 4 years ago

Hi @jhawkey I have just tried to run the code on my data again and this problem is no longer occurring :). However, I had to run this using the development branch as running the code using the master branch produced the following error:


  File "/pub38/rjbeng/SOFTWARE/sonneityping/sonnei_genotype.py", line 658, in <module>
    main()
  File "/pub38/rjbeng/SOFTWARE/sonneityping/sonnei_genotype.py", line 639, in main
    this_groups = checkSNPmulti(x, this_groups, args)
  File "/pub38/rjbeng/SOFTWARE/sonneityping/sonnei_genotype.py", line 167, in checkSNPmulti
    if snp in loci:
NameError: global name 'loci' is not defined