pcingola / SnpEff

Annotating Mycobacterium tuberculosis VCF file using snpEFF #222

Closed SafinaAr closed 4 years ago

SafinaAr commented 5 years ago

Hi, I generated my VCF files with the GATK pipeline using ploidy 1, since this is a Mycobacterium tuberculosis genome. Now I want to annotate my variants using snpEff and ANNOVAR. I searched the snpEff databases for an MTB annotation using:

java -jar snpEff.jar download -v Mycobacterium_tuberculosis

It gave me numerous results, showing that snpEff does contain MTB databases. But I'm not sure which one corresponds to the reference I used to generate the VCF file. My MTB reference genome file looks like this:

>M.tuberculosis_H37Rv NC_000962.3
ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcggtcgtctccgaacttaacggcgaccct

I tried the buildDbNcbi.sh script from snpEff to build my own database, but it produced the following error:

Downloading genome NC_000962
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17.7M    0 17.7M    0     0   157k      0 --:--:--  0:01:55 --:--:--  483k
00:00:00 SnpEff version SnpEff 4.3t (build 2017-11-24 10:18), by Pablo Cingolani
00:00:00 Command: 'build'
00:00:00 Building database for 'NC_000962'
00:00:00 Reading configuration file 'snpEff.config'. Genome: 'NC_000962'
00:00:00 Reading config file: /home/sark/snpEff/snpEff.config
00:00:01 done
No sequence found in feature file.
Trying fasta file '/home/sark/snpEff/./data/genomes/NC_000962.fa'
Trying fasta file '/home/sark/snpEff/./data/NC_000962/sequences.fa'
java.lang.RuntimeException: Cannot find sequence for 'NC_000962'
    at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.sequence(SnpEffPredictorFactoryFeatures.java:467)
    at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.addFeatures(SnpEffPredictorFactoryFeatures.java:111)
    at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.create(SnpEffPredictorFactoryFeatures.java:330)
    at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:369)
    at org.snpeff.SnpEff.run(SnpEff.java:1183)
    at org.snpeff.SnpEff.main(SnpEff.java:162)
java.lang.RuntimeException: Error reading file '/home/sark/snpEff/./data/NC_000962/genes.gbk'
java.lang.RuntimeException: Cannot find sequence for 'NC_000962'
    at org.snpeff.snpEffect.factory.SnpEffPredictorFactoryFeatures.create(SnpEffPredictorFactoryFeatures.java:344)
    at org.snpeff.snpEffect.commandLine.SnpEffCmdBuild.run(SnpEffCmdBuild.java:369)
    at org.snpeff.SnpEff.run(SnpEff.java:1183)
    at org.snpeff.SnpEff.main(SnpEff.java:162)
00:00:01 Logging
00:00:02 Checking for updates...
00:00:04 Done.

Then I placed my FASTA file in the folder mentioned in the error above, but now it gives the following error:

Downloading genome NC_000962.3
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17.7M    0 17.7M    0     0   332k      0 --:--:--  0:00:54 --:--:--  447k
curl: (16) Error in the HTTP2 framing layer

Then I thought of using the built-in database for MTB, so I renamed the chromosome in my file. Its name was M.tuberculosis_H37Rv, and I replaced it with the name used by the built-in database, ERS007734SCcontig000001. Still no success.

It generates the following error for every variant in the VCF file:

9;ANN=A||MODIFIER|||||||||||||ERROR_OUT_OF_CHROMOSOME_RANGE

Can you please help me with this?

Thank you. :)

pcingola commented 4 years ago

Closing old issues.

mbhall88 commented 1 year ago

There is Mycobacterium_tuberculosis_h37rv, which is the genome you want. However, its chromosome is named Chromosome, so you'll need to either rename the chromosome in your VCF or adjust the snpEff config.
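
For the first option, renaming the chromosome in the VCF, here is a minimal Python sketch. It is only an illustration and assumes an uncompressed, tab-separated VCF. The function name rename_vcf_chromosome and the sample lines are hypothetical. Only the two chromosome names come from this thread. The sketch rewrites the CHROM column of each record and any matching ##contig header line, and leaves everything else untouched.

```python
def rename_vcf_chromosome(lines, old_name, new_name):
    """Yield VCF lines with old_name replaced by new_name.

    Only the CHROM column (first tab-separated field) of data lines and
    matching '##contig=<ID=...' header lines are changed.
    """
    contig_prefix = f"##contig=<ID={old_name}"
    for line in lines:
        # Character right after the ID; must be ',' or '>' so that a longer
        # contig name sharing the same prefix is not rewritten by mistake.
        next_char = line[len(contig_prefix):len(contig_prefix) + 1]
        if line.startswith(contig_prefix) and next_char in (",", ">"):
            yield f"##contig=<ID={new_name}" + line[len(contig_prefix):]
        elif line.startswith("#"):
            # Other header lines pass through unchanged.
            yield line
        else:
            chrom, sep, rest = line.partition("\t")
            yield (new_name if chrom == old_name else chrom) + sep + rest


# Hypothetical sample input, made up for illustration only.
sample = [
    "##fileformat=VCFv4.2\n",
    "##contig=<ID=M.tuberculosis_H37Rv,length=4411532>\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "M.tuberculosis_H37Rv\t9\t.\tG\tA\t50\tPASS\t.\n",
]
renamed = list(rename_vcf_chromosome(sample, "M.tuberculosis_H37Rv", "Chromosome"))
print("".join(renamed), end="")
```

To apply this to a real file, open the input VCF for reading, pass the file object as lines, and write each yielded line to a new file. The renamed file can then be annotated in the usual way, for example with java -jar snpEff.jar Mycobacterium_tuberculosis_h37rv renamed.vcf > annotated.vcf, where renamed.vcf and annotated.vcf are placeholder file names.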