pcingola / SnpEff


Building database from GenBank file #263

Closed xiaolinchu92 closed 3 years ago

xiaolinchu92 commented 3 years ago

Hi,

I would like to use snpEff to classify variants as synonymous or non-synonymous mutations and to see the amino acid changes they cause. I built a database for Pseudomonas phage phi-2 (https://www.ncbi.nlm.nih.gov/nuccore/281306659/) following the online instructions: https://pcingola.github.io/SnpEff/se_buildingdb/#option-4-building-a-database-from-genbank-files.

The .gbk file was downloaded from ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/pseudomonas_phage_phi_2_uid42717 and saved as snpEff/data/NC_013638.1/genes.gbk

The following information was added to the snpEff.config file: [image]

Then I created the database with the following command: java -jar snpEff.jar build -genbank -v NC_013638.1

A .bin file was created (snpEff/data/NC_013638.1/snpEffectPredictor.bin), and the build printed the following log:

    java -jar snpEff.jar build -genbank -v NC_013638.1
    00:00:00 SnpEff version SnpEff 5.0 (build 2020-10-04 16:02), by Pablo Cingolani
    00:00:00 Command: 'build'
    00:00:00 Building database for 'NC_013638.1'
    00:00:00 Reading configuration file 'snpEff.config'. Genome: 'NC_013638.1'
    00:00:00 Reading config file: /data/home/chuxl/software/snpEff/snpEff.config
    00:00:02 done
    Chromosome: 'NC_013638.1 GI:281306659' length: 43144

    Create exons from CDS (if needed): ...........................................
    Exons created for 43 transcripts.

    Deleting redundant exons (if needed):
            Total transcripts with deleted exons: 0

    Collapsing zero length introns (if needed):
            Total collapsed transcripts: 0
            Adding genomic sequences to exons:      Done (43 sequences added, 0 ignored).

    Adjusting transcripts:
    Adjusting genes:
    Adjusting chromosomes lengths:
    Ranking exons:
    Create UTRs from CDS (if needed):
    Remove empty chromosomes:
            Removing empty chromosome: 'NC_013638.1'
            Chromosome left: NC_013638.1  GI:281306659

    Marking as 'coding' from CDS information:
    Done: 0 transcripts marked

    00:00:02 Caracterizing exons by splicing (stage 1) :
    00:00:02 Caracterizing exons by splicing (stage 2) :
    00:00:02 done.
    00:00:02 [Optional] Rare amino acid annotations
    00:00:02 Warning: Cannot read optional protein sequence file '/data/home/chuxl/software/snpEff/./data/NC_013638.1/protein.fa', nothing done.
    00:00:02 Protein check file: '/data/home/chuxl/software/snpEff/./data/NC_013638.1/genes.gbk'
    00:00:02 Checking database using protein sequences
    00:00:02 Comparing Proteins...
            Labels: '+' : OK    '.' : Missing    '*' : Error
            +++++++++++++++++++++++++++++++++++++++++++
            Protein check:  NC_013638.1     OK: 43  Not found: 0    Errors: 0       Error percentage: 0.0%
    00:00:02 Saving database
    00:00:03 [Optional] Reading regulation elements: GFF
    00:00:03 Warning: Cannot read optional regulation file '/data/home/chuxl/software/snpEff/./data/NC_013638.1/regulation.gff', nothing done.
    00:00:03 [Optional] Reading regulation elements: BED
    00:00:03 Cannot find optional regulation dir '/data/home/chuxl/software/snpEff/./data/NC_013638.1/regulation.bed/', nothing done.
    00:00:03 [Optional] Reading motifs: GFF
    00:00:03 Warning: Cannot open PWMs file /data/home/chuxl/software/snpEff/./data/NC_013638.1/pwms.bin. Nothing done
    00:00:03 Done
    00:00:03 Logging
    00:00:04 Checking for updates...
    00:00:07 Done.
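The build log names the chromosome 'NC_013638.1 GI:281306659', which looks like a GenBank VERSION header line with the GI number attached. Whether SnpEff takes the name from that line is an assumption here, not something the log confirms. A minimal Python sketch for printing the header lines of genes.gbk so they can be compared with the name in the log (the function name `genbank_header_fields` is made up for this example):

```python
from typing import Dict, Iterable


def genbank_header_fields(lines: Iterable[str]) -> Dict[str, str]:
    """Collect the LOCUS, ACCESSION and VERSION header lines of the first record.

    Assumption: the record follows the usual GenBank flat-file layout, where
    these keywords start a line and the header ends at the FEATURES line.
    """
    wanted = ("LOCUS", "ACCESSION", "VERSION")
    fields: Dict[str, str] = {}
    for line in lines:
        if line.startswith("FEATURES"):
            break  # header is over, the feature table starts here
        for key in wanted:
            if line.startswith(key) and key not in fields:
                fields[key] = line[len(key):].strip()
    return fields


if __name__ == "__main__":
    import sys

    # Usage: python gbk_header.py snpEff/data/NC_013638.1/genes.gbk
    with open(sys.argv[1]) as handle:
        for key, value in genbank_header_fields(handle).items():
            # repr() makes embedded or doubled spaces visible
            print(f"{key}\t{value!r}")
```

If the VERSION value printed here matches the chromosome name in the build log, the extra 'GI:...' suffix is a likely place to look when names fail to match later.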

However, when I annotated the VCF file with java -Xmx10g -jar snpEff.jar -v NC_013638.1 -c snpEff.config -i vcf AA1_bwa.filt.vcf > AA1.vari.vcf, I got errors: [image] or [image]

The log lines are:

    00:00:00 SnpEff version SnpEff 5.0 (build 2020-10-04 16:02), by Pablo Cingolani
    00:00:00 Command: 'ann'
    00:00:00 Reading configuration file 'snpEff.config'. Genome: 'NC_013638.1'
    00:00:00 Reading config file: /data/home/chuxl/software/snpEff/snpEff.config
    00:00:02 done
    00:00:02 Reading database for genome version 'NC_013638.1' from file '/data/home/chuxl/software/snpEff/./data/NC_013638.1/snpEffectPredictor.bin' (this might take a while)
    00:00:02 done
    00:00:02 Loading Motifs and PWMs
    00:00:02 Building interval forest
    00:00:02 done.
    00:00:02 Genome stats :
    -----------------------------------------------
    Genome name : 'phagephi2'
    Genome version : 'NC_013638.1'
    Genome ID : 'NC_013638.1[0]'
    Has protein coding info : true
    Has Tr. Support Level info : true
    Genes : 43
    Protein coding genes : 43
    -----------------------------------------------
    Transcripts : 43
    Avg. transcripts per gene : 1.00
    TSL transcripts : 0
    -----------------------------------------------
    Checked transcripts :
    AA sequences : 43 ( 100.00% )
    DNA sequences : 0 ( 0.00% )
    -----------------------------------------------
    Protein coding transcripts : 43
    Length errors : 0 ( 0.00% )
    STOP codons in CDS errors : 0 ( 0.00% )
    START codon errors : 0 ( 0.00% )
    STOP codon warnings : 0 ( 0.00% )
    UTR sequences : 0 ( 0.00% )
    Total Errors : 0 ( 0.00% )
    WARNING : No protein coding transcript has UTR
    -----------------------------------------------
    Cds : 43
    Exons : 43
    Exons with sequence : 43
    Exons without sequence : 0
    Avg. exons per transcript : 1.00
    WARNING : No mitochondrion chromosome found
    -----------------------------------------------
    Number of chromosomes : 2
    Chromosomes : Format 'chromo_name size codon_table'
    'NC_013638.1 GI:281306659' 43144 Bacterial_and_Plant_Plastid
    'NC_013638.1' 1 Bacterial_and_Plant_Plastid
    -----------------------------------------------
    00:00:02 Predicting variants

    ERRORS: Some errors were detected
    Error type                    Number of errors
    ERROR_CHROMOSOME_NOT_FOUND    5

    00:00:02 Creating summary file: snpEff_summary.html
    00:00:03 Creating genes file: snpEff_genes.txt
    00:00:03 done.
    00:00:03 Logging
    00:00:04 Checking for updates...
    00:00:07 Done.
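The genome stats list two chromosomes, 'NC_013638.1 GI:281306659' (43144 bases) and 'NC_013638.1' (1 base), and the run ends with ERROR_CHROMOSOME_NOT_FOUND. A rough Python sketch for checking which CHROM values in the VCF have no exact counterpart among those names follows. The helper names are made up for this example, and exact string comparison is a simplification: SnpEff may normalize chromosome names before looking them up, so this check is only a first diagnostic.

```python
from collections import Counter
from typing import Iterable, Set


def vcf_chromosome_counts(lines: Iterable[str]) -> Counter:
    """Count records per CHROM value (first tab-separated column), skipping headers."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # meta-information, column header, or blank line
        counts[line.rstrip("\n").split("\t", 1)[0]] += 1
    return counts


def unmatched(counts: Counter, database_names: Set[str]) -> Counter:
    """Keep only the CHROM values that are not in the database name set."""
    return Counter({name: n for name, n in counts.items() if name not in database_names})


if __name__ == "__main__":
    import sys

    # Names copied from the 'Chromosomes' section of the genome stats above.
    db_names = {"NC_013638.1 GI:281306659", "NC_013638.1"}
    # Usage: python check_chroms.py AA1_bwa.filt.vcf
    with open(sys.argv[1]) as handle:
        counts = vcf_chromosome_counts(handle)
    for name, n in unmatched(counts, db_names).items():
        print(f"{name!r}: {n} record(s) with no exact match in the database")
```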

So basically, nothing was annotated. All the information I have about the reference sequence comes from the NCBI link above. How can I get the annotation details for the variants?

Thanks, Xiaolin

pcingola commented 3 years ago

https://github.com/pcingola/SnpEff/wiki/ERROR_CHROMOSOME_NOT_FOUND
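As the error name suggests, ERROR_CHROMOSOME_NOT_FOUND points to variants whose chromosome name is not among the chromosomes in the database. One possible workaround, assuming the mismatch is purely a naming difference, is to rewrite the CHROM column of the VCF so that it uses a name the database knows. The Python sketch below is an illustration only: the function name `rename_chromosomes` is made up, and the old and new names are supplied by the user because the thread does not show which name the VCF actually uses.

```python
import re
from typing import Dict, Iterable, Iterator

# Matches the ID value of a '##contig=<ID=...>' header line.
CONTIG_ID = re.compile(r"##contig=<ID=([^,>]+)")


def rename_chromosomes(lines: Iterable[str], mapping: Dict[str, str]) -> Iterator[str]:
    """Yield VCF lines with chromosome names replaced according to `mapping`."""
    for line in lines:
        match = CONTIG_ID.match(line)
        if match:
            # Keep the contig header consistent with the renamed records.
            name = match.group(1)
            yield line[:match.start(1)] + mapping.get(name, name) + line[match.end(1):]
        elif line.startswith("#"):
            yield line  # other header lines pass through unchanged
        else:
            chrom, sep, rest = line.partition("\t")
            yield mapping.get(chrom, chrom) + sep + rest


if __name__ == "__main__":
    import sys

    # Usage: python rename_chroms.py OLD_NAME NEW_NAME < in.vcf > out.vcf
    old_name, new_name = sys.argv[1], sys.argv[2]
    sys.stdout.writelines(rename_chromosomes(sys.stdin, {old_name: new_name}))
```

Names missing from the mapping are left untouched, so running the script with a wrong old name changes nothing rather than corrupting the file.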