pcingola / SnpEff

Other
257 stars 81 forks source link

version 5.1 fails to build data from genbank whereas the same command in 5.0 works #388

Closed ialbert closed 2 years ago

ialbert commented 2 years ago

Describe the bug Version 5.1 of snpEff is unable to build a database from a genbank file whereas 5.0 worked.

To Reproduce

# Create the data folder
mkdir -p data/AF086833

# Fetch the genbank file
efetch -db nuccore -id AF086833 -format gb > data/AF086833/genes.gbk

# Create the config file.
echo "AF086833.genome : AF086833" > snpEff.config

# snpEff v5.0e (works)
snpEff build -genbank -v AF086833

# snpEff v5.1 (fails)
snpEff build -genbank -v AF086833

the process fails with an error.

In addition when I run with 5.1 I am getting lots of error messages of the form:

...
WARNING_GENE_COORDINATES: Gene 'Gene_8287_9739' (name:'VP30'), adjusting start coordinate from 8287 to 8508
WARNING_GENE_COORDINATES: Gene 'Gene_8287_9739' (name:'VP30'), adjusting end coordinate from 9739 to 9374
...

that do not appear in 5.0.

Additional context

In a related note, I will also say that the documentation for using GenBank files seems to be incorrect.

http://pcingola.github.io/SnpEff/se_buildingdb/#option-4-building-a-database-from-genbank-files

for example the command in the docs instructs the user to use -v CP000730 as id, but the config file for that same example does not have that id.

deborahmleigh commented 2 years ago

Yes I am also having the same issue, I have gone through all the steps with multiple genomes and several debug threads but build is ending with:

00:01:25 Caracterizing exons by splicing (stage 1) : .................................................................................................... 100000 .................................................................................................... 200000 .................................................................................................... 300000 ... 00:01:25 Caracterizing exons by splicing (stage 2) : .................................................................................................... 100000 .................................................................................................... 200000 .................................................................................................... 300000 ...00:01:26 done. 00:01:26 [Optional] Rare amino acid annotations WARNING_FILE_NOT_FOUND: Rare Amino Acid analysis: Cannot read protein sequence file '/home/leigh/software.Mac/snpEff/data/Fagus_sylvatica.1.3/protein.fa', nothing done. ERROR: CDS check file '/home/leigh/software.Mac/snpEff/data/Fagus_sylvatica.1.3/cds.fa' not found. ERROR: Protein check file '/home/leigh/software.Mac/snpEff/data/Fagus_sylvatica.1.3/protein.fa' not found. ERROR: Database check failed. 00:01:26 Logging 00:01:27 Checking for updates... 00:01:28 Done.

pcingola commented 2 years ago

There are a few things happening here:

1) From version 5.1 onwards, SnpEFf will refuse to save a database that has more than 5% of errors, when comparing CDS ot Protein sequences.

2) The Genbank file for AF086833 has an "error" in the definition of one of the CDS, so there is one error out of the 9 proteins. Since 1 / 9 = 11.1% is larger than 5%, SnpEFf refuses to save the database and exits with an error.

Details of the problem

As it happens with other Ebola viruses, there is a Ribosomal Slippage in "GP" (for instance see https://www.ncbi.nlm.nih.gov/nuccore/KJ660346.2)

Unfortunately, the CDS definition is (line 168 from https://www.ncbi.nlm.nih.gov/nuccore/10141003) was not annotated pro perly:

     CDS             join(6039..6923,6923..8068)
                     /gene="GP"
                     /function="receptor binding and fusion"
                     /note="an addition A residue is inserted during
                     transcription; encodes two disulfide linked subunits GP1
                     and GP2"
....

So, this entry is missing the /ribosomal_slippage annotation. If you add that to line 170, everything works OK:

     CDS             join(6039..6923,6923..8068)
                     /gene="GP"
                     /ribosomal_slippage
                     /function="receptor binding and fusion"
                     /note="an addition A residue is inserted during
                     transcription; encodes two disulfide linked subunits GP1
                     and GP2"
                     /citation=[2]
                     /citation=[3]
                     /citation=[4]
                     /codon_start=1
                     /product="virion spike glycoprotein precursor"
                     /protein_id="AAD14585.1"
                     /translation="MGVTGILQLPRDRFKRTSFFLWVIILFQRTFSIPLGVIHNSTLQ
                     VSDVDKLVCRDKLSSTNQLRSVGLNLEGNGVATDVPSATKRWGFRSGVPPKVVNYEAG
                     EWAENCYNLEIKKPDGSECLPAAPDGIRGFPRCRYVHKVSGTGPCAGDFAFHKEGAFF
                     LYDRLASTVIYRGTTFAEGVVAFLILPQAKKDFFSSHPLREPVNATEDPSSGYYSTTI
                     RYQATGFGTNETEYLFEVDNLTYVQLESRFTPQFLLQLNETIYTSGKRSNTTGKLIWK
                     VNPEIDTTIGEWAFWETKKNLTRKIRSEELSFTVVSNGAKNISGQSPARTSSDPGTNT
                     TTEDHKIMASENSSAMVQVHSQGREAAVSHLTTLATISTSPQSLTTKPGPDNSTHNTP
                     VYKLDISEATQVEQHHRRTDNDSTASDTPSATTAAGPPKAENTNTSKSTDFLDPATTT
                     SPQNHSETAGNNNTHHQDTGEESASSGKLGLITNTIAGVAGLITGGRRTRREAIVNAQ
                     PKCNPNLHYWTTQDEGAAIGLAWIPYFGPAAEGIYIEGLMHNQDGLICGLRQLANETT
                     QALQLFLRATTELRTFSILNRKAIDFLLQRWGGTCHILGPDCCIEPHDWTKNITDKID
                     QIIHDFVDKTLPDQGDNDNWWTGWRQWIPAGIGVTGVIIAVIALFCICKFVF"

Here is the output of my build command (make sure you use the latest version "5.1 D"):

$ snpeff build -v AF086833
00:00:00 SnpEff version SnpEff 5.1d (build 2022-04-19 15:49), by Pablo Cingolani
00:00:00 Command: 'build'
00:00:00 Building database for 'AF086833'
00:00:00 Reading configuration file 'snpEff.config'. Genome: 'AF086833'
...

00:00:00 Chromosome: 'AF086833.2'   length: 18959
00:00:00 Create exons from CDS (if needed):
...
00:00:01 CDS check: GenBank file format, skipping

00:00:01 Protein check file: './data/AF086833/genes.gbk'

00:00:01 Checking database using protein sequences
00:00:01 Comparing Proteins...
    Labels:
        '+' : OK
        '.' : Missing
        '*' : Error
    +++++++++

    Protein check:  AF086833    OK: 9   Not found: 0    Errors: 0   Error percentage: 0.0%
00:00:01 Saving database
00:00:01 Saving sequences for small chromosmes to file './data/AF086833/sequence.bin'
pcingola commented 2 years ago

I've uploaded the database, you can download it:

$ snpeff download -v AF086833
00:00:00 SnpEff version SnpEff 5.1d (build 2022-04-19 15:49), by Pablo Cingolani
00:00:00 Command: 'download'
00:00:00 Reading configuration file 'snpEff.config'. Genome: 'AF086833'
00:00:00 done
00:00:00 Downloading database for 'AF086833'
00:00:00 Downloading from 'https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_AF086833.zip' to local file '/var/folders/s9/y0bgs3l55rj_jkkkxr2drz4157r1dz/T//snpEff_v5_1_AF086833.zip'
00:00:00 Connecting to https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_AF086833.zip
...
00:00:01 Download finished. Total 12027 bytes.
00:00:01 Extracting file 'data/AF086833/sequence.bin'
00:00:01 Extracting file 'data/AF086833/snpEffectPredictor.bin'
00:00:01 Unzip: OK
...
00:00:02 Done.
ialbert commented 2 years ago

thanks for tracking down the problem. It is pretty mindboggling that errors like the above exist and are unfixed for decades.

humbleflowers commented 5 months ago

Hello @pcingola

I am facing the same issue with this new version of Candida auris genome (April 2024) https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_002759435.3/

Please can you let me know fix and help me build it as well?

Thank you.