Closed ialbert closed 2 years ago
Yes I am also having the same issue, I have gone through all the steps with multiple genomes and several debug threads but build is ending with:
00:01:25 Caracterizing exons by splicing (stage 1) : .................................................................................................... 100000 .................................................................................................... 200000 .................................................................................................... 300000 ... 00:01:25 Caracterizing exons by splicing (stage 2) : .................................................................................................... 100000 .................................................................................................... 200000 .................................................................................................... 300000 ...00:01:26 done. 00:01:26 [Optional] Rare amino acid annotations WARNING_FILE_NOT_FOUND: Rare Amino Acid analysis: Cannot read protein sequence file '/home/leigh/software.Mac/snpEff/data/Fagus_sylvatica.1.3/protein.fa', nothing done. ERROR: CDS check file '/home/leigh/software.Mac/snpEff/data/Fagus_sylvatica.1.3/cds.fa' not found. ERROR: Protein check file '/home/leigh/software.Mac/snpEff/data/Fagus_sylvatica.1.3/protein.fa' not found. ERROR: Database check failed. 00:01:26 Logging 00:01:27 Checking for updates... 00:01:28 Done.
There are a few things happening here:
1) From version 5.1 onwards, SnpEFf will refuse to save a database that has more than 5% of errors, when comparing CDS ot Protein sequences.
2) The Genbank file for AF086833 has an "error" in the definition of one of the CDS, so there is one error out of the 9 proteins. Since 1 / 9 = 11.1% is larger than 5%, SnpEFf refuses to save the database and exits with an error.
As it happens with other Ebola viruses, there is a Ribosomal Slippage in "GP" (for instance see https://www.ncbi.nlm.nih.gov/nuccore/KJ660346.2)
Unfortunately, the CDS definition is (line 168 from https://www.ncbi.nlm.nih.gov/nuccore/10141003) was not annotated pro perly:
CDS join(6039..6923,6923..8068)
/gene="GP"
/function="receptor binding and fusion"
/note="an addition A residue is inserted during
transcription; encodes two disulfide linked subunits GP1
and GP2"
....
So, this entry is missing the /ribosomal_slippage
annotation. If you add that to line 170, everything works OK:
CDS join(6039..6923,6923..8068)
/gene="GP"
/ribosomal_slippage
/function="receptor binding and fusion"
/note="an addition A residue is inserted during
transcription; encodes two disulfide linked subunits GP1
and GP2"
/citation=[2]
/citation=[3]
/citation=[4]
/codon_start=1
/product="virion spike glycoprotein precursor"
/protein_id="AAD14585.1"
/translation="MGVTGILQLPRDRFKRTSFFLWVIILFQRTFSIPLGVIHNSTLQ
VSDVDKLVCRDKLSSTNQLRSVGLNLEGNGVATDVPSATKRWGFRSGVPPKVVNYEAG
EWAENCYNLEIKKPDGSECLPAAPDGIRGFPRCRYVHKVSGTGPCAGDFAFHKEGAFF
LYDRLASTVIYRGTTFAEGVVAFLILPQAKKDFFSSHPLREPVNATEDPSSGYYSTTI
RYQATGFGTNETEYLFEVDNLTYVQLESRFTPQFLLQLNETIYTSGKRSNTTGKLIWK
VNPEIDTTIGEWAFWETKKNLTRKIRSEELSFTVVSNGAKNISGQSPARTSSDPGTNT
TTEDHKIMASENSSAMVQVHSQGREAAVSHLTTLATISTSPQSLTTKPGPDNSTHNTP
VYKLDISEATQVEQHHRRTDNDSTASDTPSATTAAGPPKAENTNTSKSTDFLDPATTT
SPQNHSETAGNNNTHHQDTGEESASSGKLGLITNTIAGVAGLITGGRRTRREAIVNAQ
PKCNPNLHYWTTQDEGAAIGLAWIPYFGPAAEGIYIEGLMHNQDGLICGLRQLANETT
QALQLFLRATTELRTFSILNRKAIDFLLQRWGGTCHILGPDCCIEPHDWTKNITDKID
QIIHDFVDKTLPDQGDNDNWWTGWRQWIPAGIGVTGVIIAVIALFCICKFVF"
Here is the output of my build command (make sure you use the latest version "5.1 D"):
$ snpeff build -v AF086833
00:00:00 SnpEff version SnpEff 5.1d (build 2022-04-19 15:49), by Pablo Cingolani
00:00:00 Command: 'build'
00:00:00 Building database for 'AF086833'
00:00:00 Reading configuration file 'snpEff.config'. Genome: 'AF086833'
...
00:00:00 Chromosome: 'AF086833.2' length: 18959
00:00:00 Create exons from CDS (if needed):
...
00:00:01 CDS check: GenBank file format, skipping
00:00:01 Protein check file: './data/AF086833/genes.gbk'
00:00:01 Checking database using protein sequences
00:00:01 Comparing Proteins...
Labels:
'+' : OK
'.' : Missing
'*' : Error
+++++++++
Protein check: AF086833 OK: 9 Not found: 0 Errors: 0 Error percentage: 0.0%
00:00:01 Saving database
00:00:01 Saving sequences for small chromosmes to file './data/AF086833/sequence.bin'
I've uploaded the database, you can download it:
$ snpeff download -v AF086833
00:00:00 SnpEff version SnpEff 5.1d (build 2022-04-19 15:49), by Pablo Cingolani
00:00:00 Command: 'download'
00:00:00 Reading configuration file 'snpEff.config'. Genome: 'AF086833'
00:00:00 done
00:00:00 Downloading database for 'AF086833'
00:00:00 Downloading from 'https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_AF086833.zip' to local file '/var/folders/s9/y0bgs3l55rj_jkkkxr2drz4157r1dz/T//snpEff_v5_1_AF086833.zip'
00:00:00 Connecting to https://snpeff.blob.core.windows.net/databases/v5_1/snpEff_v5_1_AF086833.zip
...
00:00:01 Download finished. Total 12027 bytes.
00:00:01 Extracting file 'data/AF086833/sequence.bin'
00:00:01 Extracting file 'data/AF086833/snpEffectPredictor.bin'
00:00:01 Unzip: OK
...
00:00:02 Done.
thanks for tracking down the problem. It is pretty mindboggling that errors like the above exist and are unfixed for decades.
Hello @pcingola
I am facing the same issue with this new version of Candida auris genome (April 2024) https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_002759435.3/
Please can you let me know fix and help me build it as well?
Thank you.
Describe the bug Version 5.1 of snpEff is unable to build a database from a genbank file whereas 5.0 worked.
To Reproduce
the process fails with an error.
In addition when I run with 5.1 I am getting lots of error messages of the form:
that do not appear in 5.0.
Additional context
In a related note, I will also say that the documentation for using GenBank files seems to be incorrect.
http://pcingola.github.io/SnpEff/se_buildingdb/#option-4-building-a-database-from-genbank-files
for example the command in the docs instructs the user to use
-v CP000730
as id, but the config file for that same example does not have that id.