Open peterjc opened 7 years ago
Hi Peter, A few months ago they blocked high-level taxa; apparently they want you to use more specific taxa. For completely new species there's a chicken-and-egg problem. In the olden days every assembly got a new taxon ID (which is why there are nearly 2 million). However, NCBI (who assign taxon IDs) now demand a publication before they will grant one, so you have to use a temporary taxon, then update it later. It's quite convoluted.
As for strain, we submit using their API, so we have to provide a header in the EMBL file, which then gets overwritten with whatever metadata is in the BioSample. It's possible they have moved the goalposts again in the week since we last submitted data...
Ah. My hunch was right, and yes - this is exactly the chicken-and-egg situation I am facing.
Could you elaborate on what you meant by using a temporary taxon?
See https://github.com/enasequence/sequencetools/issues/15
This error turned out to be caused by the validator's internal settings:
ERROR: Scientific_name "Serratia sp." is not submittable. (MasterEntrySourceCheck_2)
However, to avoid this error I currently need to manually edit the source feature in my EMBL file:
ERROR: At least one of the following qualifiers "strain, environmental_sample, isolate" must exist when organism belongs to Bacteria. (OrganismAndRequiredQualifierCheck)
Perhaps for people like me using the ENA Webin (web interface), rather than the API, there needs to be an extra set of options on `gff3_to_embl` to record the strain, environmental sample or isolate fields?
[Update: Human error, see below - I was not giving the full organism name to `gff3_to_embl`]
(I've not actually submitted this new sequence yet - but I intend to try using the genus level taxid as before)
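For future readers who hit the same qualifier error: the manual edit described above can be scripted. This is only a minimal sketch, not part of `gff3_to_embl`. The function name `add_source_qualifier` is hypothetical, and it assumes the `/organism` value fits on one line and comes before any other multi-line qualifier in the `source` feature.

```python
import re

def add_source_qualifier(embl_text: str, key: str, value: str) -> str:
    """Insert /key="value" after the /organism line of each source feature.

    Hypothetical helper: assumes a single-line /organism qualifier.
    """
    out = []
    in_source = False
    for line in embl_text.splitlines():
        out.append(line)
        if re.match(r"FT\s+source\s", line):
            in_source = True                      # entering a source feature
        elif re.match(r"FT\s+[^/\s]", line):
            in_source = False                     # any other feature key ends it
        elif in_source:
            m = re.match(r"(FT\s+)/organism=", line)
            if m:
                # Reuse the indentation of the /organism line for the new qualifier.
                out.append(f'{m.group(1)}/{key}="{value}"')
                in_source = False                 # insert once per source feature
    return "\n".join(out) + "\n"

sample = (
    'FT   source          1..240\n'
    'FT                   /organism="Serratia sp."\n'
    'FT                   /mol_type="genomic DNA"\n'
    'FT   tRNA            143..218\n'
)
patched = add_source_qualifier(sample, "strain", "XYZ")
print(patched)
```

The other lines are passed through unchanged, so the result can be diffed against the original file before validating it again.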
Hi Peter, I can't replicate your error with the latest version of the validator. The following EMBL file validates fine (without strain, environmental_sample or isolate). Might there be another issue somewhere?
ID XXX; XXX; circular; genomic DNA; STD; PROK; 240 BP.
XX
AC XXX;
XX
AC * _ERS111111SCcontig000001
XX
PR Project:PRJEB1111;
XX
DE XXX;
XX
RN [1]
RA Pathogen Genomics;
RT "Draft assembly annotated with Prokka";
RL Submitted (24-Nov-2016) to the INSDC.
XX
FH Key Location/Qualifiers
FH
FT source 1..240
FT /organism="Staphylococcus aureus"
FT /mol_type="genomic DNA"
FT /db_xref="taxon:1280"
FT /note="ERS11111|SC|contig000001"
FT tRNA 143..218
FT /product="tRNA-Val(tac)"
FT /inference="COORDINATES:profile:Aragorn:1.2.36"
FT /locus_tag="SAMEA1111111_00001"
SQ Sequence 240 BP; 60 A; 60 C; 60 G; 60 T; 0 other;
aatctacatt catatgtctg gtgactatag caaggaggtc acacctgttc ccatgccgaa 60
cacagaagtt aagctcctta gcgtcgatgg tagttggact tacgttccgc tagagtagaa 120
cgttgccagg caatgataaa tcggagaatt agctcagctg ggagagcatc tgccttacaa 180
gcagagggtc ggcggttcga acccgtcatt ctccaccatt tattcttaca tattgccggc 240
//
If you could edit your example above on GitHub to wrap it in triple back-ticks, GitHub will render it as a code block, and preserve the white space (so I can copy and paste it for testing here).
I suspect the key difference is your example has a taxid for a full species name, Staphylococcus aureus taxon 1280.
What happens if you change the example to pretend you have a new species/strain without a pre-existing taxon id, say Staphylococcus sp. XYZ, and try either taxon 1279 (Staphylococcus) or 29387 (Staphylococcus sp.)?
Here's the file (as a file): example_embl.txt
So the genus taxon 1279 (Staphylococcus) gets through the validator, but you'll get an email in a few days/weeks informing you that the 'computer says NO'.
Confirmed using `embl-api-validator-1.1.150.jar`. Likewise, using taxon 613 and Serratia sp. XYZ passes validation:
FT source 1..240
FT /organism="Serratia sp. XYZ"
FT /mol_type="genomic DNA"
FT /db_xref="taxon:613"
This was my problematic version:
FT source 1..5090820
FT /organism="Serratia sp."
FT /mol_type="genomic DNA"
FT /db_xref="taxon:613"
I can pass validation by adding `/strain="XYZ"` (as mentioned above) or more simply by giving the full organism name as `/organism="Serratia sp. XYZ"`. With hindsight this seems obvious; your example was very helpful, thank you.
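For future readers, here is a rough local pre-check that mirrors only the cases reported in this thread: a bare "Genus sp." organism failed unless the `source` feature also carried a strain, environmental_sample or isolate qualifier. It is a sketch, not the validator's actual logic, and the function names are hypothetical.

```python
import re

IDENTIFYING = ("strain", "environmental_sample", "isolate")

def source_qualifiers(embl_text: str) -> dict:
    """Return {qualifier: value} for the first source feature (single-line values only)."""
    quals = {}
    in_source = False
    for line in embl_text.splitlines():
        if re.match(r"FT\s+source\s", line):
            in_source = True
            continue
        if in_source:
            m = re.match(r'FT\s+/(\w+)(?:="?([^"]*)"?)?\s*$', line)
            if not m:
                break                             # next feature or end of table
            quals[m.group(1)] = m.group(2) or ""
    return quals

def looks_submittable(embl_text: str) -> bool:
    """Heuristic from this thread only: bare 'Genus sp.' needs an identifying qualifier."""
    quals = source_qualifiers(embl_text)
    bare_sp = quals.get("organism", "").endswith(" sp.")
    has_identifier = any(q in quals for q in IDENTIFYING)
    return (not bare_sp) or has_identifier

problematic = (
    'FT source 1..5090820\n'
    'FT /organism="Serratia sp."\n'
    'FT /mol_type="genomic DNA"\n'
    'FT /db_xref="taxon:613"\n'
)
with_full_name = problematic.replace('"Serratia sp."', '"Serratia sp. XYZ"')
with_strain = problematic + 'FT /strain="XYZ"\n'

for label, text in [("problematic", problematic),
                    ("full name", with_full_name),
                    ("strain added", with_strain)]:
    print(label, looks_submittable(text))
```

A passing pre-check says nothing about the later manual review of the taxon; it only catches the naming slip described above before running the real validator.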
So there were at least two problems: I was not telling `gff3_to_embl` the full organism name, and the version of the validator I was using was (wrongly) being too strict.
I hope to submit this week, anticipating a query back about this being a novel species without a taxon ID. I will report back later with an update for future readers of this issue. Thanks!
Good luck with your submission!
Update on the ENA side of interest: http://listserver.ebi.ac.uk/pipermail/ena-announce/2017-January/000165.html
Thanks
(Some months back I did this successfully to submit a new strain from a different genus, so while I might be doing something wrong/different, I suspect the ENA validator has become stricter in the meantime)
For an un-named Serratia which does not (yet) have a unique NCBI taxonomy entry, the parent would be Serratia, taxid 613, https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=613&lvl=3&lin=f&keep=1&srchmode=1&unlock
I have tried that, and the entry Serratia sp., taxid 616, https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=616&lvl=5&lin=f&keep=1&srchmode=1&unlock
Either taxid approach fails validation:
Here line 17 was the `source` feature. Manually editing the EMBL file to add a `strain` qualifier to the feature worked for me, but what exactly it wants for the species name eludes me. Am I missing something simple?
[Update: Yes, I was not giving the full organism name to `gff3_to_embl`, but also there was a problem with this version of the validator]
Should `gff3_to_embl` have options for inserting source feature qualifiers "strain, environmental_sample, isolate" (or should I have done this in prokka)? Thanks!
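For future readers: the fix noted in the update amounts to passing the complete organism name, strain designation included, as a single quoted argument. This sketch only builds and prints such a command. The positional order (organism, taxon ID, project accession, description, GFF file) is an assumption to check against `gff3_to_embl --help`, and the project accession, description and file name are placeholders.

```python
import shlex

def gff3_to_embl_command(organism, taxid, project, description, gff_path):
    """Build an argument list; the positional order is an assumption, verify with --help."""
    return ["gff3_to_embl", organism, str(taxid), project, description, gff_path]

cmd = gff3_to_embl_command(
    "Serratia sp. XYZ",                 # full name, not just "Serratia sp."
    613,                                # genus-level taxid used in this thread
    "PRJEB1111",                        # placeholder project accession
    "Serratia sp. XYZ draft genome",    # placeholder description
    "annotation.gff",                   # placeholder path to the Prokka GFF3 file
)

# shlex.join quotes arguments containing spaces, so the organism stays one argument.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would avoid shell quoting altogether.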