Open cpavloud opened 2 years ago
@cpavloud I found out about the ncbi-taxonomist tool.
We could use it I think.
Would you like to have a look and share any thoughts?
I am not sure how it would work exactly (the ncbi-taxonomist page does not provide very good examples/explanations), but we could give it a try.
Think of a while loop that starts from the end of the taxonomy in each row of the finalTable.tsv file and runs ncbi-taxonomist for each level.
Using each level, we'll run queries searching for an NCBI taxonomy ID, and when we have one we'll have something like this:
Assuming we are looking for Saprospiraceae,
ncbi-taxonomist collect -n 'Saprospiraceae'
would return:
{"taxid":131567,"rank":"no rank","names":{"cellular organisms":"scientific_name"},"parentid":null,"name":"cellular organisms"}
{"taxid":2,"rank":"superkingdom","names":{"Bacteria":"scientific_name"},"parentid":131567,"name":"Bacteria"}
{"taxid":1783270,"rank":"clade","names":{"FCB group":"scientific_name"},"parentid":2,"name":"FCB group"}
{"taxid":68336,"rank":"clade","names":{"Bacteroidetes/Chlorobi group":"scientific_name"},"parentid":1783270,"name":"Bacteroidetes/Chlorobi group"}
{"taxid":976,"rank":"phylum","names":{"Bacteroidetes":"scientific_name"},"parentid":68336,"name":"Bacteroidetes"}
{"taxid":1937959,"rank":"class","names":{"Saprospiria":"scientific_name"},"parentid":976,"name":"Saprospiria"}
{"taxid":1936988,"rank":"order","names":{"Saprospirales":"scientific_name"},"parentid":1937959,"name":"Saprospirales"}
{"taxid":89374,"rank":"family","names":{"Saprospiraceae":"scientific_name","Saprospira group":"Synonym"},"parentid":1936988,"name":"Saprospiraceae"}
So, for example, if you have this classification in the finalTable.tsv:
Main genome;Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Saccharomycotina;Saccharomycetes;Saccharomycetales;Dipodascaceae;Geotrichum
you would search for Geotrichum, and then for Dipodascaceae, and then for Saccharomycetales, etc., and get the last line for each of your searches?
I would search for Geotrichum; if that has a hit, I'd get
If I did not get a hit, I would continue with Dipodascaceae, etc.
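Taking "the last line" of each search could look roughly like the Python sketch below. It assumes ncbi-taxonomist prints one JSON object per line, with the queried taxon last, as in the Saprospiraceae output above. The function name `queried_taxon` is made up for illustration and is not part of ncbi-taxonomist or PEMA.

```python
import json


def queried_taxon(jsonl_text):
    """Return (name, taxid, rank) from the last non-empty JSON line, or None.

    Assumption: the lineage is printed root-first, so the queried taxon is last.
    """
    lines = [ln for ln in jsonl_text.splitlines() if ln.strip()]
    if not lines:
        return None  # no hit: the caller can move on to the parent level
    record = json.loads(lines[-1])
    return record["name"], record["taxid"], record["rank"]


# Last two lines of the Saprospiraceae output quoted in this thread.
sample = """\
{"taxid":1936988,"rank":"order","names":{"Saprospirales":"scientific_name"},"parentid":1937959,"name":"Saprospirales"}
{"taxid":89374,"rank":"family","names":{"Saprospiraceae":"scientific_name","Saprospira group":"Synonym"},"parentid":1936988,"name":"Saprospiraceae"}
"""

print(queried_taxon(sample))  # ('Saprospiraceae', 89374, 'family')
```

An empty result is treated as "no hit", which is the signal to continue with the next level up.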
@cpavloud have a look. Would that be OK?
root@3bbfa77ef486:/mnt/analysis# more extenedFinalTable.tsv
OTU ERR0000001 Classification TAXON:NCBI_TAX_ID
Otu4056 1 Main genome;Bacteria;Patescibacteria;Saccharimonadia;Saccharimonadales Patescibacteria:1783273
If there were no NCBI taxonomy IDs for Saccharimonadia and Saccharimonadales, I think we are fine :)
Exactly! The thing is that there is not always an NCBI taxonomy ID for a name in a reference database. So I thought we could take the taxonomy that was found and work through it one rank at a time, starting from the species level and moving up. I'll add this asap.
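A minimal sketch of that walk-up, in Python, in case it helps the discussion. It is not the PEMA implementation: `deepest_taxid` and the `lookup` callable are hypothetical names, and `lookup` stands in for whatever actually queries NCBI (ncbi-taxonomist or otherwise). The only taxonomy ID used is the Patescibacteria one from the table row shown in this thread.

```python
def deepest_taxid(classification, lookup):
    """Walk a ';'-separated classification from its last level upwards.

    Returns 'name:taxid' for the first level that `lookup` resolves,
    or None if no level has an NCBI taxonomy ID.
    """
    levels = [lvl.strip() for lvl in classification.split(";") if lvl.strip()]
    i = len(levels) - 1
    while i >= 0:
        taxid = lookup(levels[i])  # hypothetical: returns an int or None
        if taxid is not None:
            return f"{levels[i]}:{taxid}"
        i -= 1  # no hit at this rank, try the parent
    return None


# Stand-in lookup table; a real run would query NCBI instead.
known = {"Patescibacteria": 1783273}
row = "Main genome;Bacteria;Patescibacteria;Saccharimonadia;Saccharimonadales"

print(deepest_taxid(row, known.get))  # Patescibacteria:1783273
```

Passing the lookup in as a function keeps the loop independent of the tool used for the query, so it can be tried offline with a dictionary as above.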
Just FYI, here is what you would get if you searched the NCBI taxonomy database for Saccharimonadales and Saccharimonadia
This feature is now ready and will be part of pema:v.2.1.4.
The issue is now resolved.
Re-opening the issue: In case it might be helpful, we can go from the sequence accession number to the NCBI Id: https://www.biostars.org/p/10959/
This is definitely useful for ITS #52
One thing that has been requested is to enhance the final_table.tsv file to include (apart from the columns it already includes) the NCBI Taxon ID for each ASV/OTU and the accession number of the sequence that was its closest match in the database used. The NCBI Taxon ID could then be used as the taxonConceptID when submitting data to GBIF/OBIS using the DwC-A format (as discussed here).
For example, instead of the current final_table.tsv file, which looks like this:
OTU_id,ERR0000008,ERR0000009,Classification
Otu1,1123,2,Eukaryota;Arthropoda;Insecta;Plecoptera;Capniidae;Allocapnia;Allocapnia aurora
Otu2,3,0,Eukaryota;Porifera;Demospongiae;Hadromerida;Polymastiidae;Polymastia;Polymastia littoralis
(Ideally) it could be something like this:
OTU_id,ERR0000008,ERR0000009,Classification,Accession_number,NCBI_Taxon_ID
Otu1,1123,2,Eukaryota;Arthropoda;Insecta;Plecoptera;Capniidae;Allocapnia;Allocapnia aurora,JN200445,608846
Otu2,3,0,Eukaryota;Porifera;Demospongiae;Hadromerida;Polymastiidae;Polymastia;Polymastia littoralis,NC_023834,1473587
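To make the requested layout concrete, here is a rough Python sketch that appends the two columns to a comma-separated table like the one above. It assumes the accession number and taxon ID per OTU are already known; `extend_table` and the `annotations` mapping are hypothetical names, and the values are the ones from the example in this comment.

```python
import csv
import io


def extend_table(csv_text, annotations):
    """Append Accession_number and NCBI_Taxon_ID columns to every row.

    `annotations` maps OTU_id -> (accession, taxid); OTUs without an
    entry get two empty cells.
    """
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader)
    writer.writerow(header + ["Accession_number", "NCBI_Taxon_ID"])
    for row in reader:
        accession, taxid = annotations.get(row[0], ("", ""))
        writer.writerow(row + [accession, taxid])
    return out.getvalue()


table = (
    "OTU_id,ERR0000008,ERR0000009,Classification\n"
    "Otu1,1123,2,Eukaryota;Arthropoda;Insecta;Plecoptera;Capniidae;Allocapnia;Allocapnia aurora\n"
    "Otu2,3,0,Eukaryota;Porifera;Demospongiae;Hadromerida;Polymastiidae;Polymastia;Polymastia littoralis\n"
)
annotations = {"Otu1": ("JN200445", 608846), "Otu2": ("NC_023834", 1473587)}

print(extend_table(table, annotations))
```

Leaving the cells empty when nothing is known keeps the column count stable, which matters if the file is later mapped to DwC-A fields.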
If it is not possible to retrieve the accession number and/or the NCBI taxon ID, I think we can find some workarounds. Perhaps it will be possible to retrieve the NCBI Taxon ID using the Bio.Entrez package.
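A possible shape for the Bio.Entrez route, as an untested-against-PEMA sketch. It assumes Biopython is installed and that network access to NCBI is available; `taxid_for_name` and `first_taxid` are hypothetical helper names. `Entrez.esearch` and `Entrez.read` are the Biopython calls used, and a name search can return zero or several IDs, so taking the first one is a simplification.

```python
def first_taxid(search_result):
    """Return the first ID of an esearch result as an int, or None if empty."""
    ids = search_result.get("IdList", [])
    return int(ids[0]) if ids else None


def taxid_for_name(name, email):
    """Search the NCBI taxonomy database for a name (requires network access)."""
    # Imported here so first_taxid stays usable without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="taxonomy", term=name)
    try:
        result = Entrez.read(handle)
    finally:
        handle.close()
    return first_taxid(result)


# Offline check of the parsing step with a hand-written result dictionary.
print(first_taxid({"IdList": ["608846"]}))  # 608846
print(first_taxid({"IdList": []}))          # None
```

A call such as `taxid_for_name("Allocapnia aurora", "you@example.org")` would then go to NCBI; the email address here is a placeholder.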