hariszaf / pema

PEMA: a flexible Pipeline for Environmental DNA Metabarcoding Analysis of the 16S/18S rRNA, ITS and COI marker genes

NCBI Taxon ID included in the final_table.tsv file? #29

Open cpavloud opened 2 years ago

cpavloud commented 2 years ago

One thing that has been requested is to enhance the final_table.tsv file so that, in addition to its current columns, it includes the NCBI Taxon ID for each ASV/OTU and the accession number of the sequence that was its closest match in the database used. The NCBI Taxon ID could then be used as the taxonConceptID when submitting data to GBIF/OBIS in the DwC-A format (as discussed here).

For example, instead of the current final_table.tsv file, which looks like this:

OTU_id,ERR0000008,ERR0000009,Classification
Otu1,1123,2,Eukaryota;Arthropoda;Insecta;Plecoptera;Capniidae;Allocapnia;Allocapnia aurora
Otu2,3,0,Eukaryota;Porifera;Demospongiae;Hadromerida;Polymastiidae;Polymastia;Polymastia littoralis

ideally it could be something like this:

OTU_id,ERR0000008,ERR0000009,Classification,Accession_number,NCBI_Taxon_ID
Otu1,1123,2,Eukaryota;Arthropoda;Insecta;Plecoptera;Capniidae;Allocapnia;Allocapnia aurora,JN200445,608846
Otu2,3,0,Eukaryota;Porifera;Demospongiae;Hadromerida;Polymastiidae;Polymastia;Polymastia littoralis,NC_023834,1473587

If it is not possible to retrieve the accession number and/or the NCBI taxon ID, I think we can find some workarounds. Perhaps it will be possible to retrieve the NCBI Taxon ID using the Bio.Entrez package
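
A minimal sketch of the Bio.Entrez workaround, assuming Biopython is installed and NCBI is reachable over the network. The helper names (`taxonomy_term`, `first_taxid`, `lookup_taxid`) and the e-mail argument are hypothetical; only `Entrez.esearch` and `Entrez.read` come from Biopython.

```python
def taxonomy_term(name):
    """Build an Entrez query restricted to the scientific-name field."""
    return f"{name}[Scientific Name]"


def first_taxid(record):
    """Return the first ID of a parsed esearch record, or None if there was no hit."""
    ids = record.get("IdList", [])
    return int(ids[0]) if ids else None


def lookup_taxid(name, email):
    """Query the NCBI Taxonomy database for `name` (needs Biopython and network access)."""
    # Imported here so the pure helpers above can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="taxonomy", term=taxonomy_term(name))
    try:
        record = Entrez.read(handle)
    finally:
        handle.close()
    return first_taxid(record)
```

A call such as `lookup_taxid("Allocapnia aurora", "you@example.org")` would return an integer ID, or `None` when the name is unknown to NCBI.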

hariszaf commented 2 years ago

@cpavloud I found out about the ncbi-taxonomist tool.

We could use it I think.

Would you like to have a look and share any thoughts?

cpavloud commented 2 years ago

I am not sure how it would work exactly (the ncbi-taxonomist page does not provide very good examples/explanations), but we could give it a try.

hariszaf commented 2 years ago

Think of a while loop that starts from the end of the taxonomy in each row of the finalTable.tsv file and runs ncbi-taxonomist for each level in turn. At each level we query for an NCBI taxonomy ID, and once we have one we'll get something like this:

Assuming we are looking for Saprospiraceae,

ncbi-taxonomist collect -n 'Saprospiraceae'

would return:

{"taxid":131567,"rank":"no rank","names":{"cellular organisms":"scientific_name"},"parentid":null,"name":"cellular organisms"}
{"taxid":2,"rank":"superkingdom","names":{"Bacteria":"scientific_name"},"parentid":131567,"name":"Bacteria"}
{"taxid":1783270,"rank":"clade","names":{"FCB group":"scientific_name"},"parentid":2,"name":"FCB group"}
{"taxid":68336,"rank":"clade","names":{"Bacteroidetes/Chlorobi group":"scientific_name"},"parentid":1783270,"name":"Bacteroidetes/Chlorobi group"}
{"taxid":976,"rank":"phylum","names":{"Bacteroidetes":"scientific_name"},"parentid":68336,"name":"Bacteroidetes"}
{"taxid":1937959,"rank":"class","names":{"Saprospiria":"scientific_name"},"parentid":976,"name":"Saprospiria"}
{"taxid":1936988,"rank":"order","names":{"Saprospirales":"scientific_name"},"parentid":1937959,"name":"Saprospirales"}
{"taxid":89374,"rank":"family","names":{"Saprospiraceae":"scientific_name","Saprospira group":"Synonym"},"parentid":1936988,"name":"Saprospiraceae"}
cpavloud commented 2 years ago

So, for example, if you have this classification in the finalTable.tsv:

Main genome;Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Saccharomycotina;Saccharomycetes;Saccharomycetales;Dipodascaceae;Geotrichum

you would search for Geotrichum, then for Dipodascaceae, then for Saccharomycetales, and so on, and take the last line of each search result?

hariszaf commented 2 years ago

I would search for Geotrichum first; if that has a hit, I'd take its NCBI taxonomy ID.

If I did not get a hit, I would continue with Dipodascaceae, and so on.
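
The fallback described here can be sketched as a loop over the classification from its deepest level upward. This is only an illustration: `deepest_taxid` is a hypothetical name, and `lookup` stands for whatever query is used in practice (ncbi-taxonomist, Bio.Entrez, ...). The stand-in dictionary holds the single name/ID pair from the extended table example in this thread.

```python
def deepest_taxid(classification, lookup):
    """Walk a ';'-separated classification from the deepest level upward.

    `lookup` is any callable mapping a taxon name to an NCBI taxonomy ID or None.
    Returns (name, taxid) for the first level with a hit, or (None, None).
    """
    levels = [level.strip() for level in classification.split(";") if level.strip()]
    for name in reversed(levels):
        taxid = lookup(name)
        if taxid is not None:
            return name, taxid
    return None, None


# Stand-in for a real query: only this one name is "known".
known = {"Patescibacteria": 1783273}

row = "Main genome;Bacteria;Patescibacteria;Saccharimonadia;Saccharimonadales"
print(deepest_taxid(row, known.get))  # ('Patescibacteria', 1783273)
```

Passing the lookup as a callable keeps the walk independent of the tool that answers the queries.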

hariszaf commented 2 years ago

@cpavloud have a look. Would that be OK?

root@3bbfa77ef486:/mnt/analysis# more extenedFinalTable.tsv 
OTU       ERR0000001  Classification  TAXON:NCBI_TAX_ID
Otu4056   1   Main genome;Bacteria;Patescibacteria;Saccharimonadia;Saccharimonadales  Patescibacteria:1783273

cpavloud commented 2 years ago

If there were no NCBI taxonomy IDs for Saccharimonadia and Saccharimonadales, I think we are fine :)

hariszaf commented 2 years ago

Exactly! The thing is that a name in a reference database does not always have an NCBI taxonomy ID. So I thought we could walk up the assigned taxonomy one rank at a time, starting from the species level, until we find one. I'll add this ASAP.
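
For illustration, a sketch of how the extra `TAXON:NCBI_TAX_ID` column could be appended to a tab-separated table. It is not PEMA's actual implementation: `extend_table` is a hypothetical name, `lookup` is any callable returning an ID or `None`, and the column is left empty when no level has a hit.

```python
import csv
import io


def extend_table(tsv_text, lookup):
    """Append a TAXON:NCBI_TAX_ID column to a tab-separated OTU table."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")

    header = next(reader)
    col = header.index("Classification")
    writer.writerow(header + ["TAXON:NCBI_TAX_ID"])

    for row in reader:
        if not row:  # skip blank lines
            continue
        label = ""
        # Try the deepest level first, then move up one rank at a time.
        for name in reversed(row[col].split(";")):
            name = name.strip()
            taxid = lookup(name)
            if taxid is not None:
                label = f"{name}:{taxid}"
                break
        writer.writerow(row + [label])
    return out.getvalue()
```

With the row from the example above and a lookup that only knows Patescibacteria, the added column reads `Patescibacteria:1783273`.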

hariszaf commented 2 years ago

Just FYI, here is what you get if you search the NCBI Taxonomy database for Saccharimonadales

image

and Saccharimonadia

image

hariszaf commented 2 years ago

This feature is now ready and will be part of pema:v.2.1.4.

The issue is now resolved.

cpavloud commented 1 year ago

Re-opening the issue: In case it might be helpful, we can go from the sequence accession number to the NCBI Id: https://www.biostars.org/p/10959/
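
One possible way to do this with Bio.Entrez, not necessarily the approach of the linked thread: ask `Entrez.esummary` for the nucleotide record and read its `TaxId` field. This sketch assumes Biopython, network access, and that the summary exposes `TaxId` for the accession in question; the helper names and the e-mail argument are hypothetical.

```python
def taxid_from_summary(summaries):
    """Extract TaxId from a parsed esummary result (a list of document summaries)."""
    if not summaries:
        return None
    return int(summaries[0]["TaxId"])


def accession_to_taxid(accession, email):
    """Look up the NCBI taxonomy ID attached to a nucleotide accession."""
    # Imported here so the pure helper above can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esummary(db="nucleotide", id=accession)
    try:
        summaries = Entrez.read(handle)
    finally:
        handle.close()
    return taxid_from_summary(summaries)
```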

hariszaf commented 1 year ago

This is definitely useful for ITS #52