hariszaf / pema

PEMA: a flexible Pipeline for Environmental DNA Metabarcoding Analysis of the 16S/18S rRNA, ITS and COI marker genes

provide pema main data product in a 7-level taxonomy format #52

Open hariszaf opened 1 year ago

hariszaf commented 1 year ago

It would be super useful to return the main PEMA output (the OTU/ASV table) in a 7-level taxonomy format, meaning every taxonomy assignment looks like this:

d__Bacteria; p__Abyssubacteria; c__SURF-5; o__SURF-5; f__SURF-5; g__SURF-5; s__SURF-5 sp003598085
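
To make the target format concrete, here is a minimal Python sketch that parses and rebuilds a 7-level lineage string like the one above. The function names and the choice to leave missing ranks empty (e.g. `g__`) are assumptions for illustration, not part of PEMA.

```python
# Rank prefixes of the 7-level format, in order:
# domain, phylum, class, order, family, genus, species.
RANK_PREFIXES = ("d", "p", "c", "o", "f", "g", "s")


def parse_seven_level(lineage):
    """Split a 7-level lineage string into a {prefix: name} dict.

    Raises ValueError if the string does not have exactly seven
    ranks in the expected order.
    """
    parts = [part.strip() for part in lineage.split(";")]
    if len(parts) != len(RANK_PREFIXES):
        raise ValueError(f"expected 7 ranks, got {len(parts)}")
    ranks = {}
    for prefix, part in zip(RANK_PREFIXES, parts):
        marker = prefix + "__"
        if not part.startswith(marker):
            raise ValueError(f"expected rank {marker!r}, got {part!r}")
        ranks[prefix] = part[len(marker):]
    return ranks


def to_seven_level(ranks):
    """Build a 7-level lineage string from a {prefix: name} dict.

    Assumption: ranks that are missing from the dict are written
    with an empty name (e.g. 'g__') rather than dropped.
    """
    return "; ".join(f"{p}__{ranks.get(p, '')}" for p in RANK_PREFIXES)
```

For example, `parse_seven_level` applied to the lineage above gives `"SURF-5 sp003598085"` under the `"s"` key, and passing the result back through `to_seven_level` reproduces the original string.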

hariszaf commented 1 year ago

Regarding COI, this is now covered under #56: the outputs are already in the required 7-level format.

Regarding 16S, we are still waiting for the Silva update. However, we have been waiting for a while and are getting a bit fed up, so it would be useful to do this ourselves. For advice on how to do this (and whether it is feasible), consult @hariszaf and @cpavloud.

hariszaf commented 1 year ago

Regarding the ITS gene and the UNITE database: one thing you could do is download the General FASTA release file and extract the sequence IDs from its headers.

For example: >Glomeraceae|AM076560|SH146432.05FU|refs|k__Fungi;p__Glomeromycota;c__Glomeromycetes;o__Glomerales;f__Glomeraceae;g__;s__uncultured_Glomus

Here, AM076560 is the sequence ID.
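
A rough Python sketch of pulling the sequence ID out of a header like the one above. It assumes every header has exactly five `|`-separated fields in the order shown in the example; the function name and the dictionary keys are made up for illustration.

```python
def parse_unite_header(header):
    """Split a UNITE General FASTA header into its parts.

    Assumption: the header has five '|'-separated fields, as in the
    example: name, sequence ID, SH code, reference type, lineage.
    """
    fields = header.lstrip(">").strip().split("|")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    name, sequence_id, sh_code, ref_type, lineage = fields

    # The lineage is ';'-separated, each rank written as prefix__name
    # (k, p, c, o, f, g, s); an unassigned rank has an empty name.
    taxonomy = {}
    for item in lineage.split(";"):
        prefix, _, value = item.partition("__")
        taxonomy[prefix] = value

    return {
        "name": name,
        "sequence_id": sequence_id,
        "sh_code": sh_code,
        "ref_type": ref_type,
        "taxonomy": taxonomy,
    }
```

Note that the UNITE lineage starts at `k__` (kingdom) while the requested format starts at `d__`, so some mapping of that first rank would still have to be decided.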

Using that, you can look up the organism it comes from at NCBI (https://www.ncbi.nlm.nih.gov/nuccore/AM076560) and, from there, its NCBI taxonomy ID.
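
One way this lookup could be scripted, as a sketch only: fetch the GenBank flat file for the accession through the NCBI E-utilities `efetch` endpoint and read the taxonomy ID from the `/db_xref="taxon:..."` qualifier of the source feature. The helper names are hypothetical, and this is a swapped-in approach rather than anything PEMA does today.

```python
import re
import urllib.parse
import urllib.request

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession):
    """Build the efetch URL for a nuccore record in GenBank text format."""
    query = urllib.parse.urlencode(
        {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def parse_taxid(genbank_text):
    """Return the first taxon ID found in a GenBank flat file, or None."""
    match = re.search(r'/db_xref="taxon:(\d+)"', genbank_text)
    return int(match.group(1)) if match else None


def fetch_taxid(accession):
    """Download the record for one accession and extract its taxonomy ID.

    Needs network access; no retries or rate limiting are done here.
    """
    with urllib.request.urlopen(efetch_url(accession)) as response:
        return parse_taxid(response.read().decode("utf-8"))
```

Calling `fetch_taxid("AM076560")` would do the lookup for the example sequence. Running this over a whole UNITE release would need requests to be batched or throttled to stay within NCBI's usage limits.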

kmexter commented 1 year ago

There may be some interplay with https://github.com/hariszaf/pema/issues/29 here.