hariszaf / pema

PEMA: a flexible Pipeline for Environmental DNA Metabarcoding Analysis of the 16S/18S rRNA, ITS and COI marker genes

MIDORI updates #56

Open kmexter opened 1 year ago

kmexter commented 1 year ago

Following recent emails with the MIDORI developers, it seems wise to update PEMA to use the MIDORI database from where it is now published. Hopefully this will solve two issues we have had: (1) gaps in the taxonomic classification output where taxon nodes are missing, and (2) some errors and discrepancies in the classifications with respect to NCBI.

Copy of the emails (latest to first):

Sorry to say that we are no longer updating the databases on the "MIDORI server". We are updating only the databases you can download from here: http://www.reference-midori.info/download.php#

Hi Christina,

Thank you for your email. I think PEMA is using an old MIDORI database. I fixed this problem quite a long time ago. In all formats, except RAW files, we have inserted missing taxonomy by creating it from a lower taxonomic ranking. In the following example, the description at class level was missing, so it was created from order level:

>JF502242.1.7041.7724 root_1;Eukaryota_2759;Chordata_7711;class_Crocodylia_1294634;Crocodylia_1294634;Crocodylidae_8493;Crocodylus_8500;Crocodylus intermedius_184240

Would it be possible for you to download the recent databases from our site and perform the taxonomic assignment locally? We are using NCBI taxonomy for all MIDORI databases. I think those inconsistencies are happening because PEMA is using an old database (NCBI taxonomy has been consistently revised). If you have further questions, please write back to me.

Best regards,
Ryuji

Dear Dr Machida,

My name is Christina Pavloudi and I am a Post Doctoral Researcher at the CNRS. In my previous Post Doc position, I was working for the ARMS-MBON project (my colleagues are in CC), where we were sequencing ARMS samples for COI (among other genes) and we were using PEMA for the analyses of the results. PEMA is using MIDORI for the taxonomic assignment of COI reads, hence I am contacting you regarding an issue we came across.

At the moment, the MIDORI output does not always have the same number of columns, i.e. the same number of taxonomic levels, for all the assignments. You can see an example in the attached file ("Example_species_notall.tsv"). For some assignments, the output has all the 8 levels: root, superkingdom, phylum, class, order, family, genus, species (see attached file "Example_species_alllevels.tsv"). It would be extremely helpful, in terms of FAIRness for the ARMS-MBON project, if the MIDORI output was consistent and always contained the 8 levels, even if some columns were empty (see attached "Example_species_emptylevels.tsv"). Do you perhaps consider doing something like this for future versions of MIDORI?

Also, could I ask which taxonomy you are using in MIDORI? Because, as you can see in "Example_species_emptylevels_completed.tsv", for some of the species in question the missing taxonomic levels do exist (if we check at WoRMS, but also at the NCBI Taxonomy). Also, some of them are different from the output that is produced by MIDORI.

hariszaf commented 1 year ago

Steps

  1. Make sure we can use MIDORI2_LONGEST_NUC_GB255_CO1_RDP.fasta from MIDORI 2, which is based on GenBank release 255. Its header lines start with > and include the taxonomy, for example: root_1;Eukaryota_2759;Discosea_555280;Flabellinia_1485085;order_Vannellidae_95227;Vannellidae_95227;Vannella_95228;Vannella danica_703018. The number after the last _ of each entry is the NCBI Taxonomy ID of the corresponding taxonomic level.

  2. Once you have confirmed which file to use, you need to train the RDPClassifier on it. To do so, follow the instructions you'll find here.
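As a rough aid for step 1, the sketch below parses headers of the kind quoted in this thread and reports any that do not carry all 8 levels. It assumes the rank order root, superkingdom, phylum, class, order, family, genus, species (taken from the email above) and assumes that a filled-in level is always prefixed with the rank at its own position (as in `order_Vannellidae_95227`). The function names `parse_header` and `find_incomplete_headers` are hypothetical and not part of PEMA or MIDORI.

```python
# Assumed rank order, taken from the 8 levels listed in the email above.
RANKS = ["root", "superkingdom", "phylum", "class",
         "order", "family", "genus", "species"]


def parse_header(header):
    """Parse '>ACCESSION lineage' into (accession, list of level dicts).

    Raises ValueError if the lineage does not have exactly len(RANKS) levels.
    """
    # Species names contain spaces, so split only on the first whitespace.
    accession, lineage = header.lstrip(">").strip().split(maxsplit=1)
    tokens = lineage.split(";")
    if len(tokens) != len(RANKS):
        raise ValueError(f"{accession}: expected {len(RANKS)} levels, got {len(tokens)}")
    levels = []
    for rank, token in zip(RANKS, tokens):
        # The NCBI Taxonomy ID follows the last underscore of each entry.
        name, taxid = token.rsplit("_", 1)
        # Assumption: a level created from a lower rank is prefixed '<rank>_'.
        filled_in = name.startswith(rank + "_")
        if filled_in:
            name = name[len(rank) + 1:]
        levels.append({"rank": rank, "name": name,
                       "taxid": int(taxid), "filled_in": filled_in})
    return accession, levels


def find_incomplete_headers(lines):
    """Return the header lines that do not parse into all len(RANKS) levels."""
    bad = []
    for line in lines:
        if line.startswith(">"):
            try:
                parse_header(line)
            except ValueError:
                bad.append(line.strip())
    return bad


example = (">JF502242.1.7041.7724 root_1;Eukaryota_2759;Chordata_7711;"
           "class_Crocodylia_1294634;Crocodylia_1294634;Crocodylidae_8493;"
           "Crocodylus_8500;Crocodylus intermedius_184240")
accession, levels = parse_header(example)
```

Running `find_incomplete_headers` over the lines of the FASTA file before training would show whether the new database still contains lineages of varying depth.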

kmexter commented 1 year ago

Note that the output file format (the finalTable.tsv and the extendedFinalTable) will change as a consequence. This will need to be looked at, since the same table is output when other reference DBs are used (UNITE and Silva), and we don't want a different output format just because some of the internal parameters change. Once this update has been done, therefore, @kmexter, @cpavloud, and @hariszaf can help look at the results and figure out how to create the best finalTable and extendedFinalTable (as well as perhaps a few other output files).
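One way to keep the table shape stable could be to always emit the same 8 taxonomy columns, whatever the lineage looks like internally. The sketch below is only an illustration under assumptions: the column names and the helpers `lineage_to_row` and `write_table` are hypothetical, the real finalTable.tsv layout is not reproduced here, and whether filled-in levels should appear as names or as empty cells is a decision for the discussion above.

```python
import csv
import io

# Assumed fixed column order, matching the 8 levels named in the emails.
RANKS = ["root", "superkingdom", "phylum", "class",
         "order", "family", "genus", "species"]


def lineage_to_row(lineage, blank_placeholders=False):
    """Turn a ';'-separated MIDORI2-style lineage into exactly 8 name cells."""
    tokens = lineage.split(";")
    if len(tokens) != len(RANKS):
        raise ValueError(f"expected {len(RANKS)} levels, got {len(tokens)}")
    row = []
    for rank, token in zip(RANKS, tokens):
        name = token.rsplit("_", 1)[0]  # drop the trailing NCBI Taxonomy ID
        # Assumption: levels created from a lower rank are prefixed '<rank>_'.
        if name.startswith(rank + "_"):
            name = "" if blank_placeholders else name[len(rank) + 1:]
        row.append(name)
    return row


def write_table(lineages, handle, blank_placeholders=False):
    """Write a header line plus one fixed-width row per lineage as TSV."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(RANKS)
    for lineage in lineages:
        writer.writerow(lineage_to_row(lineage, blank_placeholders))


lineage = ("root_1;Eukaryota_2759;Discosea_555280;Flabellinia_1485085;"
           "order_Vannellidae_95227;Vannellidae_95227;Vannella_95228;"
           "Vannella danica_703018")
buffer = io.StringIO()
write_table([lineage], buffer, blank_placeholders=True)
table_text = buffer.getvalue()
```

With `blank_placeholders=True` the filled-in order level becomes an empty cell, which resembles the "Example_species_emptylevels.tsv" layout requested in the email; with the default it keeps the name MIDORI derived from the lower rank.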

There will also be some interplay between this issue and https://github.com/hariszaf/pema/issues/52 and https://github.com/hariszaf/pema/issues/29, so these should all be considered together before any work starts.