mhoban / rainbow_bridge

GNU General Public License v3.0
5 stars 2 forks source link

R taxonomy "dropped" vs. "", synthetic constructs #25

Closed cajwalsh closed 2 months ago

cajwalsh commented 1 year ago

The first time I used R taxonomy I noticed some different "dropped" patterns than the standard eDNAFlow typically provides. Here is a summary of taxonomic assignment at all levels with R taxonomy: 23 of 285 total zOTUs dropped at domain level 13 of 285 total zOTUs dropped at kingdom level 45 of 285 total zOTUs dropped at phylum level 40 of 285 total zOTUs dropped at class level 31 of 285 total zOTUs dropped at order level 33 of 285 total zOTUs dropped at family level 51 of 285 total zOTUs dropped at genus level 124 of 285 total zOTUs dropped at species level

vs. the same data with the python script: 0 of 256 total zOTUs dropped at domain level 3 of 256 total zOTUs dropped at kingdom level 16 of 256 total zOTUs dropped at phylum level 17 of 256 total zOTUs dropped at class level 17 of 256 total zOTUs dropped at order level 28 of 256 total zOTUs dropped at family level 50 of 256 total zOTUs dropped at genus level 99 of 256 total zOTUs dropped at species level

The first thing I noticed was the greater number of zOTUs, yet the majority of them dropped at domain. I found in the blast results file that there were several hits that were "honestly" dropped at domain, as their main match was to "eukaryotic synthetic construct" that was also a strong match to Homo sapiens. I am guessing these hits should be filtered out to only match real living organisms, perhaps by only keeping results that belong to a domain.

The other thing I noticed is that some zOTUs whose taxonomic assignments were dropped at the same level might have inconsistent labelling as dropped. E.g.

Zotu4690 Bacteria dropped

vs.

Zotu4889 Bacteria dropped dropped

where the expected behavior would be to include "dropped" at all levels below the collapsed taxonomic level and not have blanks, as with:

Zotu103 Eukaryota Metazoa Chordata Aves Anseriformes Anatidae dropped dropped

I suspect this is what caused the non-cumulative number of "dropped" taxa. As far as I can tell, the blanks/non-cumulative "dropped" count only occurred in non-eukaryote hits. I know there are several taxa that do not have completely filled in (i.e. non-blank) taxonomic rankings (e.g. several algae like Haptophyta, or even green sea turtle), but the previous version always "accumulated" dropped taxa with increasing specificity so I suspect this goes beyond that issue. Let me know if you'd like copies of any files relevant to this issue.

cajwalsh commented 1 year ago

I now see that the post formatting removes relevant white spaces, but the "dropped" for 4690 was only at the species level with all others blank after domain, while the 4889 had "dropped" at order and species but was blank at all others but domain

mhoban commented 1 year ago

A lot of the blanks are just like that in their taxonomy and have nothing to do with either script. But I've made some edits to the R script and it should return mostly identical results to the python script now, with the one exception that it filters out more "uncultured", "synthetic", etc. type sequences (the python script is very specific about what it gets rid of while my approach is a little more broad. You're now also able to run the collapser script as a standalone thing (like how it was originally set up in the old version of the pipeline). It occurred to me that this might still be useful.

Anyway this stuff should be fixed as of 3b46031157d407 but I'll leave the issue open until it's had the chance to be tested a little bit.

cajwalsh commented 1 year ago

Just to leave traces of our conversation on this topic the other day, the reason I think "" should be different from "dropped" is because some zOTUs whose taxonomy wasn't collapsed contain "" in the database, e.g. many algae at kingdom rank and even green sea turtle somewhere (I forget where).

The other thing I mentioned was that I think having the option to remove the "uncultured" type sequences only when there is other fully identified ones may be useful. As you showed, 4 100% hits to a species and 1 100% hit to an "uncultured dinoflagellate" or something collapses the taxonomy to a much higher level, so it's useful to get rid of that. However, if only "uncultured dinoflagellate" hits occur for a given zOTU, it may be worth keeping those hits because someone has identified it to some extent, vs. eliminating that zOTU from blast taxonomy results altogether.

cajwalsh commented 9 months ago

I deal with this by filling in "dropped" and "unassigned" as different things here: https://github.com/cajwalsh/eDNAnalyze/blob/main/R/fill_blank_taxa.R

mhoban commented 9 months ago

yikes, look at those for loops! (j/k j/k)

mhoban commented 2 months ago

I think we've resolved this, with "dropped" definitely meaning dropped (and customizable from the command-line) and blank meaning nonexistent taxonomy at that level.