collapse-taxonomy not working with pid = 100, diff = 0

cajwalsh commented 1 month ago

Part of the pre-processing I do for Molly/Pristine Seas is running the taxonomy collapse process a few times with different thresholds in a roundabout way to report the highest pident of any blast hit for a zOTU. I normally start with pid = 100, diff = 0 to show which had 100 and only keep 100% matches for zOTUs that had them (Molly's choice, understanding the risks). I noticed in my most recent run this returned no assignments (even though some zOTUs only had 100% matches in the blast file). My only thought so far was that this could be due to the changes on August 15th on lines 210-214 ish. While I have done runs since then, I haven't done this taxonomy progression for my own analyses so haven't used the 100, 0 conditions.

In any case, it looks like the issue you added a few hours ago (presumably because Molly also talked about this information at her presentation I wasn't able to attend today) could eliminate the riskier need for this progression. My suggestion would be to include the max pid (after qcov filtering) for a zOTU rather than the average, or at least only the average for the blast hits used to assign taxonomy (ie within the diff of the highest match above minimum pid). I think the second (average of retained hits) is what you suggest in #105.

Thanks!

mhoban commented 1 month ago

But in all seriousness, what is the expected outcome when using these parameters? According to the documentation (and the implementation):

Setting diff means that if the absolute value of the difference in pid between two BLAST results is < diff, then a species level taxonomy will be returned.

Since it's not possible to have an absolute value less than zero, it's not possible to get percent ID differences less than zero. And when you include --pid 100, all percent ID differences will by definition be zero (because all %ID values will be 100).

I could change the comparison to diff <= diff_thresh if you think that will affect the outcome in a meaningful way, but --diff 1 --pid 100 effectively does what I think you're expecting to happen with --diff 0 --pid 100.

Alternatively you could set diff to something small like 0.0001.

The suggestion in #105 to add average %id actually came independent of Molly's presentation, it just seemed useful. And the idea was to use the average of the blast results that survived filtering criteria and were (as you say) used to decide the final taxonomic assignment. If max pid is useful, I can easily add that as well!

The updates to collapse_taxonomy.R came in response to a few recent issues submitted by @cbird808 (#97,#98,#99).

cbird808 commented 1 month ago

I'll chime in here, since this script (collapse_taxonmy.R) is sortof fresh in my head, at least the last version of it. I'm looking at the newest version on github as I write this.

@cajwalsh, the filter on line 159 will remove records if their qcov is less than the threshold setting regardless of pident threshold. This is the only line where the pid (pident_threshold) is applied. The diff_thresh filter is applied on line 214 and it will remove any record with a pident < pid. The calculation of the diff for each record occurs on line 212, and looks good to me. So, either qcov is causing the records to be tossed, or they are getting removed at another point. If you think it's the latter, provide a the blast file for a single zotu that is erroneously being removed. See blast_result_Zotu66.tsv.txt for an example.

@mhoban The diff setting determines whether a record is retained for the final taxonomic assignment. If set to zero, it will typically return 1 record per zotu, and then will assign a species regardless of other settings. With diff = 0 and aggressive pident and qcov filters (100,100), a species level assignment is more likely to be valid. If all that is the case, and the queried sequence is long enough to uniquely identify species and there's no mito introgression (assuming coi), then the sp level assignment is valid. With more lenient qcov and pident settings, a diff of zero will likely return many erroneous species level assignments. As the e value rises (qcov and pident falls), a wider diff will result in a more accurate taxonomic assignment.

Finally, @cajwalsh, in terms of running this over and over again with different settings, would it not be easier to just run it once with the best settings, then right join the blast results to the zotu table and filter by the parameter of your choice? Say "hi" to Molly for me

cbird808 commented 1 month ago

@mhoban now that I've introduced a troubleshooting data file and @cajwalsh is probably about to submit a seond, what are your thoughts on setting up unit testing?

mhoban commented 1 month ago

@cbird808 it's probably a good idea! There is a test suite for nextflow that works with github actions. I can look into it.

cajwalsh commented 1 month ago

Hi again,

In this case yes I was just asking for something to only return the 100% matches (and collapse if there are multiple) so setting min pid to 100% (with default diff 1) would work. I hadn't thought of that last night but did just want to inquire about the functionality/possibility of diff being set to 0 though as the blank output file did not seem ideal or intuitive for other users perhaps (if for no other reason than it returned something different than before, my bad for using it wrong/not understanding the new docs clearly). And in this case yes the risk of only keeping 100% matches for zOTUs that had them was understood and discussed in a few meetings probably over a year ago. Again I chose not to do this when running stuff for my own studies/purposes but agreed to do it for the other data reporting folks asked for. I'd say it's a relic of how we wanted to look at/manually filter zOTUs/taxa two years ago when all of this pipeline and automated filtering was different.

@cbird808 It is/was true that running it multiple times was not actually necessary (besides classifying those with and without 100% matches separately as I was asked to do), although running it four times (100 0; 99 0.9999, 98 0.9999, 97 1) before didn't really take a whole lot of time (each takes like a few seconds) and it was easier in my hacky brain to write a quick R function to merge them all and keep only best at the end. From what I understand with the new version though, the 0.9999 is no longer necessary (or maybe it wasn't and I was superstitious) and keeping it at 1 in those cases would be fine.

So in the end, my point in my second paragraph was more that implementing #105 would eliminate my desire to run repeated taxonomy collapsing. Sorry I assumed it might've come up yesterday, all data given to Molly has a column with 100, 99, 98, or 97 min pid level it was classified using that I have added by doing the process mentioned in paragraph above by one of my R package functions. The topic of my issue is solved (to consider zOTUs with 100% matches separately, should just be done with just pid 100 and not diff 0 as well, although diff 0 worked before as well).

Sorry for the long issue I guess lol (I have definitely made longer ones before but I guess just not as stupid apparently)

mhoban commented 1 month ago

Sorry for the long issue I guess lol (I have definitely made longer ones before but I guess just not as stupid apparently)

Not stupid! I just looked at this early in the morning and my eyes blurred a little. It’s a good point and I think you’re right that just returning nothing is maybe not the best behavior.

mhoban commented 1 month ago

What I'm going to do for now is have the pipeline throw an error if you give a value to --lca-diff that is less than or equal to zero.

mhoban / rainbow_bridge

collapse-taxonomy not working with pid = 100, diff = 0 #106