mhoban / rainbow_bridge

GNU General Public License v3.0
5 stars 2 forks source link

Overly stringent filtering in `collapse_taxonomy.R` #97

Closed cbird808 closed 1 month ago

cbird808 commented 1 month ago

Hi @mhoban ,

I believe that the following function in collapse_taxonomy.R results many fewer records than intended.

# for multiple seqid matches, keep just the best one
best_score <- function(x,...) {
  x %>%
    group_by(...) %>%
    filter(
      pident == max(pident), 
      evalue == min(evalue),
      bitscore == max(bitscore),
      mismatch == min(mismatch)
    ) %>%
    arrange(desc(pident),evalue,desc(bitscore),mismatch) %>%
    slice(1) %>%
    ungroup()
}

The commas separating the filtering criteria in dplyr::filter() signify AND (&). Consequently, any record that does not simultaneously have the highest pident, lowest evalue, highest bitscore, and lowest mismatch is removed. This can and does result in the elimination of taxa from consideration for a zotu when run with this line: best_score(zotu,taxid) %>%

I believe the intended behavior is to select the highest pident, then break ties with evalue, and so on:

# CEB for multiple seqid matches, keep just the best one
best_score <- function(x,...) {
  x %>%
    group_by(...) %>%
    filter(pident == max(pident)) %>%  #CEB
    filter(evalue == min(evalue)) %>%  #CEB
    filter(bitscore == max(bitscore)) %>%  #CEB
    filter(mismatch == min(mismatch)) %>%  #CEB
    arrange(desc(pident),evalue,desc(bitscore),mismatch) %>%   #CEB
    slice(1) %>%
    ungroup()
}

This change will reduce the number of overly optimistic false positive classifications of zotu to lca.

Perhaps more equivocally, I think that evalue should have 1st priority over pident because it is more holistic. An alternative to the hierarchical prioritization of these 4 metrics would be to just keep the best escore, pident, bitscore, mismatch (and qcov) for a given taxid across all seqid. As an example, if you have a zotu*taxid with 2 seqids that yield the following (qcov,pident) pairs: 99,100 and 100,99, you could save 100,100.

And one last related thing to this section of the code, I believe that best_score(zotu,seqid) %>% never filters any records. I guess it's not hurting anything. Perhaps if multiple blast searches of the same zotus are combined together then it would remove duplicates, but any one blast search of any 1 zotu shouldn't return duplicate seqid. No?

mhoban commented 1 month ago

Good catch, thanks.

I put this in because I was finding cases where zOTU sequences were getting multiple matches to the same sequence ID, usually when that was something long like a mitogenome or (more typically) a bacterial chromosome. Since the goal was taxonomy assignment, I thought it would be prudent to just keep one of each zOTU-seqid match, but to retain the best one if there were multiple hits.

Ultimately, these cases (multiple seqid matches) are probably due to the blastn parameters being passed. Right now, some of these are baked in to the pipeline (copied from the original eDNAFlow code), but I plan to make it so the user can modify any parameter they want using --blast-<blastn_parameter> (c.f. #74).

This is an aside, but I (like many) still struggle with the blastn command line options. They're poorly documented and sometimes they don't do what the documentation (should it exist) implies they do. In addition, the options on the web-based BLAST don't intuitively match the command-line options for blastn (even though they are ostensibly the same tool), such that I've almost never successfully had the results of a command-line blast run match those of a web-based blast run.

mhoban commented 1 month ago

I think that evalue should have 1st priority over pident

I think you're right. I'll take your suggestion in addition to the changes to the filtering logic.

An alternative to the hierarchical prioritization of these 4 metrics would be to just keep the best escore, pident, bitscore, mismatch (and qcov) for a given taxid across all seqid. As an example, if you have a zotu*taxid with 2 seqids that yield the following (qcov,pident) pairs: 99,100 and 100,99, you could save 100,100.

So rather than keeping the one result with the best hierarchically-prioritized scores, collapse the record by grouping factor and keep all the best values? I can see how this makes sense for the zotu x seqid grouping, but I'm not sure it makes as much sense for the zotu x taxid grouping, since presumably the grouped results are different sequences.

cbird808 commented 1 month ago

My primary argument was for evalue to have highest influence in determining which seqid was used to represent the relationship between the zotu and the taxon.

While I didn't put much thought into the second idea, and you should feel free to ignore it, the basic premise was that there are alternative ways of summarizing all of the retained seqids within a taxon within a zotu other than identifying the seqid(s) with the lowest evalue, highest pident, etc. Once there's 1 row per taxid per zotu, the blast values represent the relationship between the zotu and the taxon and could be based on more the seqid(s) with the highest pident (e.g. the mean, or whatever metric you want). I wouldn't have made this comment if you had evalue as the primary decider, it was a secondary option to buffer potential pitfalls of relying on pident first.

From: mykle hoban @.> Sent: Tuesday, September 24, 2024 4:08 PM To: mhoban/rainbow_bridge @.> Cc: Bird, Chris @.>; Author @.> Subject: Re: [mhoban/rainbow_bridge] Overly stringent filtering in collapse_taxonomy.R (Issue #97)

I think that evalue should have 1st priority over pident

I think you're right. I'll take your suggestion in addition to the changes to the filtering logic.

An alternative to the hierarchical prioritization of these 4 metrics would be to just keep the best escore, pident, bitscore, mismatch (and qcov) for a given taxid across all seqid. As an example, if you have a zotu*taxid with 2 seqids that yield the following (qcov,pident) pairs: 99,100 and 100,99, you could save 100,100.

So rather than keeping the one result with the best hierarchically-prioritized scores, collapse the record by grouping factor and keep all the best values? I can see how this makes sense for the zotu x seqid grouping, but I'm not sure it makes as much sense for the zotu x taxid grouping, since presumably the grouped results are different sequences.

- Reply to this email directly, view it on GitHubhttps://github.com/mhoban/rainbow_bridge/issues/97#issuecomment-2372388352, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ADBV4SZ3I372SWRTJZUAU63ZYHIC5AVCNFSM6AAAAABOXCVYRKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGNZSGM4DQMZVGI. You are receiving this because you authored the thread.Message ID: @.**@.>>

mhoban commented 1 month ago

My primary argument was for evalue to have highest influence in determining which seqid was used to represent the relationship between the zotu and the taxon.

Yes, that's how it'll be implemented.

Initially (and I think this may be something of an edge case), we'll summarize multiple hits to the same sequence and retain the best version of each score. Then, when choosing sequences for taxids to represent the zotu, evalue will be prioritized first, followed by pident, bitscore, and mismatch. Maybe this priority list can ultimately be customizable.

mhoban commented 1 month ago

This should now be addressed (and merged into main)