mhoban / rainbow_bridge

GNU General Public License v3.0
5 stars 2 forks source link

`qcov` filter in `collapse_taxonomy.R` virtually negates the function of `diff_thresh` #98

Closed cbird808 closed 1 month ago

cbird808 commented 2 months ago

Hi @mhoban ,

I've been investigating why we are losing so many zotu in LCA or getting assignments of zotu to more specific taxa than we thought prudent, I noticed that the qcov filter in the block of code was really wiping out the taxonomic diversity with the range of the diff_threshold. As an example, if one taxid had 95 pident and 100 qcov, while another taxid had 96 pident and 99 qcov, the latter taxon with the higher pident was eliminated by the hardcoded qcov filter before the diff_threshold filter even had a chance to decide.

# now retain only assignments within the specified percent ID difference threshold
filtered <- filtered %>%
  # group by zotu
  group_by(zotu) %>%
  # retain only the highest query coverage (often 100%) within each zotu
  filter(qcov == max(qcov)) %>% 
  # now calculate difference between each and the max pident within each zotu
  mutate(diff = abs(pident - max(pident))) %>%
  # discard anything with a difference over the threshold within each zotu
  filter(diff < diff_thresh) %>%
  ungroup() 

one potential solution is to just let the diff filter do the filtering:

# now retain only assignments within the specified percent ID difference threshold
filtered <- filtered %>%
  # group by zotu
  group_by(zotu) %>%
  # retain only the highest query coverage (often 100%) within each zotu
  # filter(qcov == max(qcov)) %>%  #CEB qcov not a great arbitor here
  # now calculate difference between each and the max pident within each zotu
  mutate(diff = abs(pident - max(pident))) %>%
  # discard anything with a difference over the threshold within each zotu
  filter(diff < diff_thresh) %>%
  ungroup() 

alternatively, adding an option/argument to control the stringency of the qcov filter could also work, but is more work.

mhoban commented 1 month ago

For now, I'll make this filtering step optional (via command-line option), and think about what an option to control the filter's stringency would look like:

option_list <- list(
  ...
  make_option(c("-f", "--filter-max-qcov"), action="store_true", default=FALSE, type='logical', help="Retain only records with the highest query coverage"),
  ...
)
filtered <- filtered %>%
  # group by zotu
  group_by(zotu) %>%
  # conditionally retain only the highest query coverage within each zotu
  { if (opt$options$filter_max_qcov) filter(.,qcov == max(qcov)) else . } %>%
  # now calculate difference between each and the max pident within each zotu
  mutate(diff = abs(pident - max(pident))) %>%
  # discard anything with a difference over the threshold within each zotu
  filter(diff < diff_thresh) %>%
  ungroup() 
mhoban commented 1 month ago

This should now be addressed (and merged into main)