Closed cbird808 closed 1 month ago
For now, I'll make this filtering step optional (via command-line option), and think about what an option to control the filter's stringency would look like:
option_list <- list(
  ...
  make_option(c("-f", "--filter-max-qcov"), action="store_true", default=FALSE, type='logical', help="Retain only records with the highest query coverage"),
  ...
)
filtered <- filtered %>%
  # group by zotu
  group_by(zotu) %>%
  # conditionally retain only the highest query coverage within each zotu
  { if (opt$options$filter_max_qcov) filter(., qcov == max(qcov)) else . } %>%
  # now calculate difference between each and the max pident within each zotu
  mutate(diff = abs(pident - max(pident))) %>%
  # discard anything with a difference over the threshold within each zotu
  filter(diff < diff_thresh) %>%
  ungroup()
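The thread does not show how the options are parsed, so the following is only a sketch of one way the flag could end up as `opt$options$filter_max_qcov`. It assumes the optparse package and that `parse_args()` is called with `positional_arguments = TRUE` and `convert_hyphens_to_underscores = TRUE`; the `parse_cli` helper is hypothetical and not part of the pipeline.

```r
library(optparse)

option_list <- list(
  make_option(c("-f", "--filter-max-qcov"), action = "store_true", default = FALSE,
              type = "logical", help = "Retain only records with the highest query coverage")
)

# Hypothetical wrapper around parse_args().
# positional_arguments = TRUE returns list(options = ..., args = ...), and
# convert_hyphens_to_underscores = TRUE exposes --filter-max-qcov as filter_max_qcov.
parse_cli <- function(args) {
  parse_args(OptionParser(option_list = option_list), args = args,
             positional_arguments = TRUE, convert_hyphens_to_underscores = TRUE)
}

opt_on  <- parse_cli(c("--filter-max-qcov"))  # flag given: filter enabled
opt_off <- parse_cli(character(0))            # flag omitted: default FALSE
```

With this setup the filter stays off unless `-f` or `--filter-max-qcov` is passed on the command line.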
This should now be addressed (and merged into main).
Hi @mhoban,
While investigating why we are losing so many zotu in LCA, or getting zotu assigned to more specific taxa than we thought prudent, I noticed that the qcov filter in the block of code was really wiping out the taxonomic diversity within the range of the diff_threshold. As an example, if one taxid had 95 pident and 100 qcov, while another taxid had 96 pident and 99 qcov, the latter taxon with the higher pident was eliminated by the hardcoded qcov filter before the diff_threshold filter even had a chance to decide.
One potential solution is to just let the diff filter do the filtering.
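To make the example concrete, here is a small sketch that runs the two hits described above through the pipeline with and without the qcov step. The taxid labels, the zotu name, the `filter_hits` helper, and the threshold value of 2 are all made up for illustration; only the pident and qcov values come from the example.

```r
library(dplyr)

# Toy hits for a single zotu, using the pident/qcov values from the example.
hits <- data.frame(
  zotu   = c("zotu1", "zotu1"),
  taxid  = c("A", "B"),
  pident = c(95, 96),
  qcov   = c(100, 99)
)
diff_thresh <- 2  # assumed value; the real threshold is set elsewhere

# Hypothetical helper: the diff filter, optionally preceded by the max-qcov filter.
filter_hits <- function(df, filter_max_qcov) {
  df %>%
    group_by(zotu) %>%
    { if (filter_max_qcov) filter(., qcov == max(qcov)) else . } %>%
    mutate(diff = abs(pident - max(pident))) %>%
    filter(diff < diff_thresh) %>%
    ungroup()
}

with_qcov    <- filter_hits(hits, TRUE)   # taxid B is dropped before the diff filter runs
without_qcov <- filter_hits(hits, FALSE)  # both taxids survive (pident difference of 1 < 2)
```

With the qcov step enabled, only taxid A reaches the diff filter, so the hit with the higher pident never gets considered. With it disabled, both taxids are retained and LCA sees the full set.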
Alternatively, adding an option/argument to control the stringency of the qcov filter could also work, but it would take more work.
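One way such a stringency control might look is a tolerance around the per-zotu maximum qcov. This is only a sketch of the idea, not something proposed in the thread: the `filter_qcov_tol` helper and its `qcov_tol` parameter are hypothetical names, and the toy data reuse the values from the example above.

```r
library(dplyr)

# Hypothetical helper: keep hits whose qcov is within `qcov_tol` points of the
# per-zotu maximum. qcov_tol = 0 reproduces the strict max-qcov filter, and
# qcov_tol = Inf effectively disables it.
filter_qcov_tol <- function(df, qcov_tol = 0) {
  df %>%
    group_by(zotu) %>%
    filter(qcov >= max(qcov) - qcov_tol) %>%
    ungroup()
}

hits <- data.frame(
  zotu   = c("zotu1", "zotu1"),
  taxid  = c("A", "B"),
  pident = c(95, 96),
  qcov   = c(100, 99)
)

strict  <- filter_qcov_tol(hits, qcov_tol = 0)  # only the 100 qcov hit remains
relaxed <- filter_qcov_tol(hits, qcov_tol = 1)  # the 99 qcov hit is kept as well
```

A single numeric argument like this would cover both the current strict behavior and the "let the diff filter decide" behavior, at the cost of one more parameter to tune.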