Missing a length filter cross-reference in `cluster-features-*` actions

qiime2 / q2-vsearch

vsearch plugin for QIIME 2

BSD 3-Clause "New" or "Revised" License


Issue #69

Closed by thermokarst 4 years ago

thermokarst commented 4 years ago

Bug Description

vsearch apparently applies a minimum length filter of 32 nt to input sequences. Our `cluster-features-*` actions appear to assume that vsearch will not filter any reads, so no cross-referencing or post-vsearch filtering is applied.

Steps to reproduce the behavior

Please see reference 1, below.

Expected behavior

I see at least two ways to solve this, detailed in questions 1 and 2, below.

Questions

1. Should we solve this by applying post-vsearch filtering? If so, how should we report the filtered sequences back to the user? Is this a new output, or is it lumped in with one of the existing outputs?
2. Should (can?) we solve this by removing the min-length filter on vsearch?
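To make question 1 concrete, here is a minimal sketch of what a cross-reference could look like: it parses FASTA text, then splits the sequence IDs into those at or above a length threshold and those below it. The function names (`read_fasta_lengths`, `split_by_min_length`) are hypothetical and are not part of q2-vsearch; the default of 32 simply mirrors the vsearch threshold described in this issue.

```python
def read_fasta_lengths(lines):
    """Map each sequence ID in FASTA-formatted lines to its total length."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            # Sequences may wrap across several lines, so accumulate.
            lengths[current] += len(line)
    return lengths


def split_by_min_length(lengths, min_length=32):
    """Return (kept, dropped) ID sets; IDs shorter than min_length are dropped.

    min_length=32 is an assumption mirroring the vsearch default
    discussed in this issue.
    """
    kept = {seq_id for seq_id, n in lengths.items() if n >= min_length}
    dropped = set(lengths) - kept
    return kept, dropped


if __name__ == "__main__":
    fasta = [">a", "ACGT" * 10, ">b short read", "ACGT" * 4, "ACGT"]
    kept, dropped = split_by_min_length(read_fasta_lengths(fasta))
    print("kept:", sorted(kept))
    print("dropped:", sorted(dropped))
```

The `dropped` set is what would need to be reported back to the user, whether as a new output or folded into an existing one.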

References

1. https://forum.qiime2.org/t/error-when-renning-cluster-features-de-novo/14878

nbokulich commented 4 years ago

In case this helps direct decision-making on this issue (esp. re: question 1), I have added a method to one of my plugins that uses vsearch to filter a FeatureData[Sequence] artifact by length. I am planning on releasing this in 2020.6.

So if it's possible, you could disable the min-length filter (or set it to the lowest threshold possible), and users can apply their own length filter post-clustering if desired.

torognes commented 4 years ago

Yes, vsearch applies a minimum sequence length filter of 32 nucleotides for clustering, dereplication and search commands (cluster_smallmem, cluster_fast, cluster_size, cluster_unoise, derep_fulllength, derep_prefix, makeudb_usearch, sintax, usearch_global) and 1 for other commands. This was implemented for maximum compatibility with usearch (version 7). It can be turned off with the option --minseqlength 1 for the commands where it is relevant.

thermokarst commented 4 years ago

Thanks @torognes!

@Oddant1 - can you please work on this bug when you get a chance? It would be great if we could resolve it in time for 2020.6. I think using the --minseqlength 1 flag in the internal calls to vsearch should handle this (which means we go with option 2 from my original post, as long as @nbokulich agrees).
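As a rough illustration of option 2, the sketch below builds a vsearch argument list with `--minseqlength 1` included. It is not the actual q2-vsearch code: the helper name `build_cluster_cmd`, its parameters, and the choice of `--cluster_size` as the example command are assumptions made for illustration.

```python
def build_cluster_cmd(sequences_fp, centroids_fp, perc_identity,
                      threads=1, min_seq_length=1):
    """Build an argument list for a vsearch clustering call.

    Hypothetical helper: min_seq_length=1 turns off the default
    32-nucleotide filter described in this thread.
    """
    return [
        "vsearch",
        "--cluster_size", sequences_fp,
        "--id", str(perc_identity),
        "--centroids", centroids_fp,
        "--minseqlength", str(min_seq_length),
        "--threads", str(threads),
    ]


if __name__ == "__main__":
    cmd = build_cluster_cmd("seqs.fasta", "centroids.fasta", 0.97)
    # The list could then be passed to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Keeping the threshold as a parameter rather than a hard-coded flag would leave room to expose it to users later if that turns out to be useful.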