Closed dhoogest closed 1 year ago
@dhoogest - looks like an appropriate solution to me. Why have we never run into this before? Because we added adaptor trimming or did we just never see it?
I think that we'd really only see the phenomenon on the R1/R2-only pools (which we've only started to work with recently). Merged sequences wouldn't exhibit the problem, since they would already be combined into the longer sequence upstream of the combine_sv phase.
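Followup from https://gitlab.labmed.uw.edu/molmicro/clampi-ngs/-/issues/61: we think that, when performing the combine_svs step on unmerged reads, it is more compatible to go with the `iddef=0` strategy for clustering using `vsearch cluster_size`. Here is the strategy definition list from the vsearch manual for reference:

- 0: CD-HIT definition: (matching columns) / (shortest sequence length)
- 1: edit distance: (matching columns) / (alignment length)
- 2: edit distance excluding terminal gaps (the default): (matching columns) / (alignment length - terminal gaps)
- 3: Marine Biological Lab definition, counting each gap opening (internal or terminal) as a single mismatch, whether or not the gap was extended: 1.0 - [(mismatches + gap openings) / (longest sequence length)]
- 4: BLAST definition, equivalent to 1 for global pairwise alignments

The issue linked above shows a liability for `iddef=2` on unmerged reads: since terminal gaps are not counted when calculating the `id`, sequences with overlap on their ends may cluster. Switching to `iddef=0` ensures that the full length of the shortest sequence within a pairwise comparison is used as the denominator when calculating `id`, which should protect against the 'overhang on both ends of aligned region' we're seeing currently (at the expense of a greater number of total svs).

/cc @nhoffman @crosenth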
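To make the difference in denominators concrete, here is a minimal Python sketch that scores one already-aligned pair under `iddef` 0, 1 and 2. It is an illustration only, not vsearch's implementation: the function name `identity`, the toy reads, and the assumption that the aligner produces exactly this staggered alignment are all hypothetical.

```python
def identity(row_a: str, row_b: str, iddef: int) -> float:
    """Identity of two gapped alignment rows of equal length ('-' = gap).

    Hypothetical helper covering only iddef 0, 1 and 2:
      0: matches / length of the shorter ungapped sequence
      1: matches / alignment length
      2: matches / (alignment length - terminal gap columns)
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    cols = list(zip(row_a, row_b))
    matches = sum(1 for x, y in cols if x == y and x != "-")

    if iddef == 0:
        shortest = min(len(row_a.replace("-", "")), len(row_b.replace("-", "")))
        denom = shortest
    elif iddef == 1:
        denom = len(cols)
    elif iddef == 2:
        # Drop leading and trailing columns in which either row is a gap.
        start, end = 0, len(cols)
        while start < end and "-" in cols[start]:
            start += 1
        while end > start and "-" in cols[end - 1]:
            end -= 1
        denom = end - start
    else:
        raise ValueError("only iddef 0, 1 and 2 are sketched here")

    return matches / denom if denom else 0.0


# Two toy 12-base reads that overlap only on their ends (8 shared bases),
# i.e. an overhang on both sides of the aligned region.
read_a = "AAAACCCCGGGG----"
read_b = "----CCCCGGGGTTTT"

for mode in (0, 1, 2):
    print(f"iddef={mode}: id={identity(read_a, read_b, mode):.3f}")
```

In this toy case the 8-column overlap is a perfect match, so `iddef=2` scores the pair at 1.0 and it would pass any `id` threshold, while `iddef=0` divides the same 8 matches by the 12-base read length (about 0.667) and the pair would stay in separate clusters at a typical threshold.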