modify iddef strategy to 0 in combine_svs action

dhoogest commented 1 year ago

Followup from https://gitlab.labmed.uw.edu/molmicro/clampi-ngs/-/issues/61, we think it is more compatible when performing the combine_svs step on unmerged reads to go with the iddef=0 strategy for clustering using vsearch cluster_size. Here is the strategy definition list from the vsearch manual for reference:

--id real Do not add the target to the cluster if the pairwise identity with the centroid is lower than real (value ranging from 0.0 to 1.0 included). The pairwise identity is defined as the number of (matching columns) / (alignment length - terminal gaps). That definition can be modified by --iddef.

--iddef 0|1|2|3|4

Change the pairwise identity definition used in --id. Values accepted are:

CD-HIT definition: (matching columns) / (shortest sequence length).

edit distance: (matching columns) / (alignment length).

edit distance excluding terminal gaps (same as --id).

Marine Biological Lab definition counting each gap opening (internal or terminal) as a single mismatch, whether or not the gap was extended: 1.0

[(mismatches + gap openings)/(longest sequence length)]

BLAST definition, equivalent to --iddef 1 in a context of global pairwise alignment.

The issue linked above shows a liability for iddef=2 on unmerged reads - basically since terminal gaps are not counted when calculating the id, sequences with overlap on their ends may cluster. Switching to iddef=0 ensures that the full length of the shortest sequence within a pairwise comparison is used as the denominator when calculating id, which should protect against the 'overhang on both ends of aligned region' we're seeing currently (at the expense of greater number of total svs)

/cc @nhoffman @crosenth

nhoffman commented 1 year ago

@dhoogest - looks like an appropriate solution to me. Why have we never run into this before? Because we added adaptor trimming or did we just never see it?

dhoogest commented 1 year ago

@dhoogest - looks like an appropriate solution to me. Why have we never run into this before? Because we added adaptor trimming or did we just never see it?

I think that we'd really only see the phenomenon on the R1/R2 only pools (which we've only started to work with recently) - merged sequences wouldn't exhibit the problem given that they would already be combined into the longer sequence upstream of the combine_sv phase.

nhoffman / dada2-nf

modify iddef strategy to 0 in combine_svs action #73