lentendu / DeltaMP

A flexible, reproducible and resource efficient metabarcoding amplicon pipeline for HPC
GNU General Public License v3.0

Feature: similarity-based clustering for paired-end sequences that cannot be assembled #175

Open lentendu opened 1 year ago

lentendu commented 1 year ago

Because of poor 3'-end quality, the read pairs of some datasets cannot be assembled. Add an option to force the analysis to proceed even without paired-end assembly. A possible approach for dissimilarity-based clustering:

  1. optimize length trimming based on maxEE/quality
  2. if the number of paired-end assembled sequences is too low and the force option is set, skip paired-end assembly at the trim step
  3. dereplicate and compute all pairwise dissimilarities separately for each fragment
  4. resolve the unique set of assembled dereplicated sequences from both strands (and add a gap of xx length between the two fragments)
  5. average the dissimilarity for every pair of assembled dereplicated sequences, weighted by fragment length
  6. cluster (sumaclust or MCL) on the averaged dissimilarity
  7. vsearch --usearch_global with no gap extension penalty for the query sequence (--gapext 2IT/0IQ/1E) --> to be tested
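
Steps 3 to 5 could be sketched roughly as follows. This is only an illustration under stated assumptions, not DeltaMP code: all function names are hypothetical, each read pair is represented as a `(forward, reverse)` tuple of already trimmed fragments, and a simple Hamming distance stands in for the alignment-based dissimilarity that vsearch or sumaclust would actually provide.

```python
from collections import Counter
from itertools import combinations


def hamming_dissimilarity(a: str, b: str) -> float:
    """Fraction of mismatching positions (stand-in for an alignment-based
    dissimilarity); assumes both fragments were trimmed to the same length."""
    if len(a) != len(b):
        raise ValueError("fragments must be trimmed to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def dereplicate(pairs):
    """Step 3/4: collapse identical (forward, reverse) pairs and keep counts."""
    return Counter(pairs)


def weighted_pair_dissimilarity(pair_a, pair_b) -> float:
    """Step 5: average the forward and reverse fragment dissimilarities,
    weighting each one by the length of its fragment."""
    (fwd_a, rev_a), (fwd_b, rev_b) = pair_a, pair_b
    d_fwd = hamming_dissimilarity(fwd_a, fwd_b)
    d_rev = hamming_dissimilarity(rev_a, rev_b)
    len_fwd, len_rev = len(fwd_a), len(rev_a)
    return (len_fwd * d_fwd + len_rev * d_rev) / (len_fwd + len_rev)


def pairwise_matrix(unique_pairs):
    """Weighted dissimilarity for every pair of dereplicated read pairs,
    returned as a dict keyed by index pairs (i, j) with i < j."""
    return {
        (i, j): weighted_pair_dissimilarity(unique_pairs[i], unique_pairs[j])
        for i, j in combinations(range(len(unique_pairs)), 2)
    }
```

The resulting sparse matrix is the kind of input a clustering step (step 6) would consume. Weighting by length means that a longer, better-covered fragment contributes proportionally more to the combined dissimilarity than a heavily trimmed one.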