transcript / samsa2

SAMSA pipeline, version 2.0. An open-source metatranscriptomics pipeline for analyzing microbiome data, built around DIAMOND and customizable reference databases.
GNU General Public License v3.0

Read duplication #55

Closed mweberr closed 4 years ago

mweberr commented 4 years ago

Hi all, do you have experience with duplicated reads? The annotation step with DIAMOND is extremely slow because of the high number of duplicated reads. Do you have any ideas for deduplicating the dataset before the annotation step?

Best, Michael

transcript commented 4 years ago

Hi Michael,

It's true that there's a high number of duplicated reads. One approach could be to deduplicate or bin the reads before annotation, using a tool like BBDuk, MaxBin, MetaBAT, CONCOCT, or BinSanity (they've got some great names!).

However, this will cause downstream challenges, because my pipeline currently doesn't refer back to the bin counts when estimating the amount of activity for each activity in the sample. Each bin would be counted once rather than once per original read, so if you bin, it will skew the percentages of each activity for the overall sample.
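To illustrate the count-tracking problem described above, here is a minimal sketch (not part of SAMSA2) that collapses exact-duplicate reads while remembering how many copies each kept read stands for. The function name `collapse_exact_duplicates` and the in-memory `(read_id, sequence)` input are assumptions made for illustration. It only catches byte-identical sequences, which is much stricter than what a dedicated deduplication or binning tool does.

```python
def collapse_exact_duplicates(reads):
    """Collapse reads with identical sequences.

    reads: iterable of (read_id, sequence) pairs (hypothetical input shape).
    Returns (representatives, copies):
      representatives: list of (read_id, sequence), first occurrence of each sequence
      copies: dict mapping each representative read_id to its copy count
    """
    seq_to_rep = {}        # sequence -> read_id of the first read seen with it
    representatives = []   # reads that would be passed on to annotation
    copies = {}            # representative read_id -> number of original reads

    for read_id, seq in reads:
        rep = seq_to_rep.get(seq)
        if rep is None:
            # First time this sequence appears: keep it as the representative.
            seq_to_rep[seq] = read_id
            representatives.append((read_id, seq))
            copies[read_id] = 1
        else:
            # Exact duplicate: drop the read but remember it was there.
            copies[rep] += 1
    return representatives, copies


# Toy example: three identical reads and one unique read.
reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "TTGA"), ("r4", "ACGT")]
representatives, copies = collapse_exact_duplicates(reads)
```

Only the representatives would go to the annotation step. The `copies` table is what would have to be carried along so that abundances are not skewed afterwards.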

There are some really interesting tools coming out, like SqueezeMeta, which are smart enough to perform binning before the annotation stage, greatly reducing the memory cost and speeding up the annotation.

If you're looking for a faster method, I'd suggest checking it out. I've been thinking about adding an optional binning step before the annotation, but I'm not likely to have the dedicated time to build this in the super-near future, and it will take some thought to reconnect the annotations back to the bins.
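One possible shape for the "reconnect" step, sketched here as an assumption rather than anything SAMSA2 currently does: if a table of copy counts per kept read is available, each annotation hit can be weighted by that count when tallying. The sketch assumes BLAST-style tabular output (query ID in the first column, subject ID in the second), one reported hit per query, and a hypothetical `copies` dictionary mapping each kept read ID to the number of original reads it represents.

```python
from collections import Counter


def weighted_hit_counts(tabular_lines, copies):
    """Tally subject hits, weighting each query by its copy count.

    tabular_lines: iterable of tab-separated lines, query ID in column 1
                   and subject ID in column 2 (BLAST tabular layout).
    copies: dict of query read_id -> number of original reads it represents
            (hypothetical table produced by an earlier collapsing step).
    """
    totals = Counter()
    for line in tabular_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        query_id, subject_id = fields[0], fields[1]
        # Reads missing from the table are assumed to stand for themselves only.
        totals[subject_id] += copies.get(query_id, 1)
    return totals


# Toy example: r1 stands for three original reads, r3 for one.
hits = ["r1\tgeneA\t98.0", "r3\tgeneB\t91.5"]
copies = {"r1": 3, "r3": 1}
totals = weighted_hit_counts(hits, copies)
```

With weights applied, the totals add up to the original read count, so percentages computed from them reflect the full sample rather than the collapsed one.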