Read duplication - GitHub Issues

Hi Michael,

It's true that there's a high number of duplicated reads, and one approach could be to deduplicate or bin them using a tool like BBDuk, MaxBin, MetaBAT, CONCOCT, or BinSanity (they've got some great names!).

However, this will cause downstream challenges, because my pipeline currently doesn't refer back to the bin counts when estimating the amount of each activity in the sample. So if you bin, each bin counts once no matter how many reads it holds, and that will skew the percentages of each activity for the overall sample.
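To make the skew concrete, here's a rough toy sketch in Python. It isn't code from the pipeline: the reads, sequences, function names, and the `percentages` helper are all made up for illustration, and "binning" here just means collapsing identical sequences. It compares per-read percentages, percentages after collapsing without the bin counts, and percentages after weighting each bin by its size.

```python
from collections import Counter

# Hypothetical toy data: (read sequence, annotated function).
# These values are invented for illustration only.
reads = (
    [("AAA", "ABC transporter")] * 5
    + [("AAT", "ABC transporter")] * 1
    + [("CCC", "ribosomal protein")] * 3
    + [("GGG", "chaperone")] * 1
)

def percentages(counts):
    """Convert raw counts per function into percentages of the total."""
    total = sum(counts.values())
    return {function: 100 * n / total for function, n in counts.items()}

# 1. Count every read: this is the picture of the overall sample.
per_read = percentages(Counter(function for _, function in reads))

# 2. Collapse identical sequences into bins, remembering each bin's size.
bin_sizes = Counter(seq for seq, _ in reads)
bin_function = dict(reads)  # one annotation per representative sequence

# Counting each bin once (ignoring its size) distorts the percentages.
naive = percentages(Counter(bin_function[seq] for seq in bin_sizes))

# 3. Reconnect annotations to bins: weight each bin by its read count.
weighted_counts = Counter()
for seq, size in bin_sizes.items():
    weighted_counts[bin_function[seq]] += size
reweighted = percentages(weighted_counts)

print("per read:  ", per_read)
print("naive bins:", naive)
print("reweighted:", reweighted)
```

In this toy case the chaperone share goes from 10% of reads to 25% of bins, and carrying the bin sizes through recovers the per-read numbers.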

There are some really interesting tools coming out, like SqueezeMeta, which are smart enough to perform binning before the annotation stage, greatly reducing the memory cost and speeding up the annotation.

If you're looking for a faster method, I'd suggest checking it out. I've been thinking about adding an optional binning step before the annotation, but I'm unlikely to have the dedicated time to build it in the near future, and it will take some thought to reconnect the annotations back to the bins.

transcript / samsa2

Read duplication #55