naobservatory / mgs-pipeline


Figure out deduplication #3

Closed jeffkaufman closed 1 month ago

jeffkaufman commented 1 year ago

The sequencing process can occasionally give duplicates, where one fragment gets sequenced twice. These aren't independent observations, and it's common for people to remove duplicates before processing the data. In reasonably deep sequencing of short fragments of common things (ex: Tobamoviruses), however, it seems like 'duplicates' are often real. It would be good to have some way of taking this into account while still being appropriately quantitative for things that have a high probability of yielding genuinely identical fragments by chance.

One idea would be to only apply deduplication to reads representing human-infecting viruses, since these are consistently very low relative abundance and, with the sequencing depths we have, identical reads are very likely to be sequencing artifacts?
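For concreteness, here is a minimal sketch of what a filter like that could look like: exact-sequence deduplication applied only to reads already classified as human-viral. The `(read_id, sequence)` pairs and the `is_human_viral` predicate are placeholders for illustration, not the pipeline's actual data structures.

```python
def dedup_exact(reads):
    """Collapse reads with identical sequences, keeping the first occurrence.

    `reads` is an iterable of (read_id, sequence) pairs; both names are
    illustrative, not taken from the pipeline.
    """
    seen = set()
    kept = []
    for read_id, seq in reads:
        if seq not in seen:
            seen.add(seq)
            kept.append((read_id, seq))
    return kept


def dedup_human_viral_only(reads, is_human_viral):
    """Deduplicate only the reads flagged as human-viral; leave everything
    else untouched, on the theory that high-abundance taxa can plausibly
    produce genuinely identical reads."""
    human_viral = [(rid, seq) for rid, seq in reads if is_human_viral(rid)]
    other = [(rid, seq) for rid, seq in reads if not is_human_viral(rid)]
    return dedup_exact(human_viral) + other
```

In practice one would probably want to deduplicate on alignment coordinates or read pairs rather than raw sequence, but the overall shape would be the same.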

mikemc commented 1 year ago

Initial thoughts:

One argument for ignoring this for now is that stochastic effects of PCR add noise that in principle can be captured by our statistical model, and so should result in more uncertainty in our estimates (and will decrease with more independently sequenced samples), which seems preferable to doing this incorrectly in a way that adds unknown bias (systematic error) leading to over-confident biased estimates.

Looking in depth at some examples where FastQC and/or our own version reports high duplicate levels can help us understand the scale of the problem and the potential mechanisms; doing this seems useful for downstream interpretation even if we don't implement the filtering.

Running the reads through a tool like Ribodetector to classify reads coming from the bacterial ribosomal genes seems like a useful step in understanding the mechanism, as these were ~45% of reads in Rothman and come from just a couple of genes (those genes vary among bacterial species, but little enough that reads from different species would still get flagged as duplicates by the FastQC approach of looking at the first ~75bp).

Having diagnostic statistics for all datasets about the duplicates flagged by a simple algorithm like the FastQC one is useful for understanding the data even if we don't implement filtering based on it.
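For reference, a rough sketch of what such diagnostics could look like, grouping reads on a fixed-length prefix in the spirit of the heuristic described above. The 75 bp default comes from this thread, and the function and field names are illustrative; FastQC's actual duplication module differs in its details.

```python
from collections import Counter


def prefix_duplicate_stats(sequences, prefix_len=75):
    """Flag duplicates by grouping reads on their first `prefix_len` bases
    and report simple diagnostics: total reads, reads flagged as duplicates,
    and the most over-represented prefixes."""
    counts = Counter(seq[:prefix_len] for seq in sequences)
    total = sum(counts.values())
    flagged = sum(n - 1 for n in counts.values() if n > 1)
    return {
        "total_reads": total,
        "flagged_duplicates": flagged,
        "duplicate_fraction": flagged / total if total else 0.0,
        "top_prefixes": counts.most_common(10),
    }
```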

> One idea would be to only apply deduplication to reads representing human-infecting viruses, since these are consistently very low relative abundance and, with the sequencing depths we have, identical reads are very likely to be sequencing artifacts?

Not something to rush into, but thinking through what an 'ideal' approach might look like...

An initial stab at implementing this behavior using a more general-purpose/data-driven approach could be to use an iterative procedure, where we estimate the relative abundances (RAs) of the viruses from the non-dedup'd data, then use that to estimate the number of expected alignments to a location and thus estimate the number of hits that are real vs duplicated. Removing the duplicates would change the implied RAs. This could be implemented as an Expectation-Maximization algorithm that runs through a few iterations. This approach doesn't account for the large variation in coverage across the genome though.
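To make the shape of that procedure concrete, here is a toy sketch under strong simplifying assumptions: uniform coverage within each genome, a Poisson model for the number of real reads starting at any given position, and per-taxon counts keyed by alignment start. None of the names or data structures come from the pipeline, and (as noted above) the uniform-coverage assumption is the weak point.

```python
import math


def em_style_dedup(position_counts, genome_lengths, total_reads, n_iter=10):
    """Toy iterative (EM-flavoured) duplicate correction.

    position_counts: {taxon: {alignment start position: # identical reads}}
    genome_lengths:  {taxon: genome length in bp}
    total_reads:     total reads in the sample, used to turn relative
                     abundance into expected reads per start position
    """
    # Start from the non-deduplicated counts.
    corrected = {t: sum(c.values()) for t, c in position_counts.items()}

    for _ in range(n_iter):
        # Estimate relative abundances from the current corrected counts.
        denom = sum(corrected.values())
        ra = {t: corrected[t] / denom for t in corrected}

        new_corrected = {}
        for t, counts in position_counts.items():
            # Expected real reads per start position under uniform coverage.
            lam = total_reads * ra[t] / genome_lengths[t]
            # E[Poisson(lam) | at least one read], capped at the observed count:
            # at low abundance this is ~1, so extra identical reads are treated
            # as duplicates; at high abundance many identical reads count as real.
            expected_real = lam / (1.0 - math.exp(-lam)) if lam > 0 else 1.0
            new_corrected[t] = sum(min(k, expected_real) for k in counts.values())
        corrected = new_corrected

    return corrected
```

The corrected counts and the implied relative abundances feed back into each other, which is what makes the EM framing natural; a real version would also need a coverage model that allows the large position-to-position variation mentioned above.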

jeffkaufman commented 1 month ago

We ended up implementing dedup on human-viral reads only.