grenaud / schmutzi

Maximum a posteriori estimate of contamination for ancient samples

98-99% contamination estimates #28

Closed lnpott9 closed 3 weeks ago

lnpott9 commented 4 weeks ago

Hello! I am getting really high contamination estimates for several samples that have no other indication of contamination during extraction or library prep (and other samples from the same batches do not have high contamination estimates). They range in read depth from 3.3X to 1695X and in breadth of coverage from 77% to 100% (mapped to the rCRS).
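For reference, the two numbers quoted above (mean depth and breadth of coverage) can be computed from a per-position depth profile. This is a minimal stdlib-only sketch with a made-up helper name; a real pipeline would get the per-position depths from something like `samtools depth`:

```python
def depth_breadth(depths):
    """Return (mean read depth, breadth of coverage) for a genome,
    given a list with one depth value per reference position.
    Breadth = fraction of positions covered by at least one read."""
    if not depths:
        return 0.0, 0.0
    mean_depth = sum(depths) / len(depths)
    breadth = sum(1 for d in depths if d > 0) / len(depths)
    return mean_depth, breadth

# Toy 10-position "genome":
d, b = depth_breadth([5, 3, 0, 7, 2, 0, 4, 6, 1, 2])
# d == 3.0 (mean depth), b == 0.8 (80% of positions covered)
```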

I ran them all through contDeam, followed by schmutzi with --notusepredC, then schmutzi without --notusepredC using the worldwide allele frequency database, then schmutzi with --notusepredC using the worldwide database.

Two of the samples that had 99% contamination estimates with --notusepredC dropped to 1% after running schmutzi without --notusepredC. However, 12 of the samples still had estimates of 99%.

I ran those through schmutzi a third time with --notusepredC and using the worldwide database, but several of the samples still have estimates of 99%. Would it help to filter them using a program like PMDtools and run them through schmutzi again? Or are the contamination estimates likely correct?

grenaud commented 4 weeks ago

Thank you for your input! I added this question to the README: https://github.com/grenaud/schmutzi/

I am adding the response here:

I am getting a very high contamination rate (98%-99%), why?

This is likely a fluke. First, make sure that you are not running with a prediction of the contaminant (i.e. use the option --notusepredC); using the contaminant prediction only works if you have very high contamination rates. Second, make sure that you have sufficient coverage (at least 10-15X on the mitochondrial genome). If not, the endogenous consensus cannot be inferred properly, whereas the contaminant is resolved almost perfectly thanks to the allele frequency database, so the most likely explanation becomes that everything is contaminated. We strongly suggest disregarding the contamination estimates for samples below 10X and doubting those between 10X and 20X.
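The rule of thumb above can be written down as a small advisory helper. This is not part of schmutzi; the function name is hypothetical and the 10X/20X cutoffs are the rules of thumb from the answer, not hard limits:

```python
def contamination_estimate_confidence(mean_depth_mt):
    """Advisory confidence tier for a schmutzi contamination estimate,
    based on mean depth on the mitochondrial genome.
    Thresholds follow the FAQ answer: disregard below 10X,
    doubt between 10X and 20X, otherwise treat as usable."""
    if mean_depth_mt < 10:
        return "disregard"
    if mean_depth_mt < 20:
        return "doubtful"
    return "usable"

# E.g. a 3.3X sample falls in the "disregard" tier,
# while a 1695X sample is "usable".
```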

lnpott9 commented 4 weeks ago

Thank you! Just out of curiosity, does using deduplicated vs rescaled input files have any impact on this? I've run schmutzi on both types of input files for the same samples, and none of the deduplicated files had high contamination estimates despite having the same coverage.

grenaud commented 3 weeks ago

Deduplicated should technically be fine; however, I do not know a single tool that deduplicates mitochondrial reads in a mathematically sensible way, in the sense that you might pick up the same molecule again by chance (especially when your genome is only 16 kb long), you might overwrite deamination by calling a consensus, etc. Unless the duplication rates are egregious, I would not even bother with it.

Rescaled I would not use: schmutzi relies on quality scores to determine whether a substitution is likely to be a sequencing error or deamination, and rescaling alters those scores.