khyox / recentrifuge

Recentrifuge: robust comparative analysis and contamination removal for metagenomics
http://www.recentrifuge.org

interpreting log file for contamination removal #40

Open rjsorr opened 2 years ago

rjsorr commented 2 years ago

Hi @khyox, I'm trying to understand/interpret the log file so that I can remove contaminants manually based on their taxids. I would like to remove those that have been flagged as "critical". However, it is difficult to work out from the attached log file what these are. Searching the log for "critical" (see the grep below) highlights 18 hits, at different taxonomic levels. Some of these critical hits are class-level taxids (e.g. Actinobacteria and Gammaproteobacteria), and I cannot imagine that these whole classes should be removed from the dataset; I'm guessing a lower taxonomic rank is actually being flagged, but as it stands that is not easy to see. A separate contamination output file giving a simple result for interpretation and downstream processing would be a welcome addition. Recentrifuge17_log.txt
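
For reference, this is roughly how I'm searching the attached log (just a plain grep on the console output):

  # locate and count the lines flagged as "critical" in the rcf console log
  grep -n "critical" Recentrifuge17_log.txt
  grep -c "critical" Recentrifuge17_log.txt   # 18 hits in my case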

regards

khyox commented 2 years ago

Hi @rjsorr, Contaminants are automatically removed when you use one or more negative controls (with the -c flag), so you don't need to remove contaminants manually when using Recentrifuge. As you saw, they are removed at different taxonomic levels depending on the contaminants and on the taxonomic level being considered in the analysis. You can identify the samples with contamination removed because they contain the substring _CTRL_ in any of the different outputs that Recentrifuge provides. In some cases, the control samples are so different from the regular samples that the default values for the filters may not be the most appropriate. For such cases, rcf has a couple of flags for manually fine-tuning the algorithm parameters:

  -z NUMBER, --ctrlminscore NUMBER
                        minimum score/confidence of the classification of a
                        read in control samples to pass the quality filter; it
                        defaults to "minscore"
  -w INT, --ctrlmintaxa INT
                        minimum taxa to avoid collapsing one level into the
                        parent in control samples (if not specified a value
                        will be automatically assigned)

If you think that your control samples are too noisy, you can increase the values of these parameters to reduce the chance of false positives (false contaminants detected in the negative control samples). Finally, sure, an additional, optional, separate output (beyond the console log) devoted to the contamination removal algorithm would be a welcome addition.
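
As an illustration, here is a minimal sketch of an rcf call raising both control thresholds. Only -c, -z and -w are the flags discussed above; the -n/-f/-o spellings are from memory of rcf's help and may differ in your version, so please check rcf --help before running:

  # Negative controls are listed first and counted with -c;
  # -z/-w then tighten the quality filters applied to the control samples.
  rcf -n ./taxdump \
      -f ctrl1.out -f ctrl2.out \
      -f sample1.out -f sample2.out \
      -c 2 -z 50 -w 25 \
      -o results.html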

rjsorr commented 2 years ago

Sorry for the slow reply @khyox. The problem with removing read contaminants based on read classification, as I think I'm now struggling with, is the reliance on databases and their completeness. I see now that the Gammaproteobacteria hit present as a contaminant in the negative controls is a novel species that cannot be classified to a lower taxonomic level. As such, its uncertain classification against current databases is causing an interpretation problem where an entire class is being flagged as a contaminant when the real culprit is a single, poorly classified novel species. I don't see how changing the above parameters will help when the underlying problem is database/classification related? Maybe you have some suggestions on how to attack this, other than assembling the MAG, which I have done, and then mapping the reads to it?
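
For completeness, the mapping route I had in mind is roughly the following; the tool choices here are mine and have nothing to do with Recentrifuge:

  # Map reads to the assembled MAG and collect the mapped read IDs
  minimap2 -ax sr mag_contigs.fasta reads_R1.fastq reads_R2.fastq \
      | samtools view -b -F 4 - \
      | samtools sort -o mapped.bam -
  samtools view mapped.bam | cut -f1 | sort -u > contaminant_read_ids.txt
  # then drop those IDs from the FASTQ files, e.g. with seqkit grep -v -f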

khyox commented 2 years ago

Metagenomic databases have improved a lot over time but are still far from perfect. I would say that, if you have identified a clear problem in the DB, you can try to correct it at the source instead of having to correct it downstream, although I understand that there are times when that is not so easy. If the classification is as poor as you mention, you may luckily have a low classification score for such a taxon (that's another benefit of using score-oriented classification!), so --ctrlminscore would be very helpful. In addition, if such a taxon is a minority in the control samples, you can set --ctrlmintaxa very low so that only the lowest possible level is flagged as a contaminant and not an upper level, minimizing the "damage" from the upstream DB problem. Alternatively, if you used Centrifuge, you can use rextract to get the reads that were misclassified and remove them from the controls. You can also write a small script to delete the assignments to that taxon in the results from the controls; if you used Recentrifuge's --exclude option, you would also remove them from the regular samples, so unfortunately that's not an option in this case.
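
As a rough sketch of such a script, assuming Centrifuge's tab-separated output where the taxID is the third column (1236 is the NCBI taxid for Gammaproteobacteria; replace it with the taxid actually flagged in your log):

  # Drop control-sample assignments to a given taxid, keeping the header line,
  # then rerun rcf on the cleaned control file.
  TAXID=1236
  awk -F'\t' -v tid="$TAXID" 'NR==1 || $3 != tid' ctrl.out > ctrl.clean.out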