rjsorr opened this issue 2 years ago
Hi @rjsorr,
Contaminants are automatically removed when you use one or more negative controls (with the -c flag), so you don't need to manually remove contaminants when using Recentrifuge. As you see, they are removed at different taxonomic levels depending on the contaminants and on the taxonomic level being considered in the analysis. You can identify the samples with contamination removed because they contain the substring _CTRL_ in any of the different outputs that Recentrifuge provides. In some cases, the control samples are so different from the regular samples that the default values for the filters may not be the most appropriate. For such cases, you have a couple of flags in rcf to manually fine-tune the algorithm parameters:
-z NUMBER, --ctrlminscore NUMBER
minimum score/confidence of the classification of a
read in control samples to pass the quality filter; it
defaults to "minscore"
-w INT, --ctrlmintaxa INT
minimum taxa to avoid collapsing one level into the
parent in control samples (if not specified a value
will be automatically assigned)
If you think that your control samples are too noisy, you can increase the values of these parameters to reduce the chances of false positives (false contaminants detected in the negative control samples). Finally, sure, an additional, optional, separate output (beyond the console log) devoted to the contamination removal algorithm would be a welcome addition.
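Just as a sketch of how those knobs fit together (the file names and the -f/-o input/output flags below are placeholders from a typical Centrifuge-based run, so please check them against rcf --help for your version; the --ctrlminscore/--ctrlmintaxa values are arbitrary), a run with stricter control filters could look like:

    # Sketch only: NEG_CTRL.out, SAMPLE1.out, SAMPLE2.out and results.html are placeholders.
    # With -c 1, the first file listed is treated as the negative control.
    rcf -f NEG_CTRL.out -f SAMPLE1.out -f SAMPLE2.out -c 1 \
        --ctrlminscore 50 --ctrlmintaxa 10 \
        -o results.html

How high to raise those two values depends on how noisy your particular controls are.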
Sorry for the slow reply @khyox. The problem with removing read contaminants based on read classification, as I think I'm now struggling with, is the reliance on databases and their completeness. I see now that the gammaproteobacterium present as a contaminant in the negative controls is a novel/new species that cannot be classified to a lower taxonomic level. As such, its uncertain classification against current databases is causing an interpretation problem where an entire class is being flagged as a contaminant, when it is actually a single novel species with poor classification that is causing the issue. I don't see how changing the above parameters will help with this when the underlying problem is database/classification related? Maybe you have some suggestions on how to attack this, other than assembling the MAG, which I have done, and then mapping the reads back to it?
Metagenomic databases have improved a lot over time but are still very far from perfect. I would say that, if you have identified a clear problem in the DB, you can try to correct it at the source to avoid the issue, instead of having to correct it downstream. I understand that there are times when that is not so easy. If the classification is poor, as you mention, you may luckily have a low classification score for such a taxon (that's another benefit of using score-oriented classification!), so --ctrlminscore would be very helpful. In addition, if such a taxon is a minority one in the control samples, then you can set --ctrlmintaxa very low so that only the lowest possible level is flagged as a contaminant and not an upper level, minimizing the "damage" (of the DB problem upstream) by keeping it at the lowest level. Alternatively, if you used Centrifuge, you can use rextract to get the reads that were misclassified and remove them from the controls. You can also develop a small script to delete the assignments to that taxon in the results from the controls; note that if you used Recentrifuge's --exclude option you would also remove them from the regular samples, so unfortunately that's not an option in this case.
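For that last idea, a rough sketch, assuming your controls are default Centrifuge classification files (tab-separated, with the taxID in the third column; please check your own files, since other classifiers or options change the layout), would be to filter the control output before rerunning rcf:

    # Sketch only: assumes default Centrifuge output (TSV with taxID in column 3).
    # Drops every read assignment to the problematic taxid from the control file.
    BAD_TAXID=123456   # replace with the taxid that is being wrongly flagged
    awk -F'\t' -v tid="$BAD_TAXID" '$3 != tid' NEG_CTRL.out > NEG_CTRL.filtered.out

You would then pass NEG_CTRL.filtered.out to rcf as the control instead of the original file.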
Hi @khyox, I'm trying to understand/interpret the log file so that I can remove contaminants manually based on their taxids. I would like to remove those that have been flagged as "critical". However, it is difficult to understand from the attached log file which taxa these are. Searching for "critical" gives 18 hits at different taxonomic levels. However, some of these critical hits are class-level taxids (e.g. Actinobacteria and Gammaproteobacteria), and I cannot possibly imagine that these should be removed from the dataset; I'm guessing a lower taxonomic rank is actually being flagged, but, as I write, that is not easy to see. A separate contamination output file that gives a simple result for interpretation and downstream processing would be a welcome addition?
Recentrifuge17_log.txt
regards