Closed by LeeBergstrand 11 months ago
OK, seems reasonable to make our own basic QC workflow. I guess the basic QC workflow would include the following (inspired by ATLAS)?
I think the following steps are also performed by ATLAS, but they might not be needed in our case (because genome data is simpler than metagenome data, and we won't perform de novo assembly of short reads):
Lastly, I don't think we need to perform error correction of short reads or paired-end read merging, given that the short reads will only be used for polishing the long read-derived assembly.
How does that sound?
@jmtsuji I agree with your assessment. We may also apply the decontamination step to the long reads.
In Atlas you can actually specify files to filter by, for example human genomes: https://github.com/metagenome-atlas/atlas/blob/2871b8a1d6093b4caf1cad03145bcd3daa001f71/docs/advanced/qc.rst
Host contamination is becoming a real problem. So many database genomes have human host genes in them that it is starting to ruin some short-read taxonomy assignment algorithms. They start classifying host genes as bacterial because there's a close match to human genes found in contaminated reference genomes.
https://twitter.com/profbootyphd/status/1688386329405505536?s=61&t=dJHfNw7Lsr9K6D3WRAk6qA
In response to: https://www.nature.com/articles/s41586-020-2095-1
Sounds good to allow for multiple references for decontamination. Yeah, I've also heard about the human DNA contamination issue... very bad for short read taxonomy algorithms indeed...
(As a side note, one positive thing about rotary is that the majority of the output genomes will likely be complete, which minimizes the risks of human DNA contamination)
Not sure of the best tool for long read decontamination -- the high error rates in the raw long reads could be an issue. If you find a tool for this that looks reasonable, then yes, feel free to add it to the workflow.
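For what it's worth, one common way to decontaminate long reads is to map them against the contaminant references with minimap2 and keep only the unmapped reads with samtools. Below is a minimal Python sketch of how such commands could be assembled for several references. The function name, the file names, and the choice of minimap2/samtools are all assumptions for illustration, not something settled in this thread, and the `map-ont` preset assumes Nanopore reads.

```python
import shlex


def build_long_read_decontamination_commands(
    reads, references, combined_reference, output, threads=4
):
    """Build shell commands that drop long reads mapping to any contaminant.

    Hypothetical helper: all names are illustrative, and minimap2, samtools,
    and gzip are assumed to be available on the PATH.
    """
    if not references:
        raise ValueError("at least one contaminant reference is required")

    # minimap2 takes a single target file, so combine the references first.
    combine = f"cat {shlex.join(references)} > {shlex.quote(combined_reference)}"

    # Map the reads (SAM output, Nanopore preset), then keep only reads with
    # the 'unmapped' flag (0x4) set, i.e. reads that hit none of the references.
    mapping = shlex.join(
        ["minimap2", "-a", "-x", "map-ont", "-t", str(threads),
         combined_reference, reads]
    )
    keep_unmapped = shlex.join(["samtools", "fastq", "-f", "4", "-"])
    filter_reads = f"{mapping} | {keep_unmapped} | gzip > {shlex.quote(output)}"

    return [combine, filter_reads]


# Example with made-up file names.
for command in build_long_read_decontamination_commands(
    "reads.fastq.gz", ["human.fa", "phix.fa"], "contaminants.fa", "clean.fastq.gz"
):
    print(command)
```

Whether mapping-based filtering is sensitive enough given the error rates of raw long reads would still need to be checked on real data.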
Regarding long read QC, we might also consider adding adapter trimming. Several groups still seem to be using Porechop for this, even though this tool is officially unsupported. Apparently adapters can persist in the Nanopore data even though I think the basecaller tries to filter them out.
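As a rough illustration of the adapter trimming step, here is a small Python sketch that assembles a Porechop call. The helper name and file names are hypothetical; only the `-i`, `-o`, and `--threads` options of Porechop are used, and Porechop is assumed to be installed.

```python
import shlex


def build_porechop_command(input_fastq, output_fastq, threads=4):
    """Return a shell command that trims Nanopore adapters with Porechop.

    Hypothetical helper for illustration; Porechop must be on the PATH.
    """
    return shlex.join(
        ["porechop", "-i", input_fastq, "-o", output_fastq,
         "--threads", str(threads)]
    )


# Example with made-up file names.
print(build_porechop_command("raw_long.fastq.gz", "trimmed_long.fastq.gz"))
```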
Merged in with https://github.com/jmtsuji/rotary/pull/77
I don't think we can port the ATLAS QC over directly. However, we can at least copy some of the rules and modify them to use fewer variables.