rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License

Add automated Illumina QC (similar to ATLAS). #38

Closed LeeBergstrand closed 11 months ago

LeeBergstrand commented 1 year ago

I don't think we can port the ATLAS QC over directly. However, we can at least copy some of the rules and modify them to be simpler in terms of the variables they call.

jmtsuji commented 1 year ago

OK, seems reasonable to make our own basic QC workflow. I guess the basic QC workflow would include the following (inspired by ATLAS)?

I think the following steps are also performed by ATLAS, but they might not be needed in our case (because genome data is simpler than metagenome data, and we won't perform de novo assembly of short reads):

Lastly, I don't think we need to perform error correction of short reads or paired-end read merging, given that the short reads will only be used for polishing the long read-derived assembly.

How does that sound?

LeeBergstrand commented 1 year ago

@jmtsuji I agree with your assessment. We may also want to apply the decontamination step to the long reads.

In Atlas you can actually specify files to filter by, for example human genomes: https://github.com/metagenome-atlas/atlas/blob/2871b8a1d6093b4caf1cad03145bcd3daa001f71/docs/advanced/qc.rst

LeeBergstrand commented 1 year ago

Host contamination is becoming a problem. So many database genomes have human host genes in them that it is starting to ruin some short-read taxonomy assignment algorithms. They start classifying host genes as bacterial because there's a close match to human genes found in contaminated reference genomes.

https://twitter.com/profbootyphd/status/1688386329405505536?s=61&t=dJHfNw7Lsr9K6D3WRAk6qA

In response to: https://www.nature.com/articles/s41586-020-2095-1

jmtsuji commented 1 year ago

Sounds good to allow for multiple references for decontamination. Yeah, I've also heard about the human DNA contamination issue... very bad for short read taxonomy algorithms indeed...
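For illustration only, here is a minimal Python sketch of what filtering against multiple decontamination references could look like. It is not taken from rotary or ATLAS. It assumes that some upstream mapping step has already produced, for each reference, the set of read IDs that matched it. The function names (`parse_fastq`, `contaminant_ids`, `decontaminate`) and the reference labels (`human`, `phix`) are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Set, Tuple

# (read_id, sequence, quality)
FastqRecord = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Parse simple 4-line FASTQ records (assumes no wrapped or empty sequences)."""
    cleaned = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(cleaned) - 3, 4):
        header, seq, _plus, qual = cleaned[i:i + 4]
        # The read ID is the first whitespace-delimited token after '@'.
        yield header[1:].split()[0], seq, qual


def contaminant_ids(hits_by_reference: Dict[str, Set[str]]) -> Set[str]:
    """Union of the read IDs flagged against any of the references."""
    flagged: Set[str] = set()
    for read_ids in hits_by_reference.values():
        flagged |= read_ids
    return flagged


def decontaminate(
    records: Iterable[FastqRecord],
    hits_by_reference: Dict[str, Set[str]],
) -> List[FastqRecord]:
    """Keep only the reads that did not match any contaminant reference."""
    flagged = contaminant_ids(hits_by_reference)
    return [record for record in records if record[0] not in flagged]


# Toy data: three reads, two of which were flagged by (hypothetical) references.
fastq_text = (
    "@r1 extra\nACGT\n+\nIIII\n"
    "@r2\nGGCC\n+\nIIII\n"
    "@r3\nTTAA\n+\nIIII\n"
)
hits = {"human": {"r2"}, "phix": {"r3"}}
kept = decontaminate(parse_fastq(fastq_text.splitlines()), hits)
```

Keeping the hits keyed by reference name means that per-reference counts could be reported later, while the filter itself only needs the union of the flagged IDs.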

(As a side note, one positive thing about rotary is that the majority of the output genomes will likely be complete, which minimizes the risks of human DNA contamination)

Not sure of the best tool for long read decontamination -- the high error rates in the raw long reads could be an issue. If you find a tool for this that looks reasonable, then yes, feel free to add it to the workflow.

Regarding long read QC, we might also consider adding adapter trimming. Several groups still seem to be using Porechop for this, even though this tool is officially unsupported. Apparently adapters can persist in the Nanopore data even though I think the basecaller tries to filter them out.

LeeBergstrand commented 11 months ago

Merged in with https://github.com/jmtsuji/rotary/pull/77