rotary-genomics / rotary

Assembly/annotation workflow for Nanopore-based microbial genome data containing circular DNA elements
BSD 3-Clause "New" or "Revised" License
2 stars 1 forks source link

Short read QC #75

Closed jmtsuji closed 9 months ago

jmtsuji commented 10 months ago

This code adds basic short read QC functionality to rotary to address #38. Summary of changes:

I did not add good logging for QC stats. The user needs to manually look at log files to understand what happened during QC. In future, it could be nice to add a rule that parses the log files to summarize QC results.

Lastly, the current code passes --dry-run on my machine but has only been tested in smaller chunks with real data. A proper end-to-end test will be needed at some point (maybe once everything is merged into #71 ?).

@LeeBergstrand Mind adding your review? It looks like there will be no merge conflicts with improve_multi_sample_pathing (the branch that this branch was derived from), which is good news! Thanks.

LeeBergstrand commented 10 months ago

@jmtsuji This looks good so far. I left a few comments.

jmtsuji commented 10 months ago

@LeeBergstrand Addressed your comments -- mind taking another look?

Also, what are your thoughts on the following (pasted from the PR description above)?

  • TODO: the adapters file from ATLAS is directly downloaded from Zenodo - do we want to make and host our own file?
  • Note that adapter trimming and quality trimming are currently non-optional. We can add the ability to toggle these on/off if you would like.
  • I set a relatively aggressive q-trim threshold of Q20 [as default]
LeeBergstrand commented 10 months ago

@jmtsuji

  • TODO: the adapters file from ATLAS is directly downloaded from Zenodo - do we want to make and host our own file?

We could put this down as a long-term TODO. I wonder if the archive will go down or if ATLAS will stop using this URL.

  • Note that adapter trimming and quality trimming are currently non-optional. We can add the ability to toggle these on/off if you would like.

This should be implemented. However, I would like to merge the QC code into develop beforehand.

  • I set a relatively aggressive q-trim threshold of Q20 [as default]

This is fine for Illumina, which should have reasonable error rates. We should have more stringent cutoffs since we are polishing with the short reads.

jmtsuji commented 10 months ago

@LeeBergstrand

We could put this down as a long-term TODO. I wonder if the archive will go down or if ATLAS will stop using this URL.

Addressed - the code now pulls from a rotary-specific data repo that contains the adapters file: https://zenodo.org/records/10087395 Attribution is given to the ATLAS repo.

jmtsuji commented 10 months ago

This should be implemented. However, I would like to merge the QC code into develop beforehand.

OK, sounds good.

This is fine for Illumina, which should have reasonable error rates. We should have more stringent cutoffs since we are polishing with the short reads.

Great, let's proceed with Q20 for now, then.

jmtsuji commented 10 months ago

Note that adapter trimming and quality trimming are currently non-optional. We can add the ability to toggle these on/off if you would like.

This should be implemented. However, I would like to merge the QC code into develop beforehand.

@LeeBergstrand FYI, I had a bit of extra time, so I went ahead and added the ability to toggle adapter and quality trimming on/off into the code. Basically:

LeeBergstrand commented 10 months ago

Improve by putting contaminant NCBI IDs in the config and also a list of paths to custom contaminant FASTA. As discussed below:

image

jmtsuji commented 9 months ago

Improve by putting contaminant NCBI IDs in the config and also a list of paths to custom contaminant FASTA.

Done! Let me know your thoughts on the updated code.

Aside from addressing the core changes you requested, some other things were also added:

LeeBergstrand commented 9 months ago

@jmtsuji I'm working on an elegant solution to the short-read contaminant set up. It should address most of my review comments. See: https://github.com/jmtsuji/rotary/pull/82

jmtsuji commented 9 months ago

@LeeBergstrand Thanks for the proposed revision! I just added my comments at #82

jmtsuji commented 9 months ago

@LeeBergstrand I think this PR is only waiting on #82, correct? If so, then feel free to merge this PR after finishing the #82 merge. Otherwise, feel free to keep this PR open if you'd like to do another code review.