qiime2 / q2-dada2

QIIME 2 plugin wrapping DADA2
BSD 3-Clause "New" or "Revised" License
19 stars 36 forks source link

add support for Pacbio CCS reads #135

Closed sixvable closed 2 years ago

sixvable commented 3 years ago

Add support for Pacbio CCS reads.

101

Test on qiime2-2020.11.

benjjneb commented 3 years ago

Thanks for working on this @sixvable

I skimmed the code and two questions come to mind immediately.

1) Are the fraction of reads that pass primer removal included in the final tracking data.frame that can be reported out to the Q2 user? This is important information.

2) This is an entirely new script, but one that heavily resembles denoised-single. Long-term, does it make more sense to make a new script like this, or to leverage denoise-single with the differences fed in as options to that script?

sixvable commented 3 years ago

Thank you @benjjneb

  1. Are the fraction of reads that pass primer removal included in the final tracking data.frame that can be reported out to the Q2 user? This is important information.

Yes, the number and percentage of the reads that pass the primer remove step have been include in the track file, and can be check in the final denoising_stats.qzv

2. This is an entirely new script, but one that heavily resembles denoised-single. Long-term, does it make more sense to make a new script like this, or to leverage denoise-single with the differences fed in as options to that script?

Since there is a new primer remove step before the trim and filter step in the denoise-ccs function, I had to re-arrange the sort numbers of the entire run_dada_single.R script. I also exposed 3 new parameters in the new run_dada_ccs.R script: maxMismatch, indels, and minLen. And I also change the default value of minQ. If we directly put this part into run_dada_single.R, we may need to give more ifelse code and the script could also become less readable or maintainable.

benjjneb commented 3 years ago

Yes, the number and percentage of the reads that pass the primer remove step have been include in the track file, and can be check in the final denoising_stats.qzv

Nice!

Since there is a new primer remove step before the trim and filter step in the denoise-ccs function, I had to re-arrange the sort numbers of the entire run_dada_single.R script. I also exposed 3 new parameters in the new run_dada_ccs.R script: maxMismatch, indels, and minLen. And I also change the default value of minQ. If we directly put this part into run_dada_single.R, we may need to give more ifelse code and the script could also become less readable or maintainable.

That seems reasonable. I just wanted to bring this issue up, as longer term maintainability is a major issue (although I'm sure that's not news to the Q2 team).

benjjneb commented 3 years ago

Another Q: Have you identified a PacBio CCS dataset for the integrated testing?

sixvable commented 3 years ago

Another Q: Have you identified a PacBio CCS dataset for the integrated testing?

There are not many benchmarks on PacBio CCS dataset using DADA2 so I only test several datasets in your paper. The q2 version gives the exactly same results compared to the R version. So I think it is worked.

benjjneb commented 3 years ago

That seems like good confirmation it is working as expected. For the integrated testing, it might be best to downsample a single lower-diversity sample (maybe the Zymo mock) just to keep computational expenditures lower.

sixvable commented 3 years ago

Zymo mock works well too. I will leave it to q2 team for further test.

ebolyen commented 2 years ago

@Keegan-Evans, looks like you are missing an entry in package_data in setup.py to include your test data directory.

Keegan-Evans commented 2 years ago

Basic tests and updating to current q2-dada2 version done. Data for testing is 100 sequence subsamples from the human fecal replicate 1 samples from @benjjneb's paper.