Closed: mikemc closed this issue 1 week ago
Yeah, happy to add this. I agree we probably should be making use of this information somehow.
@willbradshaw checking if there is anything I can do to help get this implemented before we re-run the pipeline on new datasets (so that we have access to this info in the results). I'm guessing you want to test that it doesn't break anything downstream? Perhaps we can implement a test based on the test dataset and a truncated set of reads, where we confirm that the pipeline runs with no errors and the stats are identical.
I agree it would be good to implement this soon. Would you be up for making a branch and working on this?
Yes, I'll try this week. The main potential blocker is that I need to get going with AWS Batch first so that I can run the entire pipeline (currently I've just been running through the cleaning stages).
This was implemented in v2.3.0.
Picking up on a discussion in Twist: clumpify has an option to track the original count associated with each deduplicated sequence. Doing so is useful for downstream analysis, for example for estimating duplication rates.
The change to the workflow is simply to add
addcount=t
to https://github.com/naobservatory/mgs-workflow/blob/8cae2dca2574818797158e9baeace7a99f0c698c/workflows/main.nf#L576. The counts are then added to the fastq headers, appearing as a `copies=N` tag appended to the read name.
For reads that did not have duplicates, the header simply omits the "copies=2" part and keeps the original read header line. I'm not sure whether 2 is the original abundance of the sequence (so that reads without a "copies=" tag effectively have copies=1), or whether it counts only the dropped duplicate copies. The docs just say "Append the number of copies to the read name."
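For what it's worth, a downstream consumer could recover these counts with a small helper like the sketch below. This is a hypothetical snippet, not part of the pipeline; it assumes the first interpretation above, i.e. that `copies=N` gives the original abundance and that reads without the tag represent a single original read.

```python
import re

# Matches the tag clumpify appends with addcount=t, e.g. "copies=2".
COPIES_RE = re.compile(r"copies=(\d+)")

def copies_from_header(header: str) -> int:
    """Return the original read count encoded in a FASTQ header line.

    Assumes addcount=t appends "copies=N" for duplicated reads and
    leaves unique reads' headers unchanged (treated as copies=1).
    """
    m = COPIES_RE.search(header)
    return int(m.group(1)) if m else 1
```

For example, `copies_from_header("@read1 copies=2")` returns 2, while a plain header like `@read2` returns 1.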
The question is whether this is something we should do by default, support as an option in the pipeline, or simply leave as the current behavior. Currently the pipeline does not use these counts, and the deduped files are not even put in the results folder, which suggests there's no reason to add the counts by default. On the other hand, we very plausibly will want to use these counts in the pipeline (e.g. for computing duplication rates in rRNA and non-rRNA), and adding them shouldn't significantly increase file sizes.
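To illustrate the duplication-rate use case: given the per-read copy counts extracted from a deduped file (again assuming untagged reads represent one original read), the rate falls out of simple arithmetic. This is an illustrative sketch, not pipeline code.

```python
def duplication_rate(copy_counts: list[int]) -> float:
    """Fraction of original reads that were removed as duplicates.

    copy_counts has one entry per deduplicated read, giving the number
    of original reads it represents (1 for reads with no duplicates).
    """
    total_original = sum(copy_counts)
    if total_original == 0:
        return 0.0
    return 1 - len(copy_counts) / total_original
```

For example, counts of [2, 1, 1] mean 4 original reads collapsed to 3, a duplication rate of 0.25.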
If we keep the current behavior, we could add a comment in the 'DEDUP_CLUMPIFY' process or add a note to a 'Tips and tricks' wiki page or similar to document this feature.