naobservatory / mgs-workflow

MIT License

Adding `addcount=t` to clumpify call in dedup step #26

Closed: mikemc closed this 1 week ago

mikemc commented 3 months ago

Picking up on a discussion in Twist: clumpify has the option of tracking the original count associated with each deduplicated sequence. Doing so is useful for downstream analysis, for example computing duplication rates.

The change to the workflow is simply to add `addcount=t` to https://github.com/naobservatory/mgs-workflow/blob/8cae2dca2574818797158e9baeace7a99f0c698c/workflows/main.nf#L576, with the new line being

```
par="reorder dedupe containment addcount=t t=!{task.cpus} -Xmx30g"
```

The counts are added to the FASTQ headers and look like

```
@VH01619:51:AAFNF5JM5:1:1101:42279:5581:CGATTGGCT 1:N:0:TACGCTAC+CGTGTGAT copies=2
```

For reads that had no duplicates, the header simply omits the `copies=` field and keeps the original read header line. I'm not sure whether 2 is the original abundance of this sequence (so that reads without a `copies=` field effectively have `copies=1`), or whether it is the number of dropped duplicate copies. The docs say only "Append the number of copies to the read name."
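As a sketch of how a downstream step might consume these headers (a hypothetical helper, not part of the pipeline), the count can be parsed with a regex, treating a missing `copies=` field as a singleton per the interpretation above:

```python
import re

def read_copies(header: str) -> int:
    """Extract the clumpify 'copies=N' count from a FASTQ header line.

    Assumption (per the discussion above): a header lacking the field
    is treated as having exactly one copy.
    """
    m = re.search(r"copies=(\d+)", header)
    return int(m.group(1)) if m else 1

# A deduplicated read with two original copies:
read_copies("@VH01619:51:AAFNF5JM5:1:1101:42279:5581:CGATTGGCT 1:N:0:TACGCTAC+CGTGTGAT copies=2")  # 2
```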

The question is whether this is something we should do by default, support as an option in the pipeline, or simply leave as the current behavior. Currently the pipeline does not use these counts, and the deduped files are not even put in the results folder, which suggests there's no reason to add the counts by default. On the other hand, we very plausibly will want to use these counts in the pipeline (e.g. for computing duplication rates in rRNA and non-rRNA reads), and adding them shouldn't significantly increase file sizes.
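To illustrate the duplication-rate use case, here is a sketch of how the rate could be computed from a clumpify-deduped FASTQ. This is a hypothetical standalone script, not pipeline code, and it assumes (unconfirmed above) that `copies=N` records the total original count, with absent fields meaning one copy:

```python
import gzip
import re

def duplication_rate(fastq_path: str) -> float:
    """Estimate the duplication rate from a deduped FASTQ.

    Assumes 'copies=N' in a header means N original reads collapsed
    into one, and that reads without the field had exactly one copy.
    Returns 1 - (deduped reads / original reads).
    """
    kept = 0    # reads remaining after dedup
    total = 0   # inferred original read count
    opener = gzip.open if fastq_path.endswith(".gz") else open
    with opener(fastq_path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 0:  # header line of each 4-line FASTQ record
                m = re.search(r"copies=(\d+)", line)
                total += int(m.group(1)) if m else 1
                kept += 1
    return 1 - kept / total if total else 0.0
```

For example, a file with one read carrying `copies=3` and one singleton would give 2 kept reads out of 4 originals, i.e. a duplication rate of 0.5.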

If we keep the current behavior, we could add a comment in the 'DEDUP_CLUMPIFY' process or add a note to a 'Tips and tricks' wiki page or similar to document this feature.

willbradshaw commented 3 months ago

Yeah, happy to add this. I agree we probably should be making use of this information somehow.

mikemc commented 3 months ago

@willbradshaw checking if there is anything I can do to help get this implemented before we re-run the pipeline on new datasets (so that we have access to this info in the results). I'm guessing you want to test that it doesn't break anything downstream? Perhaps we could implement a test based on the test dataset and a truncated set of reads, where we confirm that the pipeline runs with no errors and that the stats are identical.

willbradshaw commented 3 months ago

I agree it would be good to implement this soon. Would you be up for making a branch and working on this?


mikemc commented 2 months ago

Yes, I'll try this week. The main potential blocker is that I need to get going with AWS Batch first so that I can run the entire pipeline (currently I've just been running through the cleaning stages).

willbradshaw commented 1 week ago

This was implemented in v2.3.0.