What do you think about putting the two main rules into a rule group? https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#defining-groups-for-execution
I haven't worked with job groups, and I'm a bit hesitant to add extra complexity. Since our primary use case is HPC with a shared file system, I don't think we need to optimize here. It would totally make sense on AWS or similar, to avoid transferring the fastqs to a new VM. Feel free to add it as a milestone for version 2.0.
Job groups aren't necessary; for our purposes they would just bundle rules into the same LSF job to reduce scheduler overhead, which could be helpful when the cluster is busy.
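For context, the mechanism is just a shared `group:` directive across rules; a minimal sketch (rule and file names here are hypothetical, not the actual pipeline rules):

```python
# Minimal sketch of Snakemake's group directive; rule/file names are hypothetical.
# Rules that share a group name are submitted together as a single cluster
# (e.g. LSF) job per sample, reducing scheduler overhead.

rule count_reads:
    group: "downsample"
    input: "{sample}.fastq.gz"
    output: "{sample}.nreads"
    shell: "seqkit stats -T {input} | awk 'NR==2 {{print $4}}' > {output}"

rule subsample:
    group: "downsample"
    input: fq="{sample}.fastq.gz", nreads="{sample}.nreads"
    output: "{sample}.sub.fastq.gz"
    shell: "seqkit sample -p 0.1 -s 1 {input.fq} -o {output}"
```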
Feel free to merge if you think there's no issue with reusing seqkit temp files 👍
I've added an issue for the job groups. https://github.com/vdblab/vdblab-shotgun/issues/39
This adds the downsampling rule, which takes config parameters for depth and replicate. The replicate number is used to seed the downsampling, and depth is the total number of reads desired. Following seqkit's recommendation to sample a proportion rather than an exact number of reads (which would require reading the whole file into memory), the rule calculates the number of reads in the sample, computes a generous downsampling proportion, samples that proportion, and then selects the exact number of reads with `seqkit head`.
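For reference, here's a rough sketch of how that sequence of steps could look as a Snakemake rule (the rule name, file patterns, config keys, and the 10% padding factor are all illustrative assumptions, not the actual Snakefile):

```python
# Illustrative downsampling rule; names, config keys, and the padding factor
# are assumptions for the sketch, not taken from the pipeline.
rule downsample:
    input:
        "{sample}_R1.fastq.gz",
    output:
        "{sample}_R1.downsampled.fastq.gz",
    params:
        depth=config["downsample_depth"],     # total reads desired (hypothetical key)
        seed=config["downsample_replicate"],  # replicate number seeds the RNG (hypothetical key)
    shell:
        """
        # Count reads without reading the whole file into memory
        total=$(seqkit stats -T {input} | awk 'NR==2 {{print $4}}')
        # Generous proportion (10% padding) so sampling yields at least `depth`
        # reads; capped at 1 in case depth exceeds the read count
        prop=$(awk -v d={params.depth} -v t=$total 'BEGIN {{p=(d/t)*1.1; if (p>1) p=1; printf "%.6f", p}}')
        # Proportion sampling streams the file; seqkit head trims to the exact count
        seqkit sample -p $prop -s {params.seed} {input} | seqkit head -n {params.depth} -o {output}
        """
```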