What do you think about putting the two main rules into a rule group? https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#defining-groups-for-execution
I haven't worked with job groups, and I'm a bit hesitant to add extra complexity. Since our primary use case is HPC with a shared file system, I don't think we need to optimize here. It would totally make sense on AWS or similar, to avoid transferring the fastqs to a new VM. Feel free to add it as a milestone for version 2.0.
Job groups aren't necessary; for our purposes they would just bundle rules into the same LSF job to reduce scheduler overhead, which could be helpful when the cluster is busy.
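For context, the mechanism is just a shared `group:` directive across rules; a minimal sketch (rule and file names here are hypothetical, not the actual pipeline rules):

```python
# Minimal sketch of Snakemake's group directive; rule/file names are hypothetical.
# Rules that share a group name are submitted together as a single cluster
# (e.g. LSF) job per sample, reducing scheduler overhead.

rule count_reads:
    group: "downsample"
    input: "{sample}.fastq.gz"
    output: "{sample}.nreads"
    shell: "seqkit stats -T {input} | awk 'NR==2 {{print $4}}' > {output}"

rule subsample:
    group: "downsample"
    input: fq="{sample}.fastq.gz", nreads="{sample}.nreads"
    output: "{sample}.sub.fastq.gz"
    shell: "seqkit sample -p 0.1 -s 1 {input.fq} -o {output}"
```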
Feel free to merge if you think there's no issue with reusing seqkit temp files 👍
I've added an issue for the job groups. https://github.com/vdblab/vdblab-shotgun/issues/39
This adds the downsampling rule, which takes config parameters for depth and replicate. The replicate number is used to seed the downsampling, and depth is the total number of reads desired. Following seqkit's recommendation to sample a proportion rather than an exact number of reads (which would require reading the whole file into memory), the rule calculates the number of reads in the sample, computes a generous downsampling proportion, samples that proportion, and then selects the exact number of reads with `seqkit head`.
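For reference, here's a rough sketch of how that sequence of steps could look as a Snakemake rule (the rule name, file patterns, config keys, and the 10% padding factor are all illustrative assumptions, not the actual Snakefile):

```python
# Illustrative downsampling rule; names, config keys, and the padding factor
# are assumptions for the sketch, not taken from the pipeline.
rule downsample:
    input:
        "{sample}_R1.fastq.gz",
    output:
        "{sample}_R1.downsampled.fastq.gz",
    params:
        depth=config["downsample_depth"],     # total reads desired (hypothetical key)
        seed=config["downsample_replicate"],  # replicate number seeds the RNG (hypothetical key)
    shell:
        """
        # Count reads without reading the whole file into memory
        total=$(seqkit stats -T {input} | awk 'NR==2 {{print $4}}')
        # Generous proportion (10% padding) so sampling yields at least `depth`
        # reads; capped at 1 in case depth exceeds the read count
        prop=$(awk -v d={params.depth} -v t=$total 'BEGIN {{p=(d/t)*1.1; if (p>1) p=1; printf "%.6f", p}}')
        # Proportion sampling streams the file; seqkit head trims to the exact count
        seqkit sample -p $prop -s {params.seed} {input} | seqkit head -n {params.depth} -o {output}
        """
```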