rvalieris / parallel-fastq-dump

parallel fastq-dump wrapper
MIT License

not parallelizing with snakemake #56

Open Fadwa7 opened 5 months ago

Fadwa7 commented 5 months ago

Hi, I am working on a pipeline that downloads SRR files, and to optimize the download time, I opted for your tool. However, I am encountering an issue. I have a list of 143 files, and when I run this command:

```python
rule fetch_fastq:
    output:
        config["RESULTS"] + "Fastq_Files/{sra}.fastq.gz"
    log:
        config["RESULTS"] + "Supplementary_Data/Logs/{sra}.sratoolkit.log"
    benchmark:
        config["RESULTS"] + "Supplementary_Data/Benchmark/{sra}.sratoolkit.txt"
    message:
        "fetch fastq from NCBI"
    params:
        conda = "sratoolkit",
        outdir = config["RESULTS"] + "Fastq_Files"
    threads: 16
    shell:
        """
        set +eu &&
        . $(conda info --base)/etc/profile.d/conda.sh &&
        conda activate {params.conda}
        parallel-fastq-dump \
            --outdir {params.outdir} \
            --gzip \
            --sra-id {wildcards.sra} \
            --threads {threads}
        """
```

When I launch Snakemake with `snakemake -s snakefile --cores 4`, it processes all files in batches of 4 until the first rule has finished for every file, and only then moves on to the second rule. However, I want it to run all the rules in the Snakefile on the first 4 files, then move to the next 4 files, and so on.

Do you have any solutions? Thank you in advance.

rvalieris commented 5 months ago

> However, I want it to execute all rules in the Snakefile on the first 4 files, then move to the next 4 files, and so on.

Hello, yeah, I've been through this before. Basically you want Snakemake to execute the DAG depth-first; here are some links to read more:

- https://github.com/snakemake/snakemake/issues/2595
- https://stackoverflow.com/questions/67332350/snakemake-priorities-that-one-sample-finishes-before-next-starts
- https://stackoverflow.com/questions/64173399/snakemake-tranverse-dag-depth-first
- https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#priorities
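As a sketch of one of the approaches from the links above (rule priorities, per the Snakemake docs): giving the downstream rule a higher `priority` than the fetch rule nudges the scheduler toward finishing started samples before downloading new ones. The rule names, paths, and priority values below are placeholders, not taken from your pipeline, and this biases scheduling rather than enforcing strict depth-first order:

```python
rule process_fastq:              # placeholder for your downstream rule
    priority: 10                 # higher value = scheduled first when jobs compete
    input:
        "Fastq_Files/{sra}.fastq.gz"
    output:
        "Processed/{sra}.done"
    shell:
        "touch {output}"

rule fetch_fastq:
    priority: 1                  # lower value: fetch new samples only when little else is pending
    output:
        "Fastq_Files/{sra}.fastq.gz"
    threads: 16
    shell:
        "parallel-fastq-dump --outdir Fastq_Files --gzip "
        "--sra-id {wildcards.sra} --threads {threads}"
```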

Fadwa7 commented 5 months ago

Thank you so much. Can you please explain why parallel-fastq-dump uses the number of cores I pass on the command line (`snakemake -s snakefile --cores`) instead of the number of threads I set in the rule? For example, if I set 16 threads in the `fetch_fastq` rule and run `snakemake -s snakefile --cores 2`, does parallel-fastq-dump split the SRA into 2 chunks?

Thank you in advance

rvalieris commented 5 months ago

Yeah, the number of threads you set on the rule is just a maximum: since you passed `--cores 4` on the command line, Snakemake caps the threads each rule can use at 4. Read more here:

https://snakemake.readthedocs.io/en/latest/snakefiles/rules.html#threads
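As a rough illustration of that capping behavior (a simplified model of what the docs describe, not Snakemake's actual scheduler code):

```python
# Simplified model: a rule's effective thread count under `snakemake --cores N`
# is the rule's declared threads, capped at the total cores given on the CLI.
def effective_threads(rule_threads: int, total_cores: int) -> int:
    """Threads a rule actually receives when scheduled."""
    return min(rule_threads, total_cores)

print(effective_threads(16, 2))   # rule declares 16, but --cores 2 caps it at 2
print(effective_threads(16, 32))  # with enough cores, the full 16 are used
```

So with `--cores 2`, parallel-fastq-dump is invoked with `--threads 2` and splits the work accordingly.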

Fadwa7 commented 5 months ago

Thank you so much