
FTP.remote is slow when building DAG #373

Open · mevers opened this issue 4 years ago

mevers commented 4 years ago

This question is a follow-up to, and a bump of, the unresolved issue #1275 on Bitbucket, which I can no longer access (old link).

Issue

Using FTP.remote as a rule input slows down DAG building. This happens on every run, regardless of whether the file has already been downloaded. Running snakemake with --debug-dag suggests the delay occurs while the DAG is being built, presumably because each FTP.remote input is checked over the network at that stage.

I'm curious to hear what the recommended way to download files through FTP is. An alternative would be to call wget or curl from a shell command, but snakemake.remote.FTP seems to be the more canonical Snakemake approach.
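
For reference, a shell-based variant of the download rule might look like the sketch below. It assumes the same url variable and os.path.join import as the minimal example that follows; the wget invocation is illustrative rather than a recommendation from this thread, and it gives up Snakemake's ability to notice upstream changes to the remote file.

rule download_file:
    # Plain shell download: the FTP server is only contacted when the
    # job actually runs, so nothing is checked during DAG building.
    output: "downloads/{file}"
    params:
        # wildcards in params strings are expanded just like in output paths
        url = join(url, "{file}")
    shell: "wget -O {output} {params.url}"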

Minimal example

Consider the following Snakefile, which downloads two chromosome sequences from Ensembl via FTP:

from os.path import join
from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider

url = "ftp://ftp.ensembl.org/pub/release-99/fasta/mus_musculus/dna"
files = [
    "Mus_musculus.GRCm38.dna.chromosome.1.fa.gz",
    "Mus_musculus.GRCm38.dna.chromosome.2.fa.gz"]

# Anonymous FTP; the Ensembl server requires no credentials
FTP = FTPRemoteProvider()

rule all:
    input: expand("downloads/{file}", file = files)

rule download_file:
    # Declaring the input via FTP.remote is what triggers the remote
    # checks during DAG building
    input: FTP.remote(join(url, "{file}"))
    output: "downloads/{file}"
    # Snakemake downloads the remote file to a local path mirroring the
    # URL; the rule then moves it into downloads/
    shell: "mv {input} {output}"

Running the minimal workflow with --debug-dag gives the following output:

snakemake --debug-dag
Building DAG of jobs...
candidate job all
    wildcards:
candidate job download_file
    wildcards: file=Mus_musculus.GRCm38.dna.chromosome.1.fa.gz
selected job download_file
    wildcards: file=Mus_musculus.GRCm38.dna.chromosome.1.fa.gz
candidate job download_file
    wildcards: file=Mus_musculus.GRCm38.dna.chromosome.2.fa.gz
selected job download_file
    wildcards: file=Mus_musculus.GRCm38.dna.chromosome.2.fa.gz
selected job all
    wildcards:
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   all
    2   download_file
    3

[Tue May  5 16:55:12 2020]
rule download_file:
    input: ftp.ensembl.org/pub/release-99/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.chromosome.1.fa.gz
    output: downloads/Mus_musculus.GRCm38.dna.chromosome.1.fa.gz
    jobid: 1
    wildcards: file=Mus_musculus.GRCm38.dna.chromosome.1.fa.gz

Downloading from remote: ftp.ensembl.org/pub/release-99/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.chromosome.1.fa.gz
Finished download.
[Tue May  5 16:55:55 2020]
Finished job 1.
1 of 3 steps (33%) done

[Tue May  5 16:55:55 2020]
rule download_file:
    input: ftp.ensembl.org/pub/release-99/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.chromosome.2.fa.gz
    output: downloads/Mus_musculus.GRCm38.dna.chromosome.2.fa.gz
    jobid: 2
    wildcards: file=Mus_musculus.GRCm38.dna.chromosome.2.fa.gz

Downloading from remote: ftp.ensembl.org/pub/release-99/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.chromosome.2.fa.gz
Finished download.
[Tue May  5 16:56:58 2020]
Finished job 2.
2 of 3 steps (67%) done

[Tue May  5 16:56:58 2020]
localrule all:
    input: downloads/Mus_musculus.GRCm38.dna.chromosome.1.fa.gz, downloads/Mus_musculus.GRCm38.dna.chromosome.2.fa.gz
    jobid: 0

[Tue May  5 16:56:58 2020]
Finished job 0.
3 of 3 steps (100%) done
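
Not part of the original report: to separate the DAG-building overhead from the downloads themselves, one way is to time a dry run, which builds the DAG without executing any jobs.

# Dry run: builds the DAG but executes no jobs, so the elapsed time
# is dominated by DAG construction (including any remote file checks)
time snakemake -n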

snakemake version

snakemake --version
5.7.1
jtpoirier commented 3 years ago

This bug also appears to affect S3.remote in version 6.0.5.

xguse commented 3 years ago

This is slightly old, but I feel I should mention that I would not set up a frequently-run job to pull data from Ensembl every time. You should pull the data once and use it often; we should try to respect their bandwidth. Perhaps pull it into an S3 bucket once and hit that repeatedly?

mevers commented 3 years ago

@xguse The whole point of incorporating a download rule into Snakemake is to be mindful of bandwidth: download only once, and again only when the upstream file changes. As such, I don't see the point of suggesting an S3 bucket; if anything, it introduces the need for, and a dependence on, yet another API-specific process.
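
As an aside on the "download once, use often" point: Snakemake's remote files documentation describes a keep_local keyword that retains the downloaded copy after the consuming job finishes, so repeated runs can reuse it. A minimal sketch follows; the downstream rule name, output path, and shell command are hypothetical placeholders, not from this thread, and this does not by itself address the DAG-building slowdown.

# Hypothetical downstream rule consuming the FTP file directly
# (assumes the FTP provider, url, and join from the minimal example above)
rule count_sequences:
    input:
        # keep_local=True keeps the downloaded copy on disk after the job,
        # so subsequent runs can reuse it instead of re-downloading
        FTP.remote(join(url, "{file}"), keep_local=True)
    output:
        "counts/{file}.txt"
    shell:
        "zcat {input} | grep -c '^>' > {output}"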