FileExistsError in `medaka smolecule` upon execution of the pipeline with the example data

camcl opened 2 months ago

camcl commented 2 months ago


I am interested in using your pipeline with UMI-tagged ONT long-read data generated by colleagues in my lab. To start, I tried to run the pipeline on the example data provided in the repository, following the instructions in the README:

  1. Clone the repository: git clone

  2. Navigate to the cloned repository and finish the configuration and installation. I used the latest miniconda3:

    cd pipeline-umi-amplicon
    conda env create -f environment.yml
    conda activate pipeline-umi-amplicon
    cd lib && pip install . && cd ..

    This ran without error and I have the following components in the conda environment:

  3. Testing the installation with `snakemake -j 1 -pr --configfile config.yml` does not produce any error:

    Targets: EGFR_917
    Building DAG of jobs...
    Using shell: /usr/bin/bash
    Provided cores: 1 (use --cores to define parallelism)
    Rules claiming more threads will be scaled down.
    Job stats:
    job                   count
    ------------------  -------
    copy_bed                  1
    reads                     1
    seqkit_bam_acc_tsv        1
    total                     3

    Select jobs to execute...

    [Tue Sep 17 16:20:28 2024]
    rule copy_bed:
        input: data/example_egfr_amplicon.bed
        output: example_egfr_single_read_run/targets.bed
        jobid: 1
        reason: Missing output files: example_egfr_single_read_run/targets.bed
        wildcards: name=example_egfr_single_read_run
        resources: tmpdir=/tmp

    cp data/example_egfr_amplicon.bed example_egfr_single_read_run/targets.bed
    [Tue Sep 17 16:20:28 2024]
    Finished job 1.
    1 of 3 steps (33%) done
    Select jobs to execute...

    [Tue Sep 17 16:20:28 2024]
    rule seqkit_bam_acc_tsv:
        input: example_egfr_single_read_run/align/EGFR_917_consensus.bam
        output: example_egfr_single_read_run/stats/EGFR_917_consensus_size_vs_acc.tsv
        jobid: 13
        reason: Missing output files: example_egfr_single_read_run/stats/EGFR_917_consensus_size_vs_acc.tsv
        wildcards: name=example_egfr_single_read_run, target=EGFR_917, stage=consensus
        resources: tmpdir=/tmp

    echo -e "Read   Cluster_size    Ref MapQual Acc ReadLen RefLen  RefAln  RefCov  ReadAln ReadCov Strand  MeanQual    LeftClip    RightClip   Flags   IsSec   IsSup" > example_egfr_single_read_run/stats/EGFR_917_consensus_size_vs_acc.tsv && seqkit bam example_egfr_single_read_run/align/EGFR_917_consensus.bam 2>&1 | sed 's/_/ /' | tail -n +2 >> example_egfr_single_read_run/stats/EGFR_917_consensus_size_vs_acc.tsv

    [Tue Sep 17 16:20:29 2024]
    Finished job 13.
    2 of 3 steps (67%) done
    Select jobs to execute...

    [Tue Sep 17 16:20:29 2024]
    localrule reads:
        input: example_egfr_single_read_run/targets.bed, example_egfr_single_read_run/align/EGFR_917_final.bam.bai, example_egfr_single_read_run/stats/EGFR_917_vsearch_cluster_stats.tsv, example_egfr_single_read_run/stats/EGFR_917_consensus_size_vs_acc.tsv
        jobid: 0
        reason: Input files updated by another job: example_egfr_single_read_run/stats/EGFR_917_consensus_size_vs_acc.tsv, example_egfr_single_read_run/targets.bed
        resources: tmpdir=/tmp

    [Tue Sep 17 16:20:29 2024]
    Finished job 0.
    3 of 3 steps (100%) done
    Complete log: .snakemake/log/2024-09-17T162028.348925.snakemake.log

  4. Without editing anything in `config.yml`, I ran `snakemake -j 30 reads --configfile config.yml`. Every step up to the rule `polish_clusters` completes, but execution terminates during polishing with the following output:

    [Tue Sep 17 16:42:44 2024]
    Error in rule polish_clusters:
        jobid: 6
        input: example_egfr_single_read_run/clustering/EGFR_917/clusters_fa, example_egfr_single_read_run/clustering/EGFR_917/smolecule_clusters.fa
        output: example_egfr_single_read_run/fasta/EGFR_917_consensus_tmp, example_egfr_single_read_run/fasta/EGFR_917_consensus.bam, example_egfr_single_read_run/fasta/EGFR_917_consensus.fasta
        shell:
            rm -rf example_egfr_single_read_run/fasta/EGFR_917_consensus_tmp
            medaka smolecule --threads 30 --length 50 --depth 2 --model r941_min_high_g360 --method spoa example_egfr_single_read_run/clustering/EGFR_917/smolecule_clusters.fa example_egfr_single_read_run/fasta/EGFR_917_consensus_tmp 2> example_egfr_single_read_run/fasta/EGFR_917_consensus.bam_smolecule.log
            cp example_egfr_single_read_run/fasta/EGFR_917_consensus_tmp/consensus.fasta example_egfr_single_read_run/fasta/EGFR_917_consensus.fasta
            cp example_egfr_single_read_run/fasta/EGFR_917_consensus_tmp/subreads_to_spoa.bam example_egfr_single_read_run/fasta/EGFR_917_consensus.bam && cp example_egfr_single_read_run/fasta/EGFR_917_consensus_tmp/subreads_to_spoa.bam.bai example_egfr_single_read_run/fasta/EGFR_917_consensus.bam.bai
            (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

    Shutting down, this might take some time.
    Exiting because a job execution failed. Look above for error message
    Complete log: .snakemake/log/2024-09-17T164240.824646.snakemake.log
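
A note on the strict-mode message above: as far as I understand it, Snakemake runs each `shell` block under `set -euo pipefail`, so the first failing command aborts the block and the later `cp` commands never run. Here is a minimal Python sketch of that behaviour, assuming `bash` is on the `PATH`. The three-command script is a hypothetical stand-in for the rule's shell block, not the pipeline's own code:

```python
import subprocess

# Hypothetical stand-in for the rule's shell block: the second command fails,
# much like `medaka smolecule` does in the log above.
script = "echo first; false; echo third"

# Assumption: "bash strict mode" means `set -euo pipefail` is prepended.
strict = subprocess.run(
    ["bash", "-c", "set -euo pipefail; " + script],
    capture_output=True,
    text=True,
)

# The same commands without strict mode, for comparison.
lenient = subprocess.run(
    ["bash", "-c", script],
    capture_output=True,
    text=True,
)

print("strict :", strict.returncode, repr(strict.stdout))
print("lenient:", lenient.returncode, repr(lenient.stdout))
```

Under strict mode the block stops at the failing command and returns its non-zero exit code, which would explain why none of the `cp` steps produced output files.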

The contents of the file `example_egfr_single_read_run/fasta/EGFR_917_consensus.bam_smolecule.log` provide more information about the error:

    Traceback (most recent call last):
      File "~/miniconda3/envs/pipeline-umi-amplicon/bin/medaka", line 11, in <module>
        sys.exit(main())
      File "~/miniconda3/envs/pipeline-umi-amplicon/lib/python3.8/site-packages/medaka/", line 814, in main
        args.func(args)
      File "~/miniconda3/envs/pipeline-umi-amplicon/lib/python3.8/site-packages/medaka/", line 429, in main
        medaka.common.mkdir_p(args.output, info='Results will be overwritten.')
      File "~/miniconda3/envs/pipeline-umi-amplicon/lib/python3.8/site-packages/medaka/", line 763, in mkdir_p
        os.makedirs(path)
      File "~/miniconda3/envs/pipeline-umi-amplicon/lib/python3.8/", line 223, in makedirs
        mkdir(name, mode)
    FileExistsError: [Errno 17] File exists: 'example_egfr_single_read_run/clustering/EGFR_917/smolecule_clusters.fa'
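
From the traceback, it looks as though `medaka smolecule` is treating the input FASTA `smolecule_clusters.fa` as its output directory, so `os.makedirs` fails because a file already exists at that path. Below is a minimal sketch of that failure mode using only the Python standard library. The helper `try_make_output_dir` and the temporary files are hypothetical stand-ins, not medaka code:

```python
import os
import tempfile


def try_make_output_dir(path):
    """Try to create `path` as a directory, as os.makedirs does in the traceback.

    Hypothetical helper for illustration only; it is not part of medaka.
    Returns None on success, or the caught FileExistsError otherwise.
    """
    try:
        os.makedirs(path)
    except FileExistsError as err:
        return err
    return None


with tempfile.TemporaryDirectory() as workdir:
    # Stand-in for the existing input file smolecule_clusters.fa.
    fasta_path = os.path.join(workdir, "smolecule_clusters.fa")
    with open(fasta_path, "w") as handle:
        handle.write(">cluster_1\nACGT\n")

    # Stand-in for the intended output directory, which does not exist yet.
    out_dir = os.path.join(workdir, "EGFR_917_consensus_tmp")

    error_on_fasta = try_make_output_dir(fasta_path)
    error_on_out_dir = try_make_output_dir(out_dir)
    out_dir_created = os.path.isdir(out_dir)

print("existing file    :", repr(error_on_fasta))
print("fresh output dir :", error_on_out_dir, out_dir_created)
```

If this reading is right, the two positional arguments in the rule's `medaka smolecule` call may be in a different order from the one the installed medaka version expects, but I have not confirmed that.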

What have I done wrong?


Camille C.
camcl commented 2 months ago

Fixes here: