harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

issue with test run #154

Closed vitorpavinato closed 4 months ago

vitorpavinato commented 4 months ago

Hi,

I got this error message when I tried to run the test code:

snakemake -d .test/ecoli --cores 1 --use-conda
[Tue Jan 30 16:31:01 2024]
Error in rule create_db_intervals:
    jobid: 21
    input: results/GCA_003018455.1/data/genome/GCA_003018455.1.fna, results/GCA_003018455.1/data/genome/GCA_003018455.1.fna.fai, results/GCA_003018455.1/data/genome/GCA_003018455.1.dict, results/GCA_003018455.1/intervals/master_interval_list.list
    output: results/GCA_003018455.1/intervals/db_intervals/intervals.txt, results/GCA_003018455.1/intervals/db_intervals
    log: logs/GCA_003018455.1/db_intervals/log.txt (check log file(s) for error details)
    conda-env: /fs/scratch/PAS1554/snpArcher/.test/ecoli/.snakemake/conda/ec2d2883921c842412450a0289e25d36_
    shell:

        gatk SplitIntervals -L results/GCA_003018455.1/intervals/master_interval_list.list         -O results/GCA_003018455.1/intervals/db_intervals -R results/GCA_003018455.1/data/genome/GCA_003018455.1.fna -scatter 1         -mode INTERVAL_SUBDIVISION         --interval-merging-rule OVERLAPPING_ONLY &> logs/GCA_003018455.1/db_intervals/log.txt
        ls -l results/GCA_003018455.1/intervals/db_intervals/*scattered.interval_list > results/GCA_003018455.1/intervals/db_intervals/intervals.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job create_db_intervals since they might be corrupted:
results/GCA_003018455.1/intervals/db_intervals
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-01-30T163042.968256.snakemake.log

Thanks

cademirch commented 4 months ago

Hi @vitorpavinato, sorry for the trouble running the test dataset. Could you provide your Snakemake version, as well as the contents of the log file named at the end of your post?

vitorpavinato commented 4 months ago

Hi @cademirch,

Sure, the Snakemake version is 7.32.4. Where can I find the log file?

I am using a SLURM cluster and ran the test on scratch. I tried to find the file with

find / -n 2024-01-30T163042.968256.snakemake.log 2>/dev/null

But it didn't return anything.

Thanks for the prompt response.

cademirch commented 4 months ago

The log files should be in the .test/ecoli directory within the snparcher directory.
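
A minimal sketch of how the run logs could be listed, assuming the default Snakemake layout in which `-d .test/ecoli` makes the `.snakemake/log/` path from the "Complete log:" line relative to `.test/ecoli`. The helper name `find_snakemake_logs` is hypothetical and not part of snpArcher or Snakemake. As a side note on the `find` attempt above: `find` has no `-n` test (matching by file name uses `-name`), and `2>/dev/null` would have hidden the resulting error message.

```python
from pathlib import Path


def find_snakemake_logs(workdir):
    """Return Snakemake run logs under `workdir`, newest first.

    Assumes the default layout <workdir>/.snakemake/log/*.snakemake.log.
    """
    log_dir = Path(workdir) / ".snakemake" / "log"
    # File names start with a timestamp such as 2024-01-30T163042.968256,
    # so sorting by name in reverse order puts the most recent run first.
    # Globbing a directory that does not exist simply yields nothing.
    return sorted(log_dir.glob("*.snakemake.log"), key=lambda p: p.name, reverse=True)
```

Called from the snpArcher directory as `find_snakemake_logs(".test/ecoli")`, the first element of the returned list would be the log of the most recent run.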


vitorpavinato commented 4 months ago

Great,

Here is what I got from the log file:

Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job                     count
--------------------  -------
all                         1
bam_sumstats                2
bwa_map                     1
callable_bed                2
collect_covstats            1
collect_fastp_stats         2
collect_sumstats            2
compute_d4                  1
create_cov_bed              2
create_db_intervals         2
dedup                       1
fastp                       1
format_interval_list        1
genmap                      1
index_reference             1
mappability_bed             1
merge_d4                    1
picard_intervals            1
sort_gatherVcfs             2
total                      26

Select jobs to execute...

[Tue Jan 30 16:30:52 2024]
rule fastp:
    input: data/local_fastq/my_sample1_1.fastq.gz, data/local_fastq/my_sample1_2.fastq.gz
    output: results/GCA_000008865.2/filtered_fastqs/SAMN12676327/SRR10058855_1.fastq.gz, results/GCA_000008865.2/filtered_fastqs/SAMN12676327/SRR10058855_2.fastq.gz, results/GCA_000008865.2/summary_stats/SAMN12676327/SRR10058855.fastp.out
    log: logs/GCA_000008865.2/fastp/SAMN12676327/SRR10058855.txt
    jobid: 11
    benchmark: benchmarks/GCA_000008865.2/fastp/SAMN12676327_SRR10058855.txt
    reason: Missing output files: results/GCA_000008865.2/filtered_fastqs/SAMN12676327/SRR10058855_2.fastq.gz, results/GCA_000008865.2/summary_stats/SAMN12676327/SRR10058855.fastp.out, results/GCA_000008865.2/filtered_fastqs/SAMN12676327/SRR10058855_1.fastq.gz
    wildcards: refGenome=GCA_000008865.2, sample=SAMN12676327, run=SRR10058855
    resources: tmpdir=/tmp, mem_mb=4000, mem_mib=3815

Activating conda environment: .snakemake/conda/f32d3b737d797443a34140e7912c58cd_
[Tue Jan 30 16:30:53 2024]
Finished job 11.
1 of 26 steps (4%) done
Select jobs to execute...

[Tue Jan 30 16:30:53 2024]
rule collect_fastp_stats:
    input: results/GCA_000008865.2/summary_stats/SAMN12676327/SRR10058855.fastp.out
    output: results/GCA_000008865.2/summary_stats/SAMN12676327_fastp.out
    jobid: 12
    reason: Missing output files: results/GCA_000008865.2/summary_stats/SAMN12676327_fastp.out; Input files updated by another job: results/GCA_000008865.2/summary_stats/SAMN12676327/SRR10058855.fastp.out
    wildcards: refGenome=GCA_000008865.2, sample=SAMN12676327
    resources: tmpdir=/tmp

[Tue Jan 30 16:30:53 2024]
Finished job 12.
2 of 26 steps (8%) done
Select jobs to execute...

[Tue Jan 30 16:30:53 2024]
checkpoint create_db_intervals:
    input: results/GCA_003018455.1/data/genome/GCA_003018455.1.fna, results/GCA_003018455.1/data/genome/GCA_003018455.1.fna.fai, results/GCA_003018455.1/data/genome/GCA_003018455.1.dict, results/GCA_003018455.1/intervals/master_interval_list.list
    output: results/GCA_003018455.1/intervals/db_intervals/intervals.txt, results/GCA_003018455.1/intervals/db_intervals
    log: logs/GCA_003018455.1/db_intervals/log.txt
    jobid: 21
    benchmark: benchmarks/GCA_003018455.1/db_intervals/benchmark.txt
    reason: Missing output files: results/GCA_003018455.1/intervals/db_intervals/intervals.txt
    wildcards: refGenome=GCA_003018455.1
    resources: tmpdir=/tmp, mem_mb=5000, mem_mib=4769
DAG of jobs will be updated after completion.

Activating conda environment: .snakemake/conda/ec2d2883921c842412450a0289e25d36_
[Tue Jan 30 16:31:01 2024]
Error in rule create_db_intervals:
    jobid: 21
    input: results/GCA_003018455.1/data/genome/GCA_003018455.1.fna, results/GCA_003018455.1/data/genome/GCA_003018455.1.fna.fai, results/GCA_003018455.1/data/genome/GCA_003018455.1.dict, results/GCA_003018455.1/intervals/master_interval_list.list
    output: results/GCA_003018455.1/intervals/db_intervals/intervals.txt, results/GCA_003018455.1/intervals/db_intervals
    log: logs/GCA_003018455.1/db_intervals/log.txt (check log file(s) for error details)
    conda-env: /fs/scratch/PAS1554/snpArcher/.test/ecoli/.snakemake/conda/ec2d2883921c842412450a0289e25d36_
    shell:

        gatk SplitIntervals -L results/GCA_003018455.1/intervals/master_interval_list.list         -O results/GCA_003018455.1/intervals/db_intervals -R results/GCA_003018455.1/data/genome/GCA_003018455.1.fna -scatter 1         -mode INTERVAL_SUBDIVISION         --interval-merging-rule OVERLAPPING_ONLY &> logs/GCA_003018455.1/db_intervals/log.txt
        ls -l results/GCA_003018455.1/intervals/db_intervals/*scattered.interval_list > results/GCA_003018455.1/intervals/db_intervals/intervals.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job create_db_intervals since they might be corrupted:
results/GCA_003018455.1/intervals/db_intervals
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-01-30T163042.968256.snakemake.log

cademirch commented 4 months ago

Thanks. Can you also paste in the SplitIntervals log? It should be in .test/ecoli/logs.

vitorpavinato commented 4 months ago

Is this the one, found at .test/ecoli/logs/GCA_003018455.1/db_intervals?

16:31:01.301 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/fs/scratch/PAS1554/snpArcher/.test/ecoli/.snakemake/conda/ec2d2883921c842412450a0289e25d36_/share/gatk4-4.1.8.0-0/gatk-package-4.1.8.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jan 30, 2024 4:31:01 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
16:31:01.474 INFO  SplitIntervals - ------------------------------------------------------------
16:31:01.474 INFO  SplitIntervals - The Genome Analysis Toolkit (GATK) v4.1.8.0
16:31:01.474 INFO  SplitIntervals - For support and documentation go to https://software.broadinstitute.org/gatk/
16:31:01.474 INFO  SplitIntervals - Executing as vitorpavinato@owens-login04.hpc.osc.edu on Linux v3.10.0-1160.102.1.el7.x86_64 amd64
16:31:01.474 INFO  SplitIntervals - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_382-b05
16:31:01.475 INFO  SplitIntervals - Start Date/Time: January 30, 2024 4:31:01 PM EST
16:31:01.475 INFO  SplitIntervals - ------------------------------------------------------------
16:31:01.475 INFO  SplitIntervals - ------------------------------------------------------------
16:31:01.475 INFO  SplitIntervals - HTSJDK Version: 2.22.0
16:31:01.475 INFO  SplitIntervals - Picard Version: 2.22.8
16:31:01.475 INFO  SplitIntervals - HTSJDK Defaults.COMPRESSION_LEVEL : 2
16:31:01.475 INFO  SplitIntervals - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
16:31:01.475 INFO  SplitIntervals - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
16:31:01.475 INFO  SplitIntervals - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
16:31:01.476 INFO  SplitIntervals - Deflater: IntelDeflater
16:31:01.476 INFO  SplitIntervals - Inflater: IntelInflater
16:31:01.476 INFO  SplitIntervals - GCS max retries/reopens: 20
16:31:01.476 INFO  SplitIntervals - Requester pays: disabled
16:31:01.476 INFO  SplitIntervals - Initializing engine
16:31:01.785 INFO  SplitIntervals - Shutting down engine
[January 30, 2024 4:31:01 PM EST] org.broadinstitute.hellbender.tools.walkers.SplitIntervals done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=1164443648
***********************************************************************

A USER ERROR has occurred: Badly formed genome unclippedLoc: Query interval "CP027599.1 : 1 - 5942969" is not valid for this input.

***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
Using GATK jar /fs/scratch/PAS1554/snpArcher/.test/ecoli/.snakemake/conda/ec2d2883921c842412450a0289e25d36_/share/gatk4-4.1.8.0-0/gatk-package-4.1.8.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /fs/scratch/PAS1554/snpArcher/.test/ecoli/.snakemake/conda/ec2d2883921c842412450a0289e25d36_/share/gatk4-4.1.8.0-0/gatk-package-4.1.8.0-local.jar SplitIntervals -L results/GCA_003018455.1/intervals/master_interval_list.list -O results/GCA_003018455.1/intervals/db_intervals -R results/GCA_003018455.1/data/genome/GCA_003018455.1.fna -scatter 1 -mode INTERVAL_SUBDIVISION --interval-merging-rule OVERLAPPING_ONLY
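
One way to investigate a "Badly formed genome unclippedLoc" message like the one above is to check the rejected interval against the reference's `.fai` index, whose first two tab-separated columns are the contig name and its length. The sketch below is a diagnostic aid only, not part of snpArcher or GATK; the function names are hypothetical, and the `contig:start-end` pattern is a simplification that assumes contig names contain no colons or whitespace.

```python
import re
from pathlib import Path

# Simplified contig:start-end pattern (assumption: no ':' or spaces in contig names).
INTERVAL_RE = re.compile(r"^(?P<contig>[^\s:]+):(?P<start>\d+)-(?P<end>\d+)$")


def read_fai_lengths(fai_path):
    """Map contig name -> length using columns 1 and 2 of a .fai index."""
    lengths = {}
    for line in Path(fai_path).read_text().splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            lengths[name] = int(length)
    return lengths


def interval_problem(interval, lengths):
    """Return a short description of what looks wrong, or None if nothing does."""
    match = INTERVAL_RE.match(interval)
    if match is None:
        return "not in contig:start-end form (check for stray spaces)"
    contig = match["contig"]
    start, end = int(match["start"]), int(match["end"])
    if contig not in lengths:
        return f"contig {contig!r} is not in the index"
    if not 1 <= start <= end <= lengths[contig]:
        return f"coordinates fall outside 1-{lengths[contig]}"
    return None
```

For this run, the index would be `results/GCA_003018455.1/data/genome/GCA_003018455.1.fna.fai` and the candidate intervals would come from `master_interval_list.list`, both relative to `.test/ecoli`.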

cademirch commented 4 months ago

Thanks for posting that @vitorpavinato. Unfortunately, I cannot seem to recreate this. Could you let me know your python3 version? I suspect this may be the issue. For reference, I ran the same test on SLURM using python3==3.11.4.

vitorpavinato commented 4 months ago

Yes, sure. The conda environment I set up has Python 3.12.1. I should also mention that I used conda in place of mamba in these commands:

mamba create -c conda-forge -c bioconda -n snparcher snakemake
mamba activate snparcher

cademirch commented 4 months ago

Could you try with Python 3.11.x? I believe this is an issue with Snakemake and f-strings in Python 3.12.
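
For background, Python 3.12 changed how f-strings are tokenized (PEP 701): up to 3.11 an f-string is a single `STRING` token, while from 3.12 it is split into `FSTRING_START`, `FSTRING_MIDDLE`, and `FSTRING_END` tokens with ordinary tokens in between. The sketch below only shows that difference using the standard `tokenize` module; the helper name `token_names` and the example f-string are illustrative. Whether this is the exact mechanism behind the failure in this thread is an assumption, although the spaces in the rejected interval string ("CP027599.1 : 1 - 5942969") would be consistent with an f-string being reassembled token by token.

```python
import io
import sys
import tokenize


def token_names(source):
    """Return the token type names Python's tokenizer produces for `source`."""
    readline = io.StringIO(source).readline
    return [tokenize.tok_name[tok.type] for tok in tokenize.generate_tokens(readline)]


# Prints the running interpreter's (major, minor) version and the token
# names for one f-string, so the two interpreters can be compared directly.
print(sys.version_info[:2], token_names('f"{contig}:{start}-{end}"\n'))
```

Running this under both the 3.11 and 3.12 environments shows which tokenization a tool that parses Snakefiles would have to handle.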

vitorpavinato commented 4 months ago

Hi @cademirch. Just to let you know, the test worked with python=3.11.6. Consider adding a requirements-style file to pin a Python version that works with Snakemake and f-strings.

cademirch commented 4 months ago

Thanks for the update @vitorpavinato. We've now pinned the Python and Snakemake versions in the docs.
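
Besides pinning versions in the docs or at environment creation (for example with a `python=3.11` spec), a small runtime guard can fail early when the interpreter is newer than what was tested. This is a sketch under the assumption drawn from this thread that 3.11.x works and 3.12.1 does not; the names `MAX_EXCLUSIVE`, `python_is_supported`, and `require_supported_python` are hypothetical and not part of snpArcher.

```python
import sys

# Assumption from this thread: 3.11.4 and 3.11.6 worked, 3.12.1 did not.
MAX_EXCLUSIVE = (3, 12)


def python_is_supported(version=None, max_exclusive=MAX_EXCLUSIVE):
    """Return True if `version` (default: the running interpreter) is below the cutoff."""
    version = sys.version_info if version is None else version
    return tuple(version[:2]) < max_exclusive


def require_supported_python():
    """Exit with a readable message instead of failing later inside a rule."""
    if not python_is_supported():
        found = ".".join(str(part) for part in sys.version_info[:3])
        limit = ".".join(str(part) for part in MAX_EXCLUSIVE)
        raise SystemExit(f"Python {found} is untested here; use a version below {limit}")
```

Calling `require_supported_python()` at the top of a launcher script would stop the run before any jobs are scheduled.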