harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

Error with big dataset and slurm #198

Open · max-hence opened this issue 2 weeks ago

max-hence commented 2 weeks ago

Dear snparcher developers,

I encounter errors when I run the pipeline on full-size datasets. I get this kind of message during bam2gvcf, gvcf2DB, DB2vcf, or concat_gvcfs:

Error in rule DB2vcf:
    message: SLURM-job '12919779' failed, SLURM status is: 'FAILED'For further error details see the cluster/cloud log and the log files of the involved rule(s).
    jobid: 2590
    input: results/GCA_015227805.2/genomics_db_import/DB_L0224.tar, results/GCA_015227805.2/data/genome/GCA_015227805.2.fna, results/GCA_015227805.2/data/genome/GCA_015227805.2.fna.fai, results/GCA_015227805.2/data/genome/GCA_015227805.2.dict
    output: results/GCA_015227805.2/vcfs/intervals/L0224.vcf.gz, results/GCA_015227805.2/vcfs/intervals/L0224.vcf.gz.tbi
    log: logs/GCA_015227805.2/gatk_genotype_gvcfs/0224.txt, /scratch/mbrault/snpcalling/hrustica/.snakemake/slurm_logs/rule_DB2vcf/GCA_015227805.2_0224/12919779.log (check log file(s) for error details)
    conda-env: /scratch/mbrault/snpcalling/hrustica/.snakemake/conda/040e922e8494c7bc027131fb77bc2d6d_
    shell:

        tar -xf results/GCA_015227805.2/genomics_db_import/DB_L0224.tar
        gatk GenotypeGVCFs \
            --java-options '-Xmx180000m -Xms180000m' \
            -R results/GCA_015227805.2/data/genome/GCA_015227805.2.fna \
            --heterozygosity 0.005 \
            --genomicsdb-shared-posixfs-optimizations true \
            -V gendb://results/GCA_015227805.2/genomics_db_import/DB_L0224 \
            -O results/GCA_015227805.2/vcfs/intervals/L0224.vcf.gz \
            --tmp-dir <TBD> &> logs/GCA_015227805.2/gatk_genotype_gvcfs/0224.txt

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
    external_jobid: 12919779

Trying to restart job 2590.

Here are the logs linked to this job:

I gave you an example with a detailed log here, but most of the time, for the bam2gvcf or concat_gvcfs rules, the logs are empty and I can't find any clue to understand the error.

At first sight I would have said it's a memory error, because the job can restart and succeed the second or third time. But sometimes the job fails even with a huge amount of memory. And what worries me is that I recently tried on another cluster with a more recent SLURM version and, even though I get the same errors there, after the 2nd or 3rd try the jobs end up succeeding and the pipeline runs to the end.

My main question is then: is your pipeline set up for a specific version of SLURM? Or do I need to tune the "minNmer", "num_gvcf_intervals", and "db_scatter_factor" parameters to better handle big datasets?

Tell me if you need more information.

Thanks a lot !

Maxence Brault

cademirch commented 2 weeks ago

Hi Maxence,

Thanks for opening such a detailed issue.

I think it is unlikely that SLURM versions are the culprit here. Given that you got the workflow to work on a different cluster, I suspect it could be related to the tmpdir setting. You should check with your cluster admins/docs about the best place to write temp files on your cluster.

It's also possible that something is wrong with how resources are being specified. It seems like you're using mem_mb_per_cpu to specify memory; however, unless you've modified the workflow rules to use that resource, they might still be requesting memory through mem_mb.
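
As a rough illustration, keeping memory on mem_mb and raising it per rule would look something like the sketch below. This assumes a Snakemake >=8 profile using the SLURM executor plugin (the exact keys differ slightly in older versions); the partition name and memory values are placeholders, not recommendations:

    # profile config.yaml (sketch)
    executor: slurm
    jobs: 100
    default-resources:
      mem_mb: 8000                 # fallback when a rule doesn't set mem_mb itself
      slurm_partition: "general"   # hypothetical partition name
    set-resources:
      bam2gvcf:
        mem_mb: 32000              # hypothetical per-rule bumps for the heavier GATK steps
      DB2vcf:
        mem_mb: 64000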

max-hence commented 2 weeks ago

Hi Cade,

Thanks a lot for your quick answer. It's good to narrow down where the problem comes from. I'll ask the cluster admins if they have an explanation. However, I doubt it comes from the tmpdir, as my whole pipeline runs in a /scratch directory meant to handle heavy temporary files.

Yes, on the cluster I'm using, the mem_mb argument produces this error: srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.

I bypassed the error by replacing mem_mb with mem_mb_per_cpu, but that may have made things worse. It's also something I don't run into on another SLURM cluster.

Do you think setting minNmer, num_gvcf_intervals, and db_scatter_factor could also improve how memory is handled across jobs and cluster nodes? If so, do you have recommendations for good values? The sketch below shows which settings I mean.
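
For clarity, these are the config keys I'm referring to; the values are purely placeholders (not settings I'm proposing), and I'm assuming they sit in the workflow's config/config.yaml:

    # config/config.yaml (placeholder values, only to show which settings are meant)
    minNmer: 85000
    num_gvcf_intervals: 10
    db_scatter_factor: 0.15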

Thanks again,

Maxence Brault

tsackton commented 2 weeks ago

The default temporary directory is not where the workflow runs; it is whatever your system settings are, which is probably /tmp on the compute node. I notice that you don't have anything set for the "bigtmp" option in your config. You might try setting this to snpArcher-tmp/ or something similar (note no leading slash, so it is created as a directory in the working directory you run the command from).
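
Concretely, that would look something like this in the workflow config (a sketch; adjust the path to wherever your cluster recommends putting large temp files):

    # config/config.yaml (sketch)
    bigtmp: "snpArcher-tmp/"   # relative path, created inside the directory snakemake is run from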

There might also be memory issues with using mem_mb_per_cpu. Can you share your slurm profile file? That's where these parameters would be set, and seeing it might help us spot specific problems.

max-hence commented 2 weeks ago

Thank you for your help. I'll try setting the bigtmp option.

Here is the slurm profile file: slurm_profile.txt

Earlier, I had changed mem_mb to mem_mb_per_cpu, and I think most of the errors were coming from that, but it seems that just adding --mem=<n>G to the main sbatch command solved my incompatibility problem between the mem_mb_per_cpu and mem_mb arguments...