max-hence opened this issue 2 weeks ago (status: Open)
Hi Maxence,
Thanks for opening such a detailed issue.
I think it's unlikely that SLURM versions are the culprit here. Given that you got the workflow to work on a different cluster, I suspect it could be related to the `tmpdir` setting. You should check with your cluster admins/docs to see where they suggest writing temp files on your cluster.

It's also possible something is wrong with how resources are being specified. It seems like you're using `mem_mb_per_cpu` to specify memory; however, unless you've modified the workflow rules to use that resource, they might still be using just `mem_mb`.
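For reference, a minimal sketch of where these resource names typically live in a Snakemake SLURM profile's `default-resources` section (the keys shown follow the common convention for the SLURM executor; the actual profile in this thread may differ):

```yaml
# profile/config.yaml (illustrative sketch, not the profile from this thread)
default-resources:
  mem_mb: 8000            # per-job memory, submitted as --mem
  # mem_mb_per_cpu: 2000  # per-CPU alternative, submitted as --mem-per-cpu
```

Rules that declare `mem_mb` in their own `resources:` section override this default, which is why changing only the profile may not affect them.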
Hi Cade,
Thanks a lot for your quick answer. It's good to have a better idea of where the problem comes from. I'll ask the cluster admins if they have an explanation. However, I doubt it comes from the `tmpdir`, as my whole pipeline runs in a `/scratch` directory made to handle heavy temporary files.
Yes, on the cluster I'm using, the `mem_mb` argument raises this error:

`srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.`

I bypassed the error by replacing `mem_mb` with `mem_mb_per_cpu`, but that may have made things worse. It's also something I don't see on another SLURM cluster.
Do you think setting `minNmer`, `num_gvcf_intervals`, and `db_scatter_factor` could also improve the way memory is handled between jobs and cluster nodes? If so, do you have recommendations for good values?
Thanks again,
Maxence Brault
The default temporary directory is not where the workflow runs; it is whatever your system settings are, which is probably `/tmp` on the compute node. I notice that you don't have anything set for the `bigtmp` option in your config. You might try setting this to `snpArcher-tmp/` or something similar (note the lack of a leading slash, so it is created as a directory inside the working directory you run the command from).
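For illustration, the suggestion above would look something like this in the config file (assuming the option is a plain key named `bigtmp`; check the snpArcher docs for the exact spelling):

```yaml
# config/config.yaml (sketch)
bigtmp: "snpArcher-tmp/"  # relative path, so it is created inside the working directory
```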
There might also be memory issues with using `mem_mb_per_cpu`. Can you share your SLURM profile file? That is where these parameters would be set, and seeing it might help us spot specific problems.
Thank you for your help. I'll try setting the `bigtmp` option.

Here is the SLURM profile file: slurm_profile.txt

Earlier, I had changed `mem_mb` to `mem_mb_per_cpu`, and I think most of the errors were coming from there, but it seems that just adding `--mem=<n>G` to the main sbatch command solved my incompatibility problem between the `mem_mb_per_cpu` and `mem_mb` arguments...
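For anyone hitting the same conflict, a sketch of that kind of top-level submission (the memory value and script name here are placeholders, not the actual ones used in this thread):

```shell
# Request per-node memory on the outer job so it does not clash with
# per-CPU memory requests made by the workflow's own srun steps.
sbatch --mem=64G submit_snparcher.sh
```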
Dear snparcher developers,
I encounter errors when I run the pipeline on real-size datasets. I get this kind of message during the bam2gvcf, gvcf2DB, DB2vcf, or concat_gvcfs rules.
Here are the logs linked to this job:
- /scratch/mbrault/snpcalling/hrustica/.snakemake/slurm_logs/rule_DB2vcf/GCA_015227805.2_0224/12919779.log: example_job12919779.log
- logs/GCA_015227805.2/gatk_genotype_gvcfs/0224.txt: example_0224.txt
- full log: full_log.txt
- config file: config.txt
I gave an example with a detailed log here, but most of the time, for the bam2gvcf or concat_gvcfs rules, the logs are empty and I can't find any clue to understand the error.

At first sight I would have said it's a memory error, because a job can restart and work the second or third time. But sometimes the job fails even with a huge amount of memory. And what worries me is that I recently tried another cluster with a more recent SLURM version, and even though I get the same errors there, after the 2nd or 3rd try the jobs end up being successful and the pipeline runs to the end.
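Since the jobs often succeed on a retry, one common Snakemake idiom (a sketch assuming you are willing to edit the workflow rules; the helper name and base value are hypothetical) is to scale the memory request with the attempt number:

```python
# Sketch of the "grow memory on retry" pattern often used in Snakemake rules.
# Hypothetical helper; snpArcher's rules would need to be edited to use it.
def mem_for_attempt(base_mb: int, attempt: int) -> int:
    """Double the memory request on each automatic retry (attempt starts at 1)."""
    return base_mb * 2 ** (attempt - 1)

# In a Snakefile rule this would appear as:
#   resources:
#       mem_mb=lambda wc, attempt: mem_for_attempt(8000, attempt)
print(mem_for_attempt(8000, 1))  # 8000 on the first try
print(mem_for_attempt(8000, 3))  # 32000 on the third try
```

Combined with Snakemake's `--retries` option, this lets transient out-of-memory failures resolve themselves without hand-tuning each rule.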
My main question is then: is your pipeline set up for a specific version of SLURM? Or do I need to tune the `minNmer`, `num_gvcf_intervals`, and `db_scatter_factor` parameters to improve the handling of big datasets?

Tell me if you need more information.
Thanks a lot !
Maxence Brault