marbl / verkko


Core Dump On SLURM #255

Closed: nickgladman closed this issue 2 weeks ago

nickgladman commented 1 month ago

Hello once more! Thanks yet again for this great resource. I'm running a new genome with Verkko v2.1 and got this error about 50% of the way through the steps (based on the log file). I've run this on a smaller genome under SGE without a problem, so I thought I had requested enough resources, but maybe not?

Here is the input script (I requested 96 cores and 100 GB of memory):

#!/bin/bash
#SBATCH --job-name="verkko"
#SBATCH -p medium
#SBATCH -N 1
#SBATCH -n 96
#SBATCH --mem=100000MB
#SBATCH -o "stdout.%j.%N"
#SBATCH -e "stderr.%j.%N"
date

verkko \
-d asm_seed_max_length100000 \
--seed-max-length 100000 \
--hifi /project/path/trimmed_*.fastq.gz \
--nano /project/path/ont.fasta.gz \

This is the output where the process halted:

/home/nicholas.gladman/software/.conda/envs/verkko_2.1/bin/GraphAligner -t 24 -g ../2-processGraph/unitig-unrolled-hifi-resolved.gfa -f ../3-align/split/ont190.fasta.gz -a ../3-align/aligned190.WORKING.gaf \\
  \$diploid \\
  --seeds-mxm-cache-prefix \$prefix \\
  \$memwindow \\
  --seeds-mxm-length 30 \\
  --seeds-mem-count 100000 \\
  --bandwidth 15 \\
  --multimap-score-fraction 0.99 \\
  --precise-clipping 0.85 \\
  --min-alignment-score 5000 \\
  --hpc-collapse-reads \\
  --discard-cigar \\
  --clip-ambiguous-ends 100 \\
  --overlap-incompatible-cutoff 0.15 \\
  --max-trace-count 5 \\
  --mem-index-no-wavelet-tree \\
&& \\
mv -f ../3-align/aligned190.WORKING.gaf ../3-align/aligned190.gaf
EOF

chmod +x ./aligned190.sh

./aligned190.sh > ../3-align/aligned190.err 2>&1

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-05-29T155554.227970.snakemake.log
Traceback (most recent call last):
  File "/home/nicholas.gladman/software/.conda/envs/verkko_2.1/lib/python3.9/weakref.py", line 667, in _exitfunc
  File "/home/nicholas.gladman/software/.conda/envs/verkko_2.1/lib/python3.9/weakref.py", line 591, in __call__
  File "/home/nicholas.gladman/software/.conda/envs/verkko_2.1/lib/python3.9/tempfile.py", line 829, in _cleanup
  File "/home/nicholas.gladman/software/.conda/envs/verkko_2.1/lib/python3.9/tempfile.py", line 825, in _rmtree
  File "/home/nicholas.gladman/software/.conda/envs/verkko_2.1/lib/python3.9/shutil.py", line 724, in rmtree
  File "/home/nicholas.gladman/software/.conda/envs/verkko_2.1/lib/python3.9/shutil.py", line 722, in rmtree
OSError: [Errno 116] Stale file handle: '/home/nicholas.gladman/.cache/snakemake/snakemake/source-cache/runtime-cache/tmp5p6k_pa1'
./snakemake.sh: line 17: 394789 Bus error               (core dumped) snakemake verkko --nocolor --directory . --snakefile /home/nicholas.gladman/software/.conda/envs/verkko_2.1/lib/verkko/Snakefile --configfile verkko.yml --reason --keep-going --rerun-incomplete --rerun-triggers mtime --latency-wait 2 --cores all --resources mem_gb=64
skoren commented 1 month ago

This doesn't look like an issue in verkko. It's complaining about a stale file handle in your home directory, and a bus error typically means the binary changed while the program was running. I suspect something happened with the file system during the run, or perhaps the job exceeded a time limit. Either way, you should be able to resume the run with the same command and have snakemake pick up where it left off. FYI, you can also run verkko with --slurm, which will submit jobs to the cluster rather than running them all on the single node you've reserved.
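For reference, a resumed submission along these lines might look like the sketch below: it re-uses the original assembly directory and inputs so snakemake can pick up from the last completed step, and adds the --slurm flag mentioned above (the paths and other options are simply carried over from the original script).

# Resume in the same -d directory; completed steps are skipped,
# and --slurm submits the remaining compute jobs to the cluster.
verkko \
  -d asm_seed_max_length100000 \
  --seed-max-length 100000 \
  --hifi /project/path/trimmed_*.fastq.gz \
  --nano /project/path/ont.fasta.gz \
  --slurm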

nickgladman commented 1 month ago

Thank you so much! I'll give that a go. I do have another basic question: when including the --slurm option, should I remove the following sbatch options or keep them in: `#SBATCH -N 1` and `#SBATCH -n 96`?

skoren commented 3 weeks ago

The -N 1 should be OK. I don't think you ever want the -n 96; that's 96 tasks per node, and you probably want -c or --cpus-per-task instead. You can request just 1 or 2 cores for the main verkko command you submit, which will be a busy-wait process that submits all the other computation to the grid.
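A revised header along those lines, keeping the partition and job name from the original script, might look like the following sketch (the 2-core and small memory values for the coordinating process are illustrative, not confirmed in this thread):

#!/bin/bash
#SBATCH --job-name="verkko"
#SBATCH -p medium
#SBATCH -N 1
#SBATCH --cpus-per-task=2   # 1-2 cores is enough for the coordinating verkko process
#SBATCH --mem=8G            # illustrative; the heavy steps run as separate grid jobs
#SBATCH -o "stdout.%j.%N"
#SBATCH -e "stderr.%j.%N"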

skoren commented 2 weeks ago

Idle