skchronicles opened this issue 1 month ago (status: Open)
I am not super familiar with SLURM, but it seems like task 22 failed right before it was going to write the examples. Any chance you are removing the intermediate_results_dir during program termination? Can you save it somewhere outside of tmp and see if it works? Also, you can try out the quickstart, which should be very quick to run: https://github.com/google/deepsomatic/blob/r1.7/docs/deepsomatic-quick-start.md. That will tell you whether the issue is with DeepSomatic or with your SLURM setup.
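Since the image is being pulled as docker://google/deepsomatic:1.7.0 (i.e. through Singularity/Apptainer), a minimal sketch of what the quickstart boils down to on a single node is below. The test-data file names here are placeholders of my own, not the real quickstart files; please follow the linked doc for the actual download commands and paths.
# Sketch only: the reference/BAM names below are placeholders, see the
# quickstart doc for the real test data and download steps.
QS=/data/$USER/deepsomatic-quickstart     # assumed scratch/working directory
mkdir -p "$QS" && cd "$QS"
singularity exec -B "$QS":"$QS" docker://google/deepsomatic:1.7.0 \
  run_deepsomatic \
    --model_type=WGS \
    --ref="$QS/quickstart_ref.fa" \
    --reads_tumor="$QS/quickstart_tumor.bam" \
    --reads_normal="$QS/quickstart_normal.bam" \
    --sample_name_tumor=tumor_qs \
    --sample_name_normal=normal_qs \
    --output_vcf="$QS/quickstart.deepsomatic.vcf.gz" \
    --num_shards=4 \
    --intermediate_results_dir="$QS/intermediate_results_dir"
If that small run also fails on the compute node, it points at the environment rather than the full-size inputs.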
Okay, that sounds good. I was also thinking it might be running out of disk space for the --intermediate_results_dir. That option is pointing to local node storage that you allocate through the job scheduler. I was allocating 1200 GB of space, and I had added some extra commands to check disk usage on failure or exit. After reviewing the logs, it appears to have used only around 400-500 GB of disk space, but I will point it to another output directory instead. I will keep you in the loop and let you know if that resolves the issue.
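For context, here is roughly what those disk-usage checks look like in my job script (a sketch; the /lscratch/$SLURM_JOBID path and the way node-local storage is requested, e.g. --gres=lscratch:1200, are specific to our cluster):
# Sketch: report disk usage of node-local scratch whenever the job exits,
# whether it finished cleanly or failed. Paths and gres request are site-specific.
scratch="/lscratch/${SLURM_JOBID}"
report_usage() {
    echo "== Disk usage for ${scratch} at exit =="
    df -h "${scratch}"
    du -sh "${scratch}"/* 2>/dev/null | sort -h | tail -n 20
}
trap report_usage EXIT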
A colleague of mine was able to run the same command on another SLURM cluster. One interesting thing in the attached log file is some issues related to thread creation. Have you run into this before?
Hey @kishwarshafin,
I hope you are having a great day! I just wanted to provide an update as I was able to test two different things on my side.
First, I ran the same DeepSomatic command with the same input BAM files, but I changed the --intermediate_results_dir option to point to more persistent storage with an excess of space (20 TB):
run_deepsomatic \
--model_type=WGS \
--ref=/path/to/refs/Homo_sapiens_assembly38.fasta \
--reads_tumor=/path/to/output/BAM/WGS_NCI_T_1_1ug_S1.sorted.bam --reads_normal=/path/to/output/BAM/WGS_NCI_N_1_1ug_S2.sorted.bam \
--sample_name_tumor=WGS_NCI_T_1_1ug_S1 --sample_name_normal=WGS_NCI_N_1_1ug_S2 \
--output_vcf=/path/to/output/deepsomatic/somatic/WGS_NCI_T_1_1ug_S1.deepsomatic.vcf \
--num_shards=24 \
--intermediate_results_dir=/path/to/output/deepsomatic/somatic/WGS_NCI_T_1_1ug_S1_tmp
After updating --intermediate_results_dir to point to another location (a persistent disk with an over-abundance of storage), I am still running into the same error. The make_examples_somatic command is failing for another chunk (now it is chunk/task 6).
Second, I exported OPENBLAS_NUM_THREADS=1 before running DeepSomatic and pointed --intermediate_results_dir at node-local scratch with cleanup on exit:
# NEW: export OpenBLAS variable to prevent
# issues related to pthread init
export OPENBLAS_NUM_THREADS=1
# Set up a temporary directory for
# intermediate files, with a built-in
# mechanism for deletion on exit
if [ ! -d "/lscratch/$SLURM_JOBID/" ]; then mkdir -p "/lscratch/$SLURM_JOBID/"; fi
tmp=$(mktemp -d -p "/lscratch/$SLURM_JOBID/")
trap 'rm -rf "${tmp}"' EXIT
run_deepsomatic \
--model_type=WGS \
--ref=/path/to/refs/Homo_sapiens_assembly38.fasta \
--reads_tumor=/path/to/output/BAM/WGS_NCI_T_1_1ug_S1.sorted.bam --reads_normal=/path/to/output/BAM/WGS_NCI_N_1_1ug_S2.sorted.bam \
--sample_name_tumor=WGS_NCI_T_1_1ug_S1 --sample_name_normal=WGS_NCI_N_1_1ug_S2 \
--output_vcf=/path/to/output/deepsomatic/somatic/WGS_NCI_T_1_1ug_S1.deepsomatic.vcf \
--num_shards=24 \
--intermediate_results_dir=${tmp}
This appears to have resolved the issue. make_examples_somatic ran without any errors, and the warning messages related to pthread creation were no longer in the log file. With that being said, I think the issue is somehow related to pthread creation. I was able to run DeepSomatic without any issues by exporting the following environment variable beforehand: export OPENBLAS_NUM_THREADS=1. This may be system-specific, though, as the HPC system I am using has hyperthreading enabled.
Please let me know what you think.
Best regards, @skchronicles
@skchronicles,
I am not experienced with SLURM or your environment, so I can't tell you for certain whether that is the case. But it seems like you figured it out correctly, and it is possibly a system-specific issue. If you have a SLURM manager/contact, I think it would be best to run it by them, as they would know more about the environment.
Hey @kishwarshafin,
I just checked on a running job, and it appears that once call_variants starts running, CPU usage spikes well above the 24 threads/shards that were allocated.
Here is the command that was run:
# NEW: export OpenBLAS variable to prevent
# issues related to pthread init
export OPENBLAS_NUM_THREADS=1
# Set up a temporary directory for
# intermediate files, with a built-in
# mechanism for deletion on exit
if [ ! -d "/lscratch/$SLURM_JOBID/" ]; then mkdir -p "/lscratch/$SLURM_JOBID/"; fi
tmp=$(mktemp -d -p "/lscratch/$SLURM_JOBID/")
trap 'rm -rf "${tmp}"' EXIT
run_deepsomatic \
--model_type=WGS \
--ref=/path/to/refs/Homo_sapiens_assembly38.fasta \
--reads_tumor=/path/to/output/BAM/WGS_NCI_T_1_1ug_S1.sorted.bam --reads_normal=/path/to/output/BAM/WGS_NCI_N_1_1ug_S2.sorted.bam \
--sample_name_tumor=WGS_NCI_T_1_1ug_S1 --sample_name_normal=WGS_NCI_N_1_1ug_S2 \
--output_vcf=/path/to/output/deepsomatic/somatic/WGS_NCI_T_1_1ug_S1.deepsomatic.vcf \
--num_shards=24 \
--intermediate_results_dir=${tmp}
Looking at the log file, the call_variants step starts running around 5 AM, and that is also when CPU usage for the running job spikes (see the attached log file and job dashboard).
I was hoping that setting that environment variable would prevent any nested parallelism, but it appears that may still be happening with call_variants. Looking at the CPU spikes, the peaks appear to be roughly 2*num_shards. Have you observed this before, and does --num_shards=24 roughly correlate with the maximum number of threads/processes spawned by DeepSomatic?
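For what it's worth, here is a quick sketch of how the thread counts could be double-checked directly on the compute node (the process names I grep for are assumptions about what shows up in ps):
# Sketch: count DeepSomatic-related threads on the node
ps -eLf | grep -E 'call_variants|make_examples' | grep -v grep | wc -l
# Per-process thread counts (NLWP = number of threads)
ps -o pid,nlwp,comm -C python3 2>/dev/null
# SLURM's accounting view of the running job step
sstat -j "${SLURM_JOBID}" --format=JobID,AveCPU,MaxRSS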
Please let me know what you think.
Best regards, @skchronicles
@skchronicles, num_shards only controls the number of CPUs used by the make_examples and postprocess stages. For call_variants, the TensorFlow API tries to use all available resources, as that is most optimal. If you are using Docker, you can add --cpus 32 to limit CPU usage. For SLURM, please set the option that limits the number of CPUs for your job.
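For example, something along these lines (a sketch; whether the CPU limit is actually enforced depends on how your site has configured cgroups and task binding):
# Docker: cap the container at 32 CPUs
docker run --cpus 32 google/deepsomatic:1.7.0 run_deepsomatic ...   # same flags as before
# SLURM: request a fixed CPU count in the job script
#SBATCH --cpus-per-task=32
# and launch the step bound to those cores
srun --cpus-per-task=32 --cpu-bind=cores run_deepsomatic ...   # same flags as before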
Hey @kishwarshafin,
I hope you had a great weekend. Thank you for the quick and insightful response! Regarding your point that for call_variants the TensorFlow API tries to use all available resources: would it make sense to limit the number of threads spawned by the TensorFlow API within the call_variants command? It appears this may be possible (even with environment variables); however, I have never tested it: https://www.tensorflow.org/api_docs/python/tf/config/threading
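If it helps, this is the kind of thing I had in mind (completely untested on my end; whether the prebuilt call_variants binary actually reads these TensorFlow threading environment variables is an assumption I have not verified):
# Untested sketch: try to cap TensorFlow's own thread pools before
# run_deepsomatic starts. It is an assumption that the release binaries
# honor these variables.
export TF_NUM_INTRAOP_THREADS=24
export TF_NUM_INTEROP_THREADS=2
# For MKL/oneDNN-enabled builds, OpenMP threads can also be capped:
export OMP_NUM_THREADS=24
run_deepsomatic ...   # same command as above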
Please let me know what you think.
Best regards, @skchronicles
Hello there,
Thank you for creating and maintaining this amazing tool. The latest version of DeepSomatic looks awesome!
I have been testing the latest Docker image of DeepSomatic (docker://google/deepsomatic:1.7.0) against the SEQC2 tumor-normal pair to evaluate the tool's precision/recall against a truth set. While running the tool, I ran into an error during the make_examples_somatic step for one of the intermediate shards. Here is the relevant traceback right before the tool errors out:
I am also attaching the full log file: slurm-12345.txt. Please let me know what you think, and have a great day!
Here is the DeepSomatic command that was run:
Best regards, @skchronicles