Open kkrick-sdsu opened 12 months ago
Thank you for your detailed report (including attempts to mitigate this on your side, appreciated!), Kyle!
Yes, the SLURM environment tends to be problematic for PGAP. We will assess this situation ASAP.
Question:
Does #SBATCH --cpus-per-task=4 work like setting the environment variable SLURM_CPUS_PER_TASK?
Note that the file singularity complains about does exist, according to your listing. Is it possible that you have some local singularity settings that disallow your directory as a mount source?
Also, I just found out, in our FAQ, that:
While nothing in the software intentionally prevents use on a cluster, we cannot provide assistance for this use case, given the additional complexity
Hi Azat,
As it turns out, the file exists only temporarily. I cleaned up the directory and re-ran the batch job. That listing was grabbed while PGAP was trying to run. Once the run fails, a second listing shows that the file is no longer there. PGAP must be deleting the file after the run fails. With that revelation, I am not sure what the actual problem is.
From my understanding, SLURM_CPUS_PER_TASK gets set and is useable while the job is running (for instance, to pass to a program so that it knows the actual core count it has to deal with) and the sbatch flag --cpus-per-task actually controls how many are requested during scheduling.
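To illustrate that distinction, here is a minimal sketch of a batch script, assuming a bash job script submitted with sbatch. The #SBATCH line is the scheduling request, while SLURM_CPUS_PER_TASK is read inside the running job. The helper name job_cpus and the fallback value of 1 are illustrative choices, not anything PGAP or SLURM prescribes.

```shell
#!/usr/bin/env bash
# Scheduling request: ask SLURM for 4 cores per task.
#SBATCH --cpus-per-task=4

# Inside the running job, SLURM exports SLURM_CPUS_PER_TASK
# (only when --cpus-per-task was given). Fall back to 1 when
# the variable is unset, e.g. when run outside of SLURM.
job_cpus() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

echo "Cores available to this task: $(job_cpus)"
```

The value printed by job_cpus is what one would pass on to a program that needs to know its actual core count.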
I understand about not being able to offer support. On that FAQ I do see that --no-internet may help, so I will try that as well.
For what it is worth, this was working with PGAP version 2023-05-17.build6771. I wish I had known about the --no-self-update flag, as my woes started when PGAP updated itself.
Kyle, you can still run the May version by using the use-version parameter and, as you discovered yourself, the --no-self-update flag.
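As a rough sketch of such a pinned invocation, the snippet below only assembles and prints a command line rather than running PGAP. The --no-self-update flag and the version string come from this thread; the --use-version spelling, the ./pgap.py launcher path, and the placeholder for the remaining arguments are assumptions to check against the output of pgap.py --help.

```shell
#!/usr/bin/env bash
# Sketch only: build (but do not execute) a PGAP command pinned to the May release.
# Assumption: the parameter is spelled --use-version; confirm with ./pgap.py --help.
PGAP_VERSION="2023-05-17.build6771"   # version reported as working in this thread

pinned_pgap_cmd() {
    # "$@" stands in for your usual PGAP arguments (input, output directory, etc.)
    echo ./pgap.py --no-self-update --use-version "$PGAP_VERSION" "$@"
}

pinned_pgap_cmd "<your usual arguments>"
```

Printing the command first makes it easy to confirm the flags before placing the real call in an sbatch script.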
I opened an internal investigation (code PGAPX-1229) for this, Kyle.
Describe the bug: PGAP fails to start a singularity container because it is attempting to bind a file that does not yet exist.
To Reproduce: Using PGAP version 2023-10-03.build7061. Followed steps from quick start.
Starting directory structure:
Submitting SLURM job with the following sbatch script named pgap.slurm:
Results in the following directory structure:
slurm-18657.out:
cwltool.log:
I attempted to run the docker command directly and got the following error:
So, it appears to be failing because the file pgap_input_1zzxtcbo.yaml does not exist.
Expected behavior: PGAP should run successfully.
Software versions (please complete the following information):
Log Files: Ran with --debug but the debug and debug/log directories were empty.
Additional context: I had read some troubleshooting from other reported issues and tried this for the sbatch script, same results: