Closed matnguyen closed 2 years ago
Thank you, Matthew, for your report. Did you try running the example from Quick Start notes?
Yes, the example in Quick Start works fine.
Thanks, Matthew! Would you be willing to share your input files with us, privately?
Sure thing, how do I proceed with that?
Please email the genome to prokaryote-tools@ncbi.nlm.nih.gov. Either attach the genome or tell us where to get it.
Thanks!
Thanks, Matthew! Apologies about the delay. We opened an internal ticket for this and will look at it soon.
Hmm... I am running the example you sent us right now, and it has already moved past the breakage point...
[1]+ Running time ./pgap.py --cpus 4 -n --no-internet --ignore-all-errors --container-path /home/badrazat/pgap_2022-04-14.build6021.sif -o /home/badrazat/E195_S21 /home/badrazat/input1.yaml --docker singularity &>my.log &
Note that I am using 4 CPUs, not 8, because my host has only 4 CPUs.
Also, as you can see, I had to explicitly add --docker singularity
to get past script errors (it started by trying to read the SIF container with Docker). But apparently you did not need it. You can try adding it, of course...
So far, we have not been able to reproduce the problem. The execution moved beyond the breakage point without any trace of your error:
Original command: ./pgap.py --cpus 4 -n --no-internet --ignore-all-errors --container-path /home/badrazat/pgap_2022-04-14.build6021.sif -o /home/badrazat/E195_S21 /home/badrazat/input1.yaml --docker singularity
Docker command: /usr/local/bin/singularity exec --bind /home/badrazat/.pgap/input-2022-04-14.build6021:/pgap/input:ro --bind /home/badrazat:/pgap/user_input --bind /home/badrazat/pgap_input_wxc0wsrw.yaml:/pgap/user_input/pgap_input.yaml:ro --bind /tmp:/tmp:rw --bind /home/badrazat/E195_S21.1:/pgap/output:rw --pwd /pgap /home/badrazat/pgap_2022-04-14.build6021.sif /bin/taskset -c 0-3 cwltool --timestamps --debug --disable-color --preserve-entire-environment --outdir /pgap/output pgap/pgap.cwl /pgap/user_input/pgap_input.yaml
--- Start YAML Input ---
fasta:
  class: File
  location: E195_S21.fasta
submol:
  class: File
  location: pgap_submol_i539ige3.yaml
supplemental_data: { class: Directory, location: /pgap/input }
report_usage: false
ignore_all_errors: true
no_internet: true
--- End YAML Input ---
--- Start Runtime Report ---
{
  "CPU cores": 4,
  "Docker image": "/home/badrazat/pgap_2022-04-14.build6021.sif",
  "cpu flags": "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke",
  "cpu model": "Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz",
  "max user processes": "unlimited",
  "memory (GiB)": 31.0,
  "memory per CPU core (GiB)": 7.8,
  "open files": 8192,
  "tmp disk space (GiB)": 194.4,
  "virtual memory": "unlimited",
  "work disk space (GiB)": 194.4
}
--- End Runtime Report ---
[2022-05-20 10:56:34] INFO /pgap/venv/bin//cwltool 3.1.20220224085855
[2022-05-20 10:56:34] INFO Resolved 'pgap/pgap.cwl' to 'file:///pgap/pgap/pgap.cwl'
pgap/pgap.cwl:22:7: Warning: Field `location` contains undefined reference to
`file:///pgap/pgap/input`
[2022-05-20 11:01:29] DEBUG [workflow ] initialized from file:///pgap/pgap/pgap.cwl
[2022-05-20 11:01:29] INFO [workflow ] start
Have you used taskset to limit the number of CPUs?
Another possibility is that you are running PGAP under a cluster scheduler (such as SGE/UGE or Slurm). If yes, what was the job resource specification? Feel free to use a --cpus value that matches that specification.
@azat-badretdin
Sorry for the delay. I have tried to limit the CPUs. Perhaps it is indeed the cluster scheduler. Here is my batch file for Slurm:
#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --job-name=pgap
export PGAP_INPUT_DIR=/home/mnguyen/ecoli/bin/pgap
python /home/mnguyen/ecoli/bin/pgap/pgap.py --cpus 8 -n --no-internet --ignore-all-errors --container-path /home/mnguyen/ecoli/bin/pgap/pgap_2022-04-14.build6021.sif -o /home/mnguyen/scratch/ecoli/pgap/${ID}/${ID} /home/mnguyen/scratch/ecoli/pgap/${ID}/input.yml
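One way to keep the batch file's allocation and PGAP's --cpus value from drifting apart is to derive the latter from Slurm's environment. This is a hedged sketch, not part of PGAP itself: SLURM_CPUS_PER_TASK is a standard variable Slurm exports inside a job, and the fallback of 4 is an arbitrary choice for interactive testing.

```shell
#!/bin/bash
# Derive the CPU count from Slurm's allocation when available;
# fall back to 4 (arbitrary) when running outside a Slurm job.
CPUS="${SLURM_CPUS_PER_TASK:-4}"
echo "requesting --cpus ${CPUS}"
# The real invocation would then be:
#   pgap.py --cpus "${CPUS}" -n --no-internet ... input.yml
```

With this, changing #SBATCH --cpus-per-task in one place updates the PGAP invocation automatically.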
Hello,
I am encountering the same issue working on an HPC cluster running SLURM. I am trying to run the Quick Start data set. Here are the commands I ran in an interactive job with 8 CPUs:
module load singularity python/3.8
export PGAP_INPUT_DIR=/data/$USER/pgap
wget https://github.com/ncbi/pgap/raw/prod/scripts/pgap.py
chmod +x pgap.py
./pgap.py --update -D singularity
./pgap.py -n --container-path /data/$USER/pgap_2022-04-14.build6021.sif \
-D singularity --no-internet -c 8 -o /data/$USER/mg37_results \
$PGAP_INPUT_DIR/test_genomes/MG37/input.yaml
Everything runs fine up until the last command. The output of the last command is:
--no-internet flag enabled, not checking remote versions.
Output will be placed in: /data/$USER/mg37_results
PGAP failed, docker exited with rc = 1
Unable to find error in log file.
The last line of /data/$USER/mg37_results/cwltool.log is:
taskset: failed to set pid 0's affinity: Invalid argument
If I remove -c 8 from the command above, I do not get this error and the run finishes successfully. However, as expected, PGAP then uses more than the allocated CPUs, which is not good on a shared resource. Let me know if I can provide more information.
Hello, I keep having the same issue. After I removed -c 8, I got this while the run was still in progress:
WARNING: tmp disk space (GiB) is less than the recommended value of 10
And at the end, this:
WARNING Final process status is permanentFail
Does anyone know what the cause would be? Thanks! TJ
@matnguyen Matthew, were you able to fix the problem on your end? If yes, could you please share the solution with other users? Thanks!
I am still troubleshooting, but I am working in an HPC environment, where disk space shouldn't be an issue.
Thanks
I wasn't able to fix the problem on my end; sometimes rerunning PGAP would work, so it seems like some issue with Singularity on an HPC.
Thanks. Well, I hope you can resolve it.
I get the same error message and the same unpredictable behavior ("sometimes rerunning PGAP would work"). Removing --cpus does not help, though.
Ah, I hate Singularity. Soon our cluster will move to Rocky Linux (a successor of CentOS), and then I'll hopefully be able to use podman or rootless docker.
Sorry to hear about that, Thomas.
I believe the problem here is that pgap.py really shouldn't be using taskset, especially in a traditional HPC cluster environment.
taskset binds a process to specific processor cores. In this case, pgap.py explicitly chooses the cores numbered 0 through n-1, where n is the value provided to --cpus. But Slurm and other schedulers also bind the process to specific cores, and they may choose a different set. This explains why @matnguyen observes some samples failing and some succeeding: if by chance at least one core is shared between the set chosen by taskset and the set chosen by Slurm, the run proceeds; if not, it fails. And even when it runs, it can presumably only use the cores in the intersection of those two sets.
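This intersection argument can be checked directly from a shell inside the job. The following is a hedged illustration using standard Linux tooling (util-linux taskset and /proc), not anything PGAP-specific:

```shell
# Show which cores the scheduler actually granted this process:
grep Cpus_allowed_list /proc/self/status

# PGAP with --cpus 4 effectively prefixes cwltool with "taskset -c 0-3".
# If cores 0-3 have no overlap with the allowed list above, taskset fails
# with "failed to set pid 0's affinity: Invalid argument":
taskset -c 0-3 true \
  && echo "overlap exists: run would proceed" \
  || echo "no overlap: taskset fails"
```

Running this inside a failing Slurm job should print an allowed-cores list that excludes 0-3, matching the error users are seeing.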
If I comment out the lines (195 and 196) in pgap.py where taskset is applied (and add the --no-self-update flag), then I no longer observe any failures running under Slurm.
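A minimal sketch of that workaround follows. The exact line numbers (195-196) apply only to this PGAP build, so it is safer to locate the taskset call by searching; keep a backup and verify the edited file still parses. Note that commenting by pattern assumes the taskset invocation sits on self-contained lines, so inspect the grep output first.

```shell
# Locate where the pinning is applied (line numbers vary by release):
grep -n 'taskset' pgap.py

# Comment out those lines, keeping a backup at pgap.py.bak:
sed -i.bak '/taskset/ s/^/# /' pgap.py

# Sanity-check that the edited script is still valid Python:
python3 -m py_compile pgap.py
```

Remember to also pass --no-self-update afterwards, since an update would restore the original pgap.py.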
Even when running outside a cluster environment, I think the use of taskset may be a problem. If a user is running multiple instances of PGAP in parallel, taskset is going to force them all to share the same set of cores, which may be fewer than what is available on the system.
Perhaps there is something unique about the environment at NCBI that makes taskset appropriate, but in just about any other environment I believe it will either cause problems, or at best, neither help nor harm.
Hello, I am trying to run PGAP on a dataset of E. coli assembled contigs. The key error that occurs on some samples is:
taskset: failed to set pid 0's affinity: Invalid argument
Some samples work fine while some have this error. Here is my cwltool.log.
Would you happen to know what the error is caused by? Thanks in advance!