ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

Error during annotation - taskset: failed to set pid 0's affinity: Invalid argument #202

Closed: matnguyen closed this issue 2 years ago

matnguyen commented 2 years ago

Hello, I am trying to run PGAP on a dataset of E. coli assembled contigs. The key error that occurs on some samples is: taskset: failed to set pid 0's affinity: Invalid argument. Some samples work fine, while others hit this error.

Here is my cwltool.log

Would you happen to know what the error is caused by? Thanks in advance!

azat-badretdin commented 2 years ago

Thank you, Matthew, for your report. Did you try running the example from the Quick Start notes?

matnguyen commented 2 years ago

Yes, the example in the Quick Start works fine.

azat-badretdin commented 2 years ago

Thanks, Matthew! Would you be willing to share your input files with us, privately?

matnguyen commented 2 years ago

Sure thing, how should I proceed with that?

azat-badretdin commented 2 years ago

Please email the genome to prokaryote-tools@ncbi.nlm.nih.gov. Either attach the genome or tell us where to get it.

Thanks!

azat-badretdin commented 2 years ago

Thanks, Matthew! Apologies for the delay. We opened an internal ticket for this and will look at it soon.

azat-badretdin commented 2 years ago

Hmm... I am running the example you sent us right now, and it has already moved beyond the breakage point...

[1]+  Running                 time ./pgap.py --cpus 4 -n --no-internet --ignore-all-errors --container-path /home/badrazat/pgap_2022-04-14.build6021.sif -o /home/badrazat/E195_S21 /home/badrazat/input1.yaml --docker singularity &>my.log &

Note that I am using 4 CPUs, not 8, because my host has only 4 CPUs.

Also, as you can see, I had to explicitly add --docker singularity to get past script errors (the run started by trying to read the SIF container with Docker). Apparently you did not need it, but you can try adding it, of course.

So far, we have not been able to reproduce the problem. The execution moved beyond the breakage point without any trace of your error:


Original command: ./pgap.py --cpus 4 -n --no-internet --ignore-all-errors --container-path /home/badrazat/pgap_2022-04-14.build6021.sif -o /home/badrazat/E195_S21 /home/badrazat/input1.yaml --docker singularity

Docker command: /usr/local/bin/singularity exec --bind /home/badrazat/.pgap/input-2022-04-14.build6021:/pgap/input:ro --bind /home/badrazat:/pgap/user_input --bind /home/badrazat/pgap_input_wxc0wsrw.yaml:/pgap/user_input/pgap_input.yaml:ro --bind /tmp:/tmp:rw --bind /home/badrazat/E195_S21.1:/pgap/output:rw --pwd /pgap /home/badrazat/pgap_2022-04-14.build6021.sif /bin/taskset -c 0-3 cwltool --timestamps --debug --disable-color --preserve-entire-environment --outdir /pgap/output pgap/pgap.cwl /pgap/user_input/pgap_input.yaml

--- Start YAML Input ---
fasta:
  class: File
  location: E195_S21.fasta
submol:
  class: File
  location: pgap_submol_i539ige3.yaml
supplemental_data: { class: Directory, location: /pgap/input }
report_usage: false
ignore_all_errors: true
no_internet: true
--- End YAML Input ---

--- Start Runtime Report ---
{
    "CPU cores": 4,
    "Docker image": "/home/badrazat/pgap_2022-04-14.build6021.sif",
    "cpu flags": "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke",
    "cpu model": "Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz",
    "max user processes": "unlimited",
    "memory (GiB)": 31.0,
    "memory per CPU core (GiB)": 7.8,
    "open files": 8192,
    "tmp disk space (GiB)": 194.4,
    "virtual memory": "unlimited",
    "work disk space (GiB)": 194.4
}
--- End Runtime Report ---

[2022-05-20 10:56:34] INFO /pgap/venv/bin//cwltool 3.1.20220224085855
[2022-05-20 10:56:34] INFO Resolved 'pgap/pgap.cwl' to 'file:///pgap/pgap/pgap.cwl'
pgap/pgap.cwl:22:7: Warning: Field `location` contains undefined reference to
                    `file:///pgap/pgap/input`
[2022-05-20 11:01:29] DEBUG [workflow ] initialized from file:///pgap/pgap/pgap.cwl
[2022-05-20 11:01:29] INFO [workflow ] start

azat-badretdin commented 2 years ago

Have you used taskset to limit the number of CPUs?

azat-badretdin commented 2 years ago

Another possibility is that you are running PGAP under a cluster scheduler (such as SGE/UGE or Slurm). If so, what was the job resource specification? Feel free to use a --cpus value that matches that specification.
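
For example, inside the job allocation you can ask the kernel which cores (and how many) the scheduler actually granted, and set --cpus to match. A minimal Python sketch (this helper is not part of pgap.py):

import os

# The set of cores this process may run on, as restricted by the
# scheduler / cgroups (Linux-only call).
allowed = os.sched_getaffinity(0)
print("granted cores:", sorted(allowed))
print("suggested flag: --cpus", len(allowed))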

matnguyen commented 2 years ago

@azat-badretdin

Sorry for the delay. I have tried to limit the CPUs. Perhaps it is indeed the cluster scheduler. Here is my batch file for Slurm:

#!/bin/bash
#SBATCH --time=24:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --job-name=pgap

export PGAP_INPUT_DIR=/home/mnguyen/ecoli/bin/pgap
python /home/mnguyen/ecoli/bin/pgap/pgap.py --cpus 8 -n --no-internet --ignore-all-errors --container-path /home/mnguyen/ecoli/bin/pgap/pgap_2022-04-14.build6021.sif -o /home/mnguyen/scratch/ecoli/pgap/${ID}/${ID} /home/mnguyen/scratch/ecoli/pgap/${ID}/input.yml
nitishnih commented 2 years ago

Hello,

I am encountering the same issue working on an HPC cluster running SLURM. I am trying to run the Quick Start data set. Here are the commands I ran in an interactive job with 8 CPUs:

module load singularity python/3.8
export PGAP_INPUT_DIR=/data/$USER/pgap
wget https://github.com/ncbi/pgap/raw/prod/scripts/pgap.py
chmod +x pgap.py
./pgap.py --update -D singularity
./pgap.py -n --container-path /data/$USER/pgap_2022-04-14.build6021.sif \
        -D singularity --no-internet -c 8 -o /data/$USER/mg37_results \
        $PGAP_INPUT_DIR/test_genomes/MG37/input.yaml

Everything runs fine up until the last command. The output of the last command is:

--no-internet flag enabled, not checking remote versions.
Output will be placed in: /data/$USER/mg37_results
PGAP failed, docker exited with rc = 1
Unable to find error in log file.

The last line of /data/$USER/mg37_results/cwltool.log is:

taskset: failed to set pid 0's affinity: Invalid argument

If I remove -c 8 from the command above, I do not get this error and the run finishes successfully. However, as expected, pgap then uses more than the allocated CPUs, which is not good on a shared resource. Let me know if I can provide more information.

tbazilegith commented 2 years ago

Hello, I keep having the same issue. After I removed -c 8, I got this while the run was still going:

WARNING: tmp disk space (GiB) is less than the recommended value of 10

At the end, this:

WARNING Final process status is permanentFail

Does anyone know what it could be? Thanks! TJ

azat-badretdin commented 2 years ago

@matnguyen Matthew, were you able to fix the problem on your end? If yes, could you please share the solution with other users? Thanks!

tbazilegith commented 2 years ago

I am still troubleshooting, but I am working in an HPC environment, where disk space shouldn't be an issue.

azat-badretdin commented 2 years ago

Thanks

matnguyen commented 2 years ago

I wasn't able to fix the problem on my end. Sometimes rerunning PGAP would work, so it seems like some issue with Singularity on an HPC.

azat-badretdin commented 2 years ago

Thanks. Well, I hope you can resolve it.

MrTomRod commented 1 year ago

I get the same error message and the same unpredictable behavior ("sometimes rerunning PGAP would work"). Removing --cpus does not help, though.

Ah, I hate Singularity. Soon our cluster will be based on Rocky Linux (a successor of CentOS), and then I'll hopefully be able to use podman or rootless docker.

azat-badretdin commented 1 year ago

Sorry to hear about that, Thomas.

dustin-cram commented 1 year ago

I believe the problem here is that pgap.py really shouldn't be using taskset, especially in a traditional HPC cluster environment.

taskset binds a process to specific processor cores. In this case, pgap.py is explicitly choosing the cores numbered 0 through n-1, where n is the value provided to --cpus. But Slurm and other schedulers also bind the process to specific cores and they may choose a different set of cores. This explains why @matnguyen observes some samples failing and some succeeding. If by chance there is at least one core shared between the set of cores chosen by taskset and Slurm, then it will run. If not, it will fail. And even if it runs, it is presumably only able to make use of the number of cores in the intersection of those two sets.
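
For illustration, the same failure can be reproduced with Python's os module (an illustrative sketch, not code from pgap.py): sched_setaffinity behaves like taskset and fails with EINVAL exactly when the requested core set and the granted core set do not overlap.

import os

def simulate_taskset(requested):
    # Mimic `taskset -c`: ask the kernel to bind this process to `requested`.
    allowed = os.sched_getaffinity(0)  # cores Slurm/cgroups actually granted
    print("requested:", sorted(requested), "granted:", sorted(allowed))
    try:
        os.sched_setaffinity(0, requested)
        # The kernel silently intersects the request with the granted set.
        print("bound to overlap:", sorted(requested & allowed))
    except OSError as err:
        # Empty overlap -> EINVAL, i.e. exactly
        # "failed to set pid 0's affinity: Invalid argument"
        print("failed:", err)

# pgap.py --cpus 4 requests cores 0-3; whether this works depends on
# which cores the scheduler happened to grant this particular job.
simulate_taskset({0, 1, 2, 3})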

If I comment out the lines (195 and 196) in pgap.py where taskset is applied (and add the --no-self-update flag), then I no longer observe any failures running under Slurm.
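
An alternative to deleting those lines (a hypothetical sketch, not an existing pgap.py patch) would be to build the core list for taskset from the affinity mask the scheduler already granted, instead of assuming cores 0 through n-1:

import os

def pick_cores(n):
    # Take up to n cores from the set actually granted by the scheduler,
    # rather than hard-coding 0..n-1.
    allowed = sorted(os.sched_getaffinity(0))
    return ",".join(str(c) for c in allowed[:n])  # e.g. "32,33,34,35"

print("taskset -c " + pick_cores(4))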

Even when running outside a cluster environment, I think the use of taskset may be a problem. If a user is running multiple instances of PGAP in parallel, the taskset is going to force them all to share the same set of cores which may be fewer than what is available on the system.

Perhaps there is something unique about the environment at NCBI that makes taskset appropriate, but in just about any other environment I believe it will either cause problems, or at best, neither help nor harm.