ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline
Other
294 stars 89 forks source link

[FEATURE REQUEST] Workaround for "taskset: failed to set pid 0's affinity"-Bug #289

Closed MrTomRod closed 5 months ago

MrTomRod commented 5 months ago

Describe the bug I have the same issue as on https://github.com/ncbi/pgap/issues/202 on our Rocky 8 SLURM cgroup v2 HPC. Some experimentation suggests that there is no need to do run taskset in pgap.py, at least under certain circumstances, see below.

Suggestion I think pgap.py should have a flag --no-assign-cpu-affinities to disable this. Otherwise, I have to check for updates manually and edit the pgap.py file every time I forget to run pgap.py without --no-self-update.

To Reproduce I can check the current CPU affinities like this:

$ taskset -pc $(echo $$)
pid 123456's current affinity list: 21,22,26

These affinities are correctly propagated to apptainer containers:

$ apptainer exec --pwd /pgap docker://ncbi/pgap:2023-10-03.build7061 taskset -pc $(echo $$)
pid 123456's current affinity list: 21,22,26

Software versions (please complete the following information):

Additional info In case it's somehow useful, to get the assigned CPUs in Python, do this:

import subprocess
assigned_cpus = subprocess.check_output(["bash", "-c", "taskset -pc $(echo $$)"]).decode().split()[-1]
# self.cmd.extend(['/bin/taskset', '-c', assigned_cpus])
azat-badretdin commented 5 months ago

Hi, Thomas!

Thank you for reporting this problem.

I think pgap.py should have a flag --no-assign-cpu-affinities to disable this.

The introduction of a call to taskset was a result of trying to implement limits on the number of cores used by our docker image and the binaries inside. Removal of this call will result in the absence of this restriction and the application will use all available cores. If this is what you want, would you be open to considering the following workaround?

Workaround

Currently the call to taskset happens if this function returns non-zero value of cores requested (in command line, and recently, from the environment as well (PGAPX-1095)):

def get_cpus(self):
    if 'SLURM_CPUS_PER_TASK' in os.environ:
        return int(os.environ['SLURM_CPUS_PER_TASK'])
    elif 'NSLOTS' in os.environ:
        return int(os.environ['NSLOTS'])
    elif self.params.args.cpus:
        return self.params.args.cpus 
    else:
        return 0 

As you can see it is called specifically for the case of SLURM environment.

Here is a workaround: unsetting SLURM_CPUS_PER_TASK for the sake of PGAP run will lead to the same result as the proposed flag I believe:

env -u SLURM_CPUS_PER_TASK ./pgap.py ...

What do you think?

MrTomRod commented 5 months ago

Ok, that's fine. Thanks!