Closed bjreisman closed 3 years ago
Internal ticket PGAPX-786 opened, we will look at this soon.
it still seems to be using all 16 cores.
FYI: if this is based on the "CPU cores": 16 output line, it actually contains the real number of CPUs in the system, not the CPUs requested. As your report shows, the actually executed docker command line contains the --cpus 10 parameter.
Please let me know if this clears the problem.
Hmm... that appears to fix the CPU usage problem, but doesn't fix the larger problem, which seems to be an IO bottleneck at the cluster_blastp_wnode stage. It looks like there are still 16 instances of cluster_blastp_wnode attempting to run. I assumed that it was one per core. Is that not the case?
I assumed that it was one per core. Is that not the case?
Yes. It's supposed to work this way. I will check.
Could you please post relevant parts of cwltool.log as before?
Certainly, it's running now, but I've attached it below: cwltool.log
It appears that Docker's --cpus option does not alter what is presented via APIs such as /proc/cpuinfo. As a result, PGAP does run slower, getting throttled (by cgroups resource limits), but this does not reduce the number of threads and memory pressure as we expected. We'll have to investigate an appropriate fix.
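To illustrate the behaviour described above, here is a minimal Python sketch (not PGAP's code) of how a process inside a container could derive an effective CPU count from the cgroup quota instead of from /proc/cpuinfo. The file paths are the usual cgroup v2 and v1 locations; the function names are made up for this example, and rounding the quota up to a whole number of CPUs is an assumption.

```python
import math
import os
from pathlib import Path
from typing import Optional


def cpus_from_cpu_max(text: str) -> Optional[float]:
    """Parse cgroup v2 cpu.max contents: "<quota> <period>" or "max <period>"."""
    quota, period = text.split()
    if quota == "max":
        return None  # no CPU limit configured
    return int(quota) / int(period)


def cpus_from_cfs(quota_us: int, period_us: int) -> Optional[float]:
    """cgroup v1 variant: a quota of -1 means unlimited."""
    if quota_us <= 0 or period_us <= 0:
        return None
    return quota_us / period_us


def effective_cpu_count() -> int:
    """Return the smaller of the visible CPU count and the cgroup CPU limit."""
    visible = os.cpu_count() or 1
    limit = None
    v2 = Path("/sys/fs/cgroup/cpu.max")
    v1_quota = Path("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
    v1_period = Path("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
    try:
        if v2.exists():
            limit = cpus_from_cpu_max(v2.read_text())
        elif v1_quota.exists() and v1_period.exists():
            limit = cpus_from_cfs(int(v1_quota.read_text()),
                                  int(v1_period.read_text()))
    except (OSError, ValueError):
        limit = None  # unreadable or unexpected format: fall back to visible
    if limit is None:
        return visible
    return max(1, min(visible, math.ceil(limit)))


if __name__ == "__main__":
    print("visible CPUs:", os.cpu_count())
    print("effective CPUs:", effective_cpu_count())
```

With docker run --cpus 10, the quota is ten times the period, so the parsed limit is 10 even though os.cpu_count() still reports every core of the host.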
Thanks, I see you are using --cpus 8
yup! I tried dropping from 12 to 8 to see if that would help. I've included a snapshot of the CPU stats from vmstat below:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- -----timestamp-----
r b swpd free inact active si so bi bo in cs us sy id wa st CDT
5 3 1027 290 7346 22466 0 0 155 76 40 141 31 3 56 9 0 2020-06-01 13:49:27
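For readers unfamiliar with vmstat, the following Python sketch labels the columns of the sample line above. It only re-reads the numbers already posted; the function name is hypothetical. In vmstat, "b" counts processes in uninterruptible sleep (typically blocked on IO) and "wa" is the percentage of CPU time spent waiting for IO.

```python
# Column names and values copied from the vmstat snapshot above
# (the trailing timestamp column is left out of the header on purpose).
header = "r b swpd free inact active si so bi bo in cs us sy id wa st"
sample = "5 3 1027 290 7346 22466 0 0 155 76 40 141 31 3 56 9 0 2020-06-01 13:49:27"


def parse_vmstat_row(header_line: str, row: str) -> dict:
    """Pair each named column with its integer value."""
    names = header_line.split()
    values = row.split()
    # zip stops after the named columns, so the timestamp tokens are ignored.
    return {name: int(value) for name, value in zip(names, values)}


stats = parse_vmstat_row(header, sample)
print(f"runnable={stats['r']} blocked={stats['b']} "
      f"user={stats['us']}% idle={stats['id']}% iowait={stats['wa']}%")
```

Read this way, the snapshot shows 3 blocked processes and 9% IO wait while 56% of CPU time is idle.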
PGAP also ignores the --cpu flag when using singularity instead of docker. Using all available CPUs is causing the workflow to run out of memory. Should this be a separate issue?
Should this be a separate issue?
Yes. Thanks for reporting!
As a status update: we are actively exploring a solution for this issue, but we are not yet ready to commit to a timeline for releasing a fix.
I have a similar issue: the --cpus flag does not work. Is there any other way to control how many CPUs are used?
Hello Azat,
This is Marie again :)
I think I may have the same issue but I am not sure (I'm not comfortable with Docker yet). After you helped me with issue #129, I tried to run without a --cpu flag and I got:
Original command: ./pgap.py -r -o TemS.CL/results TemS.CL/TemS_S96.generic.yaml
Docker command: /usr/bin/docker run -i --rm --user 1000:1000 --volume /home/adm-loc/Tools/pgap/input-2021-01-11.build5132:/pgap/input:ro,z --volume /home/adm-loc/Tools/pgap/TemS.CL:/pgap/user_input:z --volume /home/adm-loc/Tools/pgap/TemS.CL/pgap_input_b9bnz7y3.yaml:/pgap/user_input/pgap_input.yaml:ro,z --volume /tmp:/tmp:rw,z --volume /home/adm-loc/Tools/pgap/TemS.CL/results:/pgap/output:rw,z ncbi/pgap:2021-01-11.build5132 cwltool --timestamps --debug --disable-color --preserve-entire-environment --outdir /pgap/output pgap/pgap.cwl /pgap/user_input/pgap_input.yaml
STDOUT/STDERR:
PGAP version 2021-01-11.build5132 is up to date.
Output will be placed in: /home/adm-loc/Tools/pgap/TemS.CL/results
WARNING: memory per CPU core (GiB) is less than the recommended value of 2
PGAP failed, docker exited with rc = 1
Unable to find error in log file.
Runtime Report from the cwltool.log:
"CPU cores": 32,
"Docker image": "ncbi/pgap:2021-01-11.build5132",
"cpu flags": "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d",
"cpu model": "Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz",
"max user processes": "unlimited",
"memory (GiB)": 15.6,
"memory per CPU core (GiB)": 0.5,
"open files": 1048576,
"tmp disk space (GiB)": 1235.0,
"virtual memory": "unlimited",
"work disk space (GiB)": 1235.0
I thus tried to only use 6 cores by running: ./pgap.py -r -o TemS.CL/results TemS.CL/TemS_S96.generic.yaml --cpus 6
But got the same STDOUT/STDERR, same Runtime report and (if I'm correct), same Docker command (which I paste just in case): Docker command: /usr/bin/docker run -i --rm --user 1000:1000 --volume /home/adm-loc/Tools/pgap/input-2021-01-11.build5132:/pgap/input:ro,z --volume /home/adm-loc/Tools/pgap/TemS.CL:/pgap/user_input:z --volume /home/adm-loc/Tools/pgap/TemS.CL/pgap_input__zxn00sh.yaml:/pgap/user_input/pgap_input.yaml:ro,z --volume /tmp:/tmp:rw,z --volume /home/adm-loc/Tools/pgap/TemS.CL/results:/pgap/output:rw,z ncbi/pgap:2021-01-11.build5132 /bin/taskset -c 0-5 cwltool --timestamps --debug --disable-color --preserve-entire-environment --outdir /pgap/output pgap/pgap.cwl /pgap/user_input/pgap_input.yaml
Could you confirm that I got the same issue and just have to wait for the next release? If so, any idea when this would be?
Thanks a lot!
Best, Marie
Could you confirm that I got the same issue and just have to wait for the next release?
No. Not without the actual log.
In our environment we test, for historical reasons, on computers with 4 GB/core (that's the AWS setting). Since you have only 16 GB, could you please try --cpu 4?
PS. 16 GB is pretty low memory nowadays, even for Windows 10, and especially for running heavy computations.
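The arithmetic behind this suggestion can be sketched in a few lines of Python. The 15.6 GiB total and the 32 cores come from the runtime report above, the 2 GiB/core threshold comes from the WARNING text, and 4 GB/core is the test setting mentioned in this reply. The function names are hypothetical, and whether PGAP rounds exactly this way is an assumption.

```python
import math

RECOMMENDED_GIB_PER_CORE = 2.0  # threshold quoted in the WARNING message


def memory_per_core(total_gib: float, cpus: int) -> float:
    """Memory available to each CPU if the total is shared evenly."""
    return total_gib / cpus


def max_cpus_for(total_gib: float, gib_per_core: float) -> int:
    """Largest CPU count that still leaves gib_per_core for every CPU."""
    return max(1, math.floor(total_gib / gib_per_core))


total = 15.6  # "memory (GiB)" from the runtime report

# All 32 cores: about 0.5 GiB per core, which triggers the warning.
print(round(memory_per_core(total, 32), 1))

# Staying at or above the recommended 2 GiB per core.
print(max_cpus_for(total, RECOMMENDED_GIB_PER_CORE))

# Matching the 4 GB/core test environment.
print(max_cpus_for(total, 4.0))
```

By this estimate, at most 7 CPUs keep the machine above the recommended 2 GiB per core, and 3 CPUs match the 4 GB/core test setup, so --cpu 4 lands close to the tested configuration.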
Any of the sequences marked as plasmids in FASTA headers in your test case, Marie?
Regarding confirmation that only 6 CPUs are requested, note the /bin/taskset -c 0-5 in the logged command line. That restricts execution to CPUs 0 through 5 (6 CPUs total). There are other more detailed log files (in debug mode) which will also report the number of CPUs being used, which should further confirm the setting.
In contrast, the runtime report at the top shows how many are available in total (we should probably amend the warning message to be less confusing).
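The difference between "CPUs available" and "CPUs this process may use" can be checked from Python. The sketch below is an illustration, not part of PGAP: parse_cpu_list is a hypothetical helper that expands a taskset-style list such as 0-5 (it does not handle the stride syntax that taskset also accepts), and os.sched_getaffinity is only available on Linux, hence the fallback.

```python
import os


def parse_cpu_list(spec: str) -> set:
    """Expand a CPU list such as "0-5" or "0,2,4-5" into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        elif part:
            cpus.add(int(part))
    return cpus


def usable_cpu_count() -> int:
    """CPUs the current process is allowed to run on (honours taskset)."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1  # non-Linux fallback: all visible CPUs


if __name__ == "__main__":
    print("CPUs named by '-c 0-5':", len(parse_cpu_list("0-5")))
    print("CPUs visible to the OS:", os.cpu_count())
    print("CPUs usable by this process:", usable_cpu_count())
```

Run under taskset -c 0-5 on a 32-core Linux machine, the last line would report 6 while os.cpu_count() would still report 32, which mirrors the gap between the logged command line and the runtime report.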
Thanks for your answers!
I will comment on each point:
I just tried --cpu 4, but it still fails (I attach the log) cwltool.log
I understand that 16 GB isn't much, but I successfully annotated a bunch of genomes locally before (though it took hours every time indeed). I add that I didn't have to use this --cpu parameter then; would it help if I also attached a log from one of these?
I found no sequence marked as plasmid but I still attach the fasta in case you find something weird (renamed as .fasta.txt bc .fasta wouldn't be attached for some reason) TemS_S96.agp.fasta.txt
the runtime report is confusing indeed but now I understand that the flag does get passed to Docker. What's wrong then? Maybe I should open a new issue?
The cwltool.log says:
[2021-03-01 14:11:59] INFO Resolved 'pgap/pgap.cwl' to 'file:///pgap/pgap/pgap.cwl'
[2021-03-01 14:11:59] ERROR I'm sorry, I couldn't load this CWL file.
and later:
found duplicate key "report_usage" with value "True"
Most likely your input YAML file contains a report_usage: setting. If so, please try to remove that setting and run it again.
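A quick way to spot this kind of problem before launching a run is to scan the YAML text for repeated top-level keys. The Python sketch below is a deliberately naive, standard-library-only check (it is not a YAML parser and only looks at unindented "key:" lines). The function name and the sample file contents are hypothetical; the sample simply simulates a file in which report_usage ends up defined twice.

```python
import re
from collections import Counter

# Matches an unindented "key:" at the start of a line.
TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][\w.-]*)\s*:")


def duplicate_top_level_keys(yaml_text: str) -> list:
    """Return the top-level keys that appear more than once, sorted."""
    counts = Counter()
    for line in yaml_text.splitlines():
        match = TOP_LEVEL_KEY.match(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical input, for illustration only.
sample = """\
fasta:
  class: File
  location: genome.fasta
report_usage: true
report_usage: true
"""

print(duplicate_top_level_keys(sample))
```

For the sample above, the function reports report_usage as the only duplicated key; indented keys such as class and location are ignored.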
Hey Azat,
I did have a: report_usage: true setting in the generic.yaml file.
I removed it and still got a: WARNING: memory per CPU core (GiB) is less than the recommended value of 2 But now it is running :)
Thanks for helping again!
Best, Marie
You are welcome, Marie!
I'm trying to run PGAP on our local machine with 16 cores and 32 GB of RAM, which for some reason comes out to 1.9 GB of RAM per core. The MG37 dataset completed just fine, but I'm running into problems on my own genomes (~10MB). I thought the issue might be a mismatch between RAM and cores which could be solved by requesting fewer cores, but when I set the --cpus option to 10 (for example), it still seems to be using all 16 cores.
Expected behavior: I'd like to allocate a specific number of cores to the pgap docker container. It's possible (likely) there's another way to do this that I missed, but I thought the --cpus option would do the trick.
Software versions (please complete the following information):
docker --version: Docker version 19.03.9, build 9d988398e7
Log Files (first few lines of cwltool.log, happy to share the rest if needed)