ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline

PGAP failing with error "PGAP failed, docker exited with rc = 1 error", "permanentFail" #170

Closed Bill-Branson closed 2 years ago

Bill-Branson commented 2 years ago

Hello, we started seeing this problem several weeks back and have tried many iterations of different configurations. Our cluster uses SLURM as the job manager and runs CentOS 7.8.2003. We have a variety of nodes: standard compute, high-memory, and GPU nodes.

################## Problem is similar in error code to problem # 162. ##############

https://github.com/ncbi/pgap/issues/162

This problem presented itself to us several weeks ago, and we have been working through some of the suggestions presented in issue #162.

################## What ##########################

Application: PGAP 2021-07-01

Supporting software: Python 3.9.1

Problem: PGAP fails with a permanentFail error; Docker exits with rc = 1.

"PGAP failed, docker exited with rc = 1 error" " permanentFail"

This error is similar to the one reported in issue #162; I read that thread and tried several of the suggestions discussed there.

################################################

I have attached all of the pertinent files. The runs were done sequentially, and the numbers are SLURM job IDs: the lowest number is the earliest attempt, and the numbers increase with each subsequent attempt at a fix.

Let me know if you have any questions.

Thank you.

Sincerely

Bill Branson wbranson@mcw.edu

submol_OG1RF.txt slurm-78784out.log slurm-77450out.log slurm-76281out.log cwltool_77450.log cwltool_78784.txt cwltool_76281.txt cwltool_1.txt

azat-badretdin commented 2 years ago

Thank you for your comprehensive report, Bill!

I browsed through the cwltool logs, and I suspect that the problem is with the system configuration. Memory-wise the system is pretty hefty:

For example, the cwltool_78784.txt run reports:

    "CPU cores": 48,
    "Docker image": "ncbi/pgap:2021-07-01.build5508",
    "cpu flags": "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear spec_ctrl intel_stibp flush_l1d arch_capabilities",
    "cpu model": "Intel(R) Xeon(R) Gold 6240R CPU @ 2.40GHz",
    "max user processes": 4096,
    "memory (GiB)": 1510.4,
    "memory per CPU core (GiB)": 31.5,
    "open files": 4096,
    "tmp disk space (GiB)": 446.9,
    "virtual memory": "unlimited",
    "work disk space (GiB)": 5679.0

Plenty of memory per CPU core. But the number of open files is only 4096, which does not seem to have been bumped up.

The same open-files limit is reported in the other log files.

Could you please try to increase this number to unlimited (or some really high value)? Plenty of folks in our Issues report much higher values for this parameter (the "binary million", 1048576).

I suspect that this might be related to the high number of cores, which somehow demands a higher number of open files; see the sketch below for one way to check and raise the limit.
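
For reference, a minimal Python sketch (assuming a POSIX system; not part of PGAP) of checking the soft and hard open-files limits that a job actually sees and raising the soft limit as far as the hard limit allows. The 1048576 target is just the "binary million" mentioned above, not a PGAP requirement.

    # Hypothetical helper, not part of PGAP: inspect and raise the soft
    # open-files limit for the current process before launching pgap.py.
    # The soft limit can only be raised up to the hard limit imposed by
    # the cluster configuration.
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"open files: soft={soft}, hard={hard}")

    target = 1048576  # the "binary million" suggested above (2**20)
    if soft < target:
        new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        print(f"soft limit raised to {new_soft}")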

azat-badretdin commented 2 years ago

Please keep in mind that the general Docker setup also affects the number of open files.
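
One hedged way to verify the limit on the container side, assuming Docker is available on the node and reusing the image name from the log above; the --ulimit value is illustrative only:

    # Hypothetical check, not part of PGAP: report the open-files limit as
    # seen *inside* the container. Without --ulimit, this is governed by the
    # Docker daemon defaults ("default-ulimits" in daemon.json), not by the
    # submitting shell.
    import subprocess

    result = subprocess.run(
        ["docker", "run", "--rm", "--ulimit", "nofile=1048576:1048576",
         "ncbi/pgap:2021-07-01.build5508", "sh", "-c", "ulimit -n"],
        capture_output=True, text=True, check=True,
    )
    print("open files inside container:", result.stdout.strip())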

Bill-Branson commented 2 years ago

Hello everyone, I have been away for a few weeks. We looked at raising the ulimit past 4096; however, there is a hard limit in our overall cluster configuration, and raising that hard limit would directly affect every job running on the cluster afterwards. I am researching another way to raise the hard limit while keeping other programs from being affected by the change. I may need to raise the hard limit in the overall configuration and then throttle the affected programs on a case-by-case basis.

Thank you for all of your assistance and suggestions. I will keep you informed.

Sincerely

Bill Branson

azat-badretdin commented 2 years ago

Thanks for the update, Bill!

Bill-Branson commented 2 years ago

Hello everyone, the issue has been resolved. The user regenerated the YAML input file using a different tool, and the pipeline no longer needed the higher number of open files per core.
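
For anyone hitting the same error: the tool used here is not named, but a minimal sketch of writing the generic two-entry PGAP input YAML (the fasta/submol layout documented in the PGAP wiki) with PyYAML is below; the file names are placeholders for this OG1RF run, not files from this thread.

    # Minimal sketch of a generic PGAP input YAML. The fasta/submol layout
    # follows the PGAP documentation; "OG1RF.fasta" and "submol_OG1RF.yaml"
    # are placeholder file names.
    import yaml  # PyYAML

    pgap_input = {
        "fasta": {"class": "File", "location": "OG1RF.fasta"},
        "submol": {"class": "File", "location": "submol_OG1RF.yaml"},
    }

    with open("input.yaml", "w") as fh:
        yaml.safe_dump(pgap_input, fh, default_flow_style=False, sort_keys=False)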

Thank you for all of your patience and assistance.

Sincerely

Bill Branson

azat-badretdin commented 2 years ago

You are welcome!