Open YehorYudinIPP opened 1 year ago
Hi Yehor,
could you check if there is a .qcg* directory in the working directory (the same directory where the api.log file is)? Please send us the contents of this directory; it would help us a lot in checking what is wrong.
Hi Bartek,
thanks for the reply. Sure, there is a .qcg* folder for this run.
I am attaching an archive of it here
qcgpjm-service-co3003.8291.tar.gz
After the analysis, it looks like there was a problem with starting one of the 220 QCG-PJ agents, each of which is associated with a particular node; however, it is difficult to deduce from the logs why it didn't start up. Nevertheless, to overcome such issues, QCGPJ provides a special configuration option, --nl-ready-treshold=NUM (yes, treshold without the first h; we will fix this typo soon), where NUM is a number from 0 to 1.0 that defines the fraction of agents that need to be started before computations are launched. By default it is set to 1.0, which means that all agents need to be ready, but you can set it to something smaller, e.g. 0.95, to require only 95% of agents to be ready.
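To illustrate what such a fraction means for this run, here is a minimal sketch. It is not QCG-PilotJob code: the helper name `enough_agents_ready` is hypothetical, and the way QCG-PJ actually compares or rounds the value internally is an assumption.

```python
def enough_agents_ready(ready: int, total: int, threshold: float = 1.0) -> bool:
    """Hypothetical helper: True if the fraction of ready agents reaches the threshold."""
    if total <= 0:
        raise ValueError("total must be positive")
    return ready / total >= threshold


# One of 220 agents failed to start, as in the logs discussed above.
print(enough_agents_ready(219, 220))        # default threshold 1.0: every agent is required
print(enough_agents_ready(219, 220, 0.95))  # threshold 0.95: 95% of agents are enough
```

With the default threshold, a single missing agent blocks the start; with 0.95, up to 5% of the agents may be missing.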
In order to add this configuration option to QCGPJ, you need to create your own QCGPJExecutor object, providing the --nl-ready-treshold option in the other_args list to the constructor. Then, inside the EasyVVUQ part, the QCGPJExecutor can be passed to the QCGPJPool constructor as the qcgpj_executor parameter.
I hope it will help to mitigate your issues. If something is not clear please write.
Thanks a lot @bartoszbosak! I had to take a break for some other work, but now I am back to this issue. I have a very basic Python question on how to pass the arguments you described. Should the constructor call for QCGPJExecutor look like this?
QCGPJExecutor(log_level=log_level, other_args=['--nl-ready-treshold=0.95'])
Yes, it should work as you presented. Here is the reference for QCGPJExecutor: https://qcg-pilotjob.readthedocs.io/en/develop/api/qcg.pilotjob.executor_api.qcgpj_executor.html#module-qcg.pilotjob.executor_api.qcgpj_executor
QCGPJExecutor is a wrapper over LocalManager and, as you can see, it takes a list of strings as other_args. The full reference of possible options is available here: https://github.com/psnc-qcg/QCG-PilotJob/blob/2059c9fc36a913930010e43b515a5558ecd083e3/components/core/qcg/pilotjob/api/manager.py#L832
Thanks again, I have already briefly looked into the source code for the class. I must be doing something wrong, because
QCGPJExecutor(log_level=log_level, other_args=['--nl-ready-treshold=0.95'])
gives me the following error:
qcgpj_executor=QCGPJExecutor(log_level=log_level, other_args=['--nl-ready-treshold=0.95']),
^
SyntaxError: positional argument follows keyword argument
!>> Exception during batch execution! :
__init__() got an unexpected keyword argument 'other_args'
However, this version:
qcgpj_executor=QCGPJExecutor(log_level=log_level, ['--nl-ready-treshold=0.95']),
gives me this error:
qcgpj_executor=QCGPJExecutor(log_level=log_level, ['--nl-ready-treshold=0.95']),
^
SyntaxError: positional argument follows keyword argument
Is this how Python's positional, keyword, list, and dictionary function arguments are supposed to work?
Oh... you are right, this is a positional argument, so it won't work the way I thought (sorry, I do not use this option frequently).
I suppose the solution may be to provide the full list of preceding arguments in the constructor and finally '--nl-ready-treshold=0.95' as the last one.
Something like this: QCGPJExecutor(wd='.', resources=None, reserve_core=False, enable_rt_stats=False, wrapper_rt_stats=None, log_level='info', '--nl-ready-treshold=0.95')
Let me know if it helps.
Actually, I also tried it, and it returns the same:
'--nl-ready-treshold=0.95'),
^
SyntaxError: positional argument follows keyword argument
QCGPJExecutor('.', None, False, False, None, log_level, '--nl-ready-threshold=0.95')
should do it. It's a tricky thing that Python seemingly doesn't have a good solution for...
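The behaviour discussed in this thread can be reproduced without QCG-PilotJob. The sketch below uses a stand-in function, `make_executor` (hypothetical, not the real QCGPJExecutor), whose signature is assumed to follow the pattern suggested by the errors above: ordinary parameters with defaults, followed by a variadic `*other_args`. The option string `'--opt=1'` is likewise just a placeholder.

```python
def make_executor(wd='.', log_level='info', *other_args):
    """Hypothetical stand-in for a constructor that ends with *other_args."""
    return wd, log_level, other_args


# 1) A positional argument after a keyword argument is rejected when the code
#    is compiled, before anything runs.
try:
    compile("make_executor(log_level='debug', '--opt=1')", '<demo>', 'eval')
    syntax_error = None
except SyntaxError as exc:
    syntax_error = exc.msg

# 2) *other_args is not a keyword parameter, so naming it fails at call time.
try:
    make_executor(log_level='debug', other_args=['--opt=1'])
    type_error = None
except TypeError as exc:
    type_error = str(exc)

# 3) Passing all preceding arguments positionally lets the extra value
#    reach *other_args.
result = make_executor('.', 'debug', '--opt=1')

print(syntax_error)
print(type_error)
print(result)
```

Note that inside the function the extra values arrive as a tuple, which matters for whatever code later forwards them.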
Thank you! That seems to work, at least locally. I will get back once the SLURM queue lets me in and I get the runs' results, or new errors.
Yeah, thanks @LourensVeen! It is even a bit logical ;) Only one minor comment from my side: it should still be treshold, not threshold ;)
I fixed the typo though, and an error appeared down the line:
usage: gem_multi_ft.py [-h] [--net] [--net-port NET_PORT] [--net-pub-port NET_PUB_PORT]
[--net-port-min NET_PORT_MIN] [--net-port-max NET_PORT_MAX] [--file]
[--file-path FILE_PATH] [--wd WD] [--envschema ENVSCHEMA] [--resources RESOURCES]
[--report-format REPORT_FORMAT] [--report-file REPORT_FILE] [--nodes NODES]
[--log {critical,error,warning,info,debug,notset}] [--system-core] [--disable-nl]
[--show-progress] [--governor] [--parent PARENT] [--id ID] [--tags TAGS]
[--slurm-partition-nodes SLURM_PARTITION_NODES]
[--slurm-limit-nodes-range-begin SLURM_LIMIT_NODES_RANGE_BEGIN]
[--slurm-limit-nodes-range-end SLURM_LIMIT_NODES_RANGE_END]
[--slurm-resources-file SLURM_RESOURCES_FILE] [--resume RESUME]
[--enable-proc-stats] [--enable-rt-stats] [--wrapper-rt-stats WRAPPER_RT_STATS]
[--nl-init-timeout NL_INIT_TIMEOUT] [--nl-ready-treshold NL_READY_TRESHOLD]
[--disable-pub] [--nl-start-method NL_START_METHOD]
gem_multi_ft.py: error: unrecognized arguments: ('--nl-ready-treshold',)
Looks like the *other_args are interpreted as a tuple at some point...
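One way such a message could arise is sketched below. This is a guess, not something confirmed from the QCG-PilotJob source: it assumes that the wrapper adds the whole `other_args` tuple to the argument list as a single element and that the list is converted to strings before parsing. The function `build_args` and its `flatten` switch are hypothetical; only the option names are taken from the usage message above.

```python
import argparse


def build_args(log_level, *other_args, flatten):
    """Hypothetical wrapper logic that forwards extra options to a parser."""
    args = ['--log', log_level]
    if other_args:
        if flatten:
            args.extend(other_args)   # each option becomes its own element
        else:
            args.append(other_args)   # the whole tuple becomes ONE element
    return [str(a) for a in args]


parser = argparse.ArgumentParser(prog='demo')
parser.add_argument('--log')
parser.add_argument('--nl-ready-treshold', type=float, default=1.0)

nested = build_args('info', '--nl-ready-treshold=0.95', flatten=False)
flat = build_args('info', '--nl-ready-treshold=0.95', flatten=True)

# parse_known_args returns unrecognized arguments instead of exiting.
ns_nested, extra_nested = parser.parse_known_args(nested)
ns_flat, extra_flat = parser.parse_known_args(flat)

print(extra_nested, ns_nested.nl_ready_treshold)
print(extra_flat, ns_flat.nl_ready_treshold)
```

In the nested case the parser sees the text of a tuple rather than an option, so the option is never applied and the tuple text is reported as unrecognized.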
Could you give QCGPJExecutor('.', None, False, False, None, log_level, '--nl-ready-threshold', 0.95) a try?
If this doesn't help, I think we may need @pkopta
I have tried it already and it gave me:
usage: gem_multi_ft.py [-h] [--net] [--net-port NET_PORT] [--net-pub-port NET_PUB_PORT]
[--net-port-min NET_PORT_MIN] [--net-port-max NET_PORT_MAX] [--file]
[--file-path FILE_PATH] [--wd WD] [--envschema ENVSCHEMA] [--resources RESOURCES]
[--report-format REPORT_FORMAT] [--report-file REPORT_FILE] [--nodes NODES]
[--log {critical,error,warning,info,debug,notset}] [--system-core] [--disable-nl]
[--show-progress] [--governor] [--parent PARENT] [--id ID] [--tags TAGS]
[--slurm-partition-nodes SLURM_PARTITION_NODES]
[--slurm-limit-nodes-range-begin SLURM_LIMIT_NODES_RANGE_BEGIN]
[--slurm-limit-nodes-range-end SLURM_LIMIT_NODES_RANGE_END]
[--slurm-resources-file SLURM_RESOURCES_FILE] [--resume RESUME]
[--enable-proc-stats] [--enable-rt-stats] [--wrapper-rt-stats WRAPPER_RT_STATS]
[--nl-init-timeout NL_INIT_TIMEOUT] [--nl-ready-treshold NL_READY_TRESHOLD]
[--disable-pub] [--nl-start-method NL_START_METHOD]
gem_multi_ft.py: error: unrecognized arguments: ('--nl-ready-treshold', 0.95)
I am quite puzzled about that now
I am using a QCGPJExecutor within an EasyVVUQ campaign in a SLURM submission of 220 nodes on MPCDF's COBRA, and at the start of the execution I encounter an issue: my jobs do not start within the first 5 minutes.
My script catches the following exception:
Service not started in 300.3
And the api.log contains the following last lines:
I could not retrieve further information on errors.
Could it be solved by creating a LocalManager and passing it something like {'init_timeout': 600}, or would it be better to change the timeout through some environment variable? Could the source of the issue be not a short timeout but some silent errors at this stage? Has anyone encountered something like this? I would appreciate any information on what could be a solution before I just try out many things myself, as there are currently substantial waiting times on this HPC system.