Closed AymenFJA closed 12 months ago
Thanks @AymenFJA - can you please attach the session tarball?
I think this is the cause of the problem:
$ head -n 1 master.000000.worker.0000.err
srun: Warning: can't honor --ntasks-per-node set to 40 which doesn't match the requested tasks 398 with the number of requested nodes 10. Ignoring --ntasks-per-node.
Something is up with the core count for the worker. Any idea what's up with that?
But I would also like to have a look at the client side logs if you don't mind.
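For reference, a quick sanity check of the numbers in that warning. This is only a sketch: it assumes srun expects the task count to equal nodes times tasks-per-node, which is how the warning reads, and `check_srun_layout` is a hypothetical helper, not part of RP or Slurm.

```python
def check_srun_layout(ntasks, nodes, ntasks_per_node):
    """Return (consistent, capacity, shortfall) for a requested task layout.

    Assumption: the layout is consistent only if ntasks equals
    nodes * ntasks_per_node.
    """
    capacity = nodes * ntasks_per_node
    return ntasks == capacity, capacity, capacity - ntasks


# Values taken from the srun warning in master.000000.worker.0000.err
consistent, capacity, shortfall = check_srun_layout(398, 10, 40)
print(consistent, capacity, shortfall)  # False 400 2
```

So the worker asks for 2 tasks fewer than the 10 nodes can hold at 40 tasks per node, which would explain why srun drops --ntasks-per-node.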
Update: AGENT_STAGING_INPUT_PENDING results in a timeout. This will be discussed today, alongside the ticket itself, with Andre in a 1-1 meeting. The AGENT_STAGING_INPUT_PENDING sessions will be shared with him on the Slack channel.
Update: Andre and I had a 1-1 session, mainly discussing the issue of RPEX hanging on prepare_env. The observed behavior is that RP creates the environment successfully, but the call never returns.
I quote: "The problem is likely that a RPC response message does not make it back to the client after a prepare_env call, and thus the client hangs"
Andre is currently working on it.
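To illustrate the failure mode described in the quote, here is a minimal sketch of a client that waits for an RPC reply with a timeout instead of blocking forever. It is not RP's actual RPC code: `rpc_call`, the `send` callable, and the reply queue are hypothetical stand-ins for illustration only.

```python
import queue


def rpc_call(send, replies, timeout=5.0):
    """Send a request, then wait for the reply; raise instead of hanging.

    send    : hypothetical callable that submits the request
    replies : queue.Queue on which the reply is expected to arrive
    """
    send()
    try:
        return replies.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError('no RPC reply within %.1fs' % timeout) from None


replies = queue.Queue()

# Reply makes it back to the client: the call returns.
print(rpc_call(lambda: replies.put('env ready'), replies, timeout=0.5))

# Reply is lost: without the timeout this would hang like the client does.
try:
    rpc_call(lambda: None, replies, timeout=0.1)
except TimeoutError as exc:
    print('timed out:', exc)
```

A timeout like this would not fix the lost reply, but it would turn the silent hang into an error that shows up in the client logs.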
This should be resolved now, the RPC messages get proxied and replied as intended...
I still see the same behavior. I have an open discussion with Andre regarding that.
Update: This behavior no longer exists.
Another issue seems to be happening: RAPTOR workers are not being launched.
Checking the agent_executing.0000.log:
1695402928.154 : agent_executing.0000 : 29156 : 139886204020480 : DEBUG : advance bulk: 1 [True, True, AGENT_STAGING_OUTPUT_PENDING]
1695402928.154 : agent_executing.0000 : 29156 : 139886204020480 : DEBUG : put bulk AGENT_STAGING_OUTPUT_PENDING: 1: agent_staging_output_queue
1695402929.061 : agent_executing.0000 : 29156 : 139886204020480 : INFO : Task master.000000.worker.0000 has return code 1.
1695402929.062 : agent_executing.0000 : 29156 : 139886204020480 : DEBUG : advance bulk: 1 [True, True, AGENT_STAGING_OUTPUT_PENDING]
1695402929.062 : agent_executing.0000 : 29156 : 139886204020480 : DEBUG : put bulk AGENT_STAGING_OUTPUT_PENDING: 1: agent_staging_output_queue
The master.worker is not running and is exiting with code 1.
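In case it helps with checking other sessions, a small sketch for pulling failed tasks out of an executor log. It assumes the "Task <uid> has return code <n>" message format seen in the lines above; `failed_tasks` is a hypothetical helper, not an RP utility.

```python
import re

# Matches messages like: "Task master.000000.worker.0000 has return code 1."
RC_LINE = re.compile(r'Task (?P<uid>\S+) has return code (?P<rc>-?\d+)')


def failed_tasks(lines):
    """Map task uid -> return code for every task with a non-zero code."""
    failed = {}
    for line in lines:
        match = RC_LINE.search(line)
        if match and int(match.group('rc')) != 0:
            failed[match.group('uid')] = int(match.group('rc'))
    return failed


# Two of the log lines quoted above
sample = [
    '1695402929.061 : agent_executing.0000 : 29156 : 139886204020480 : '
    'INFO : Task master.000000.worker.0000 has return code 1.',
    '1695402929.062 : agent_executing.0000 : 29156 : 139886204020480 : '
    'DEBUG : put bulk AGENT_STAGING_OUTPUT_PENDING: 1: agent_staging_output_queue',
]
print(failed_tasks(sample))  # {'master.000000.worker.0000': 1}
```

To scan a whole log, pass an open file handle for agent_executing.0000.log instead of the sample list.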
Related ticket to behavior in the last comment above: https://github.com/radical-cybertools/radical.pilot/issues/3036
This is tested and it is fixed/done.
I was testing @devel_nodb_2 with the RPEX integration, using a simple bag-of-tasks example (executables only). My setup uses a pre-existing environment for the client and the agent, with the following stack:
The example hangs forever unless I ctrl+c it. I noticed a few errors in the agent session: bootstrap_0.out
rpex_nodb2_agent.zip