radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

testing no-db2 hangs #3021

Closed: AymenFJA closed this issue 12 months ago

AymenFJA commented 1 year ago

I was testing the devel_nodb_2 branch with the RPEX integration, using a simple bag-of-tasks (executables only) example. My setup uses a pre-existing environment for both the client and the agent, with the following stack:

  python               : /home/aymen/ve/rpex/bin/python3
  pythonpath           :
  version              : 3.8.10
  virtualenv           : /home/aymen/ve/rpex

  radical.gtod         : 1.20.1
  radical.pilot        : 1.37.0-v1.36.0-600-g989bdc237@devel_nodb_2
  radical.saga         : 1.33.0
  radical.utils        : 1.40.0-v1.33.0-28-g3c2e56a9@devel_nodb_2

The example hangs forever unless I Ctrl+C it. I noticed a few errors in the agent session (bootstrap_0.out and the component logs):

# Launching radical-pilot-agent
ntphost: 46.101.140.169
PING 46.101.140.169 (46.101.140.169) 56(84) bytes of data.
64 bytes from 46.101.140.169: icmp_seq=1 ttl=40 time=105 ms

--- 46.101.140.169 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 105.203/105.203/105.203/0.000 ms
missing 'src' -- prepare env from current env
agent 14817 is gone
agent 14817 is final
agent 14817 is final (0)
agent_executing.0000.log:1693235930.729 : agent_executing.0000 : 15016 : 140498018748160 : WARNING  : === hb agent_executing.0000 fail  rpex.session.surfacebook.aymen.019597.0002: fatal (15016)
agent_scheduling.0000.log:1693235931.001 : agent_scheduling.0000 : 15041 : 140146108258048 : WARNING  : === hb agent_scheduling.0000 fail  rpex.session.surfacebook.aymen.019597.0002: fatal (15041)
agent_staging_input.0000.log:1693235930.399 : agent_staging_input.0000 : 15094 : 140572845131520 : WARNING  : === hb agent_staging_input.0000 fail  rpex.session.surfacebook.aymen.019597.0002: fatal (15094)
agent_staging_output.0000.log:1693235930.974 : agent_staging_output.0000 : 15143 : 140512891737856 : WARNING  : === hb agent_staging_output.0000 fail  rpex.session.surfacebook.aymen.019597.0002: fatal (15143)

rpex_nodb2_agent.zip
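
For reference, the bag-of-tasks client mentioned above follows the usual RADICAL-Pilot pattern; a minimal sketch is below (resource label, task count, and executable are placeholders, and the pre-existing virtualenv is assumed to be selected via the resource configuration, which is not shown here):

import radical.pilot as rp

# create a session and the pilot / task managers
session = rp.Session()
pmgr    = rp.PilotManager(session=session)
tmgr    = rp.TaskManager(session=session)

# describe and submit a pilot (resource label and sizes are placeholders)
pd    = rp.PilotDescription({'resource': 'local.localhost',
                             'cores'   : 4,
                             'runtime' : 30})
pilot = pmgr.submit_pilots(pd)
tmgr.add_pilots(pilot)

# submit a small bag of executable-only tasks
tds = []
for _ in range(16):
    td            = rp.TaskDescription()
    td.executable = '/bin/date'
    tds.append(td)

tmgr.submit_tasks(tds)
tmgr.wait_tasks()

session.close(download=True)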

andre-merzky commented 1 year ago

Thanks @AymenFJA - can you please attach the session tarball?

andre-merzky commented 1 year ago

I think this is the cause of the problem:

$ head -n 1 master.000000.worker.0000.err
srun: Warning: can't honor --ntasks-per-node set to 40 which doesn't match the requested tasks 398 with the number of requested nodes 10. Ignoring --ntasks-per-node.
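
One way to read that warning (an interpretation, not stated explicitly here): srun was asked for 398 tasks on 10 nodes while --ntasks-per-node was set to 40, and since 10 nodes * 40 tasks/node = 400 != 398 requested tasks, the per-node setting is inconsistent with the request and srun drops it.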

Something is up with the core count for the worker. Any idea what's up with that?

andre-merzky commented 1 year ago

But I would also like to have a look at the client side logs if you don't mind.

AymenFJA commented 1 year ago

Update:

  1. The client-side logs were shared with Andre on the Slack channel.
  2. AGENT_STAGING_INPUT_PENDING results in a timeout; this will be discussed today, alongside the ticket itself, with Andre in a 1-1 meeting.
  3. Mikhail requested that the AGENT_STAGING_INPUT_PENDING sessions be shared with him; they will be posted on the Slack channel.

AymenFJA commented 1 year ago

Update: Andre and I had a 1-1 session to discuss the issue of RPEX hanging, mainly on prepare_env. RP creates the environment successfully, but the call never returns.

I quote: "The problem is likely that a RPC response message does not make it back to the client after a prepare_env call, and thus the client hangs"

Andre is currently working on it.
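
For context, the client-side call that triggers this RPC round-trip is pilot.prepare_env(); a minimal sketch of such a call follows (environment name and spec are illustrative, not the ones from this run):

# assumes `pilot` is an active rp.Pilot instance, as in the client script above
pilot.prepare_env(env_name='test_env',
                  env_spec={'type' : 'venv',
                            'setup': ['numpy']})

# tasks can then request the named environment
td            = rp.TaskDescription()
td.executable = '/bin/true'
td.named_env  = 'test_env'

The hang described above shows up as prepare_env() never returning even though the environment is created on the agent side.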

andre-merzky commented 1 year ago

This should be resolved now: the RPC messages get proxied and replied to as intended...

AymenFJA commented 1 year ago

I still see the same behavior. I have an open discussion with Andre regarding that. Update: this behavior no longer exists.

Another issue seems to be happening: RAPTOR workers are not being launched.

Checking the agent_executing.0000.log:

1695402928.154 : agent_executing.0000 : 29156 : 139886204020480 : DEBUG    : advance bulk: 1 [True, True, AGENT_STAGING_OUTPUT_PENDING]
1695402928.154 : agent_executing.0000 : 29156 : 139886204020480 : DEBUG    : put bulk AGENT_STAGING_OUTPUT_PENDING: 1: agent_staging_output_queue
1695402929.061 : agent_executing.0000 : 29156 : 139886204020480 : INFO     : Task master.000000.worker.0000 has return code 1.
1695402929.062 : agent_executing.0000 : 29156 : 139886204020480 : DEBUG    : advance bulk: 1 [True, True, AGENT_STAGING_OUTPUT_PENDING]
1695402929.062 : agent_executing.0000 : 29156 : 139886204020480 : DEBUG    : put bulk AGENT_STAGING_OUTPUT_PENDING: 1: agent_staging_output_queue

The master's worker task is not running; it exits with return code 1.
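
For context, a generic RAPTOR master/worker setup (not the RPEX-specific wrapper used here) looks roughly like the sketch below; task modes and rank counts are assumptions. A worker task finishing with return code 1, as in the log above, means the worker process fails before it can serve any tasks:

# assumes `pilot` is an active rp.Pilot instance; sizes are illustrative
master_td = rp.TaskDescription({'mode': rp.RAPTOR_MASTER})
worker_td = rp.TaskDescription({'mode' : rp.RAPTOR_WORKER,
                                'ranks': 2})

master  = pilot.submit_raptors([master_td])[0]
workers = master.submit_workers([worker_td])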

AymenFJA commented 1 year ago

Ticket related to the behavior in the last comment above: https://github.com/radical-cybertools/radical.pilot/issues/3036

AymenFJA commented 12 months ago

This has been tested and it is fixed.