radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Parsl-RP (RPEX): MPI RP local test hangs on Parsl CI. #3116

Closed AymenFJA closed 8 months ago

AymenFJA commented 8 months ago

This is the main ticket: https://github.com/Parsl/parsl/issues/3013

A possible cause is a hanging RP process. This assumption still needs to be confirmed.

Note: A call is proposed today at 11 EST to discuss this issue with Kevin from the Globus team.

AymenFJA commented 8 months ago

Update: the test was hanging because the CI test environment was not set up correctly: mpi4py, which the RP runtime system requires, was missing.
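For a CI setup like this, a check that fails fast may help surface a missing dependency before the test starts. The following is only a minimal sketch: `missing_modules` and `require_modules` are hypothetical helper names, not part of RP or Parsl, and the check only verifies that a top-level module can be located, not that it works with the installed MPI library.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def require_modules(names):
    """Raise ImportError immediately if any required module is unavailable."""
    missing = missing_modules(names)
    if missing:
        raise ImportError("missing required modules: " + ", ".join(missing))
```

Calling `require_modules(["mpi4py"])` at the top of the test setup would then raise an `ImportError` in an environment without mpi4py instead of letting the run proceed.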

andre-merzky commented 8 months ago

Update: the test was hanging as the CI test env. was not set correctly, leading to a missing mpi4py, which is required by the RP runtime system.

Why would the test hang instead of fail on a missing mpi4py - is that an RP problem?

AymenFJA commented 8 months ago

@andre-merzky Yes, this is related to termination. I am opening a ticket soon regarding this matter:

  1. If a failure happens during initialization of the RP Master/Worker, RP does not terminate.
  2. If a failure happens, leftover processes generally hang, at least the process labeled rp.agent.
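
Until termination is fixed in RP itself, a test harness can guard against leftover processes from the outside. The sketch below is an assumption about how such a guard could look, not RP's or Parsl's actual mechanism: `run_with_timeout` is a hypothetical helper, it is POSIX-only, and it simply kills the whole process group of a command that does not finish in time.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(cmd, timeout):
    """Run `cmd` in its own session; kill its whole process group on timeout.

    Returns the exit code, or None if the command had to be killed.
    """
    # start_new_session=True makes the child a session and group leader,
    # so its process group id equals its pid (POSIX only).
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Kill the whole group so child processes do not linger.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
        return None


if __name__ == "__main__":
    # A command that would otherwise block for 30 seconds is killed after 1.
    print(run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 1))
```

This turns a hang into a detectable failure for the caller, but it does not address why the failing component did not terminate on its own.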

AymenFJA commented 8 months ago

Closing this in favor of https://github.com/radical-cybertools/radical.pilot/issues/3119