radical-cybertools / radical.saga

A Light-Weight Access Layer for Distributed Computing Infrastructure and Reference Implementation of the SAGA Python Language Bindings.
http://radical-cybertools.github.io/saga-python/
Other
83 stars 34 forks source link

Intermittent RPEX failure (PSI-J/SAGA layer) #885

Closed AymenFJA closed 5 months ago

AymenFJA commented 8 months ago

This is reported here: https://github.com/Parsl/parsl/issues/3029

andre-merzky commented 8 months ago

The problem seems to happen when trying to obtain the full path to the pilot sandbox on the target machine, way before pilot submission even hits the SAGA layer. That failing can means we can't open a shell in SAGA, and I would guess we are hitting the system too hard.
A quick workaround is to not use any env variables in default_remote_workdir for the test configuration (the failing code path attempts to resolve that env var). Also PSIJ should be installed as dependency to skip SAGA on job submission (will likely hit the same error later on).

andre-merzky commented 8 months ago

@AymenFJA : ping

AymenFJA commented 8 months ago

@andre-merzky : I am passing a dir session as follows:

self.session = rp.Session(cfg={'base': self.run_dir},
                                  uid=ru.generate_id('rpex.session',
                                                     mode=ru.ID_PRIVATE))

Is this problematic?

Also, I am using local.localhost which by default sets the default_remote_workdir to $HOME.

Important things to mention:

This is critical as I just saw another test stumbles with this error https://github.com/Parsl/parsl/pull/3053.

yadudoc commented 8 months ago

Wanted to add a quick paste from a recent failure in CI:

self = <radical.saga.utils.pty_process.PTYProcess object at 0x7fe2a4a5b410>
data = " export PROMPT_COMMAND='' PS1='$'; set prompt='$'\n", nolog = False

    def write (self, data, nolog=False) :
        """
        This method will repeatedly attempt to push the given data into the
        child's stdin pipe, until it succeeds to write all data.
        """

        with self.rlock :

            if not self.alive (recover=False) :
                raise ptye.translate_exception(
                        se.NoSuccess("cannot write to dead process (%s) [%5d]"
                                     % (self.cache[-256:], self.parent_in)))

            try :
                log = self._hide_data (data, nolog)
                log =  log.replace ('\n', '\\n')
                log =  log.replace ('\r', '')
                if  len(log) > _DEBUG_MAX :
                    self.logger.debug ("write: [%5d] [%5d] (%s ... %s)"
                                    % (self.parent_in, len(data),
                                       log[:30], log[-30:]))
                else :
                    self.logger.debug ("write: [%5d] [%5d] (%s)"
                                    % (self.parent_in, len(data), log))

                # attempt to write forever -- until we succeeed
                while data :

                    # check if the pty pipe is ready for data
>                   _, wlist, _ = select.select ([], [self.parent_in], [],
                                                 _POLLDELAY)
E                                                ValueError: filedescriptor out of range in select()

Full logs are over here-> https://github.com/Parsl/parsl/actions/runs/7818279731/job/21330235701#step:8:2525

andre-merzky commented 8 months ago

Hi @yadudoc - can you please add psij-python as dependency for the test environment? If psij is installed it should be picked up and be used as default launcher, thus avoiding problems with the SAGA PTY layer which is known to be rather fickle...

AymenFJA commented 8 months ago

A corresponding PR is open now to fix this issue and others: https://github.com/Parsl/parsl/pull/3060

benclifford commented 7 months ago

I dug into this issue quite a bit to try to understand why this was specifically being triggered in our CI.

It looks like in the radical.saga pty code, fork() can return file descriptors that can't be used with select sometimes, because they are greater than glibc's FD_SETSIZE. In this Parsl test case, that's because a bunch of earlier tests have run (in randomised order, which is why this doesn't always happen) leaving a lot of fds open in that main test process: not so many to exhaust the available fds, but enough to move into the fd space where select doesn't work.

benclifford commented 7 months ago

I opened a PR to install psij-python when installing parsl[radical-pilot] - parsl PR #3079. I'll give it a few runs, but if it avoids the above SAGA file descriptor limits and does not bring in new problems, this feels like a more principled approach than PR #3060.

benclifford commented 7 months ago

@andre-merzky I added psij-python to the environment in parsl PR #3079 but I still see the pty select code running and failing: for example scroll down near the end of https://github.com/Parsl/parsl/actions/runs/7928758618/job/21647641006

Is there something else that needs to happen to make this happen: > If psij is installed it should be picked up and be used as default launcher ?

andre-merzky commented 6 months ago

Fix has been released, SAGA is not an RP dependency anymore.

benclifford commented 6 months ago

@andre-merzky what should I be seeing as a newly released component? Our CI still installs radical-pilot 1.47, and PyPI reports 1.47.0 on Feb 8th as the latest release.

andre-merzky commented 6 months ago

@benclifford : pypi should deliver 1.48 by now. Could you please re-trigger the test pipeline? Thanks!

benclifford commented 6 months ago

@andre-merzky that new version seems to completely break radical-pilot+parsl for the usual test run that was working with 1.47: see this run https://github.com/Parsl/parsl/actions/runs/8418695922/job/23054193010?pr=3286 where at the end you can see something (I think radical pilot) sending a ctrl-C to the test process because it is so upset about something. I haven't dug into what's going on there.

andre-merzky commented 6 months ago

Hi @benclifford : the KeyboardInterrupt is issued when a pilot fails. In the logs, I find this:

$ grep -B 4 Error pmgr_launching.0000.log
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/psij/__init__.py", line 13, in <module>
    from .job_executor import JobExecutor
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/psij/job_executor.py", line 3, in <module>
    from distutils.version import Version
ModuleNotFoundError: No module named 'distutils'
--
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/radical/pilot/pmgr/launching/saga.py", line 35, in __init__
    raise rs_ex
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/radical/pilot/pmgr/launching/saga.py", line 15, in <module>
    import radical.saga
ModuleNotFoundError: No module named 'radical.saga'
--
Traceback (most recent call last):
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/radical/pilot/pmgr/launching/base.py", line 306, in work
    self._start_pilot_bulk(resource, schema, pilots)
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/radical/pilot/pmgr/launching/base.py", line 524, in _start_pilot_bulk
    raise RuntimeError('no launcher found for %s' % pilot['uid'])
RuntimeError: no launcher found for pilot.0000

SAGA not being found is intentional - but the distutils problem in PSIJ is what we stumble over.

I am afraid I will be offline for the rest of the week (I am not online today either tbh :-P). This ticket is re-opened now, I'll tend to it first thing on Monday. Also, we should add a parsl integration test to our own test suite so that things don't fall over on your end, but ideally before we push new releases...

andre-merzky commented 6 months ago

PS.: Oh, that was fixed in psij-python, but did not make it into that release yet. pinging @hategan

benclifford commented 6 months ago

ok. For now parsl PR 3290 pins radical.pilot to 1.47 which has been passing our testing for the last months - aside from the problem that this issue #885 is about. We can loosen those constraints later.

hategan commented 6 months ago

PS.: Oh, that was fixed in psij-python, but did not make it into that release yet. pinging @hategan

I just made a 0.9.5 release.

mturilli commented 6 months ago

@benclifford, can we consider this addressed and close this ticket?

benclifford commented 6 months ago

I guess so, if you think it's fixed? I haven't validated it in Parsl because haven't got back to trying it again, but there's a parsl issue https://github.com/Parsl/parsl/issues/3029 to remind me about that...

mturilli commented 6 months ago

@benclifford yes, I think it should be fixed and released now.

benclifford commented 6 months ago

got some problems integrating this with parsl on Python 3.12 with missing distutils - more info on the relevant parsl PR, https://github.com/Parsl/parsl/pull/3286#issuecomment-2042598808

benclifford commented 6 months ago

actually i think this looks the same as @andre-merzky pasted above, https://github.com/radical-cybertools/radical.saga/issues/885#issuecomment-2018074589

I can see that this version of psi-j-python is being installed in our build logs: psij-python-0.9.5

mtitov commented 6 months ago

@benclifford Ben, seems that there is a leftover of disutils in PSI/J, I've created a ticket regarding it https://github.com/ExaWorks/psij-python/issues/462

mtitov commented 6 months ago

@benclifford, our colleague just released a fixed version of PSI/J (0.9.6), would you mind to restart that tests? Thanks!

mturilli commented 5 months ago

I consider it fixed based on @AymenFJA feedback. If this is still a problem, please reopen the ticket.