Closed: @AymenFJA closed this issue 5 months ago.
The problem seems to occur when obtaining the full path to the pilot sandbox on the target machine, well before pilot submission even hits the SAGA layer. That failure means we cannot open a shell in SAGA, and I would guess we are hitting the system too hard.

A quick workaround is to not use any env variables in `default_remote_workdir` for the test configuration (the failing code path attempts to resolve that env var). Also, PSIJ should be installed as a dependency so that SAGA is skipped on job submission (it will likely hit the same error later on otherwise).
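As an illustration of that workaround (a hypothetical fragment — only the `default_remote_workdir` key is confirmed by this thread; the nesting and file location are assumptions, so check the RP resource-configuration docs), a user-level resource config could pin the workdir to an absolute path instead of an env variable:

```json
{
    "localhost": {
        "default_remote_workdir": "/tmp/rp_sandbox"
    }
}
```

Any absolute path works; the point is that nothing then needs a remote shell round-trip to resolve `$HOME`.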
@AymenFJA : ping
@andre-merzky : I am passing a dir to the session as follows:

```python
self.session = rp.Session(cfg={'base': self.run_dir},
                          uid=ru.generate_id('rpex.session',
                                             mode=ru.ID_PRIVATE))
```

Is this problematic?
Also, I am using `local.localhost`, which by default sets the `default_remote_workdir` to `$HOME`.
Important things to mention: `default_remote_workdir` to the `self.run_dir` for 2 reasons:
This is critical, as I just saw another test stumble over this error: https://github.com/Parsl/parsl/pull/3053.
Wanted to add a quick paste from a recent failure in CI:
```
self = <radical.saga.utils.pty_process.PTYProcess object at 0x7fe2a4a5b410>
data = " export PROMPT_COMMAND='' PS1='$'; set prompt='$'\n", nolog = False

    def write (self, data, nolog=False) :
        """
        This method will repeatedly attempt to push the given data into the
        child's stdin pipe, until it succeeds to write all data.
        """

        with self.rlock :

            if not self.alive (recover=False) :
                raise ptye.translate_exception(
                      se.NoSuccess("cannot write to dead process (%s) [%5d]"
                                  % (self.cache[-256:], self.parent_in)))

            try :
                log = self._hide_data (data, nolog)
                log = log.replace ('\n', '\\n')
                log = log.replace ('\r', '')
                if len(log) > _DEBUG_MAX :
                    self.logger.debug ("write: [%5d] [%5d] (%s ... %s)"
                                      % (self.parent_in, len(data),
                                         log[:30], log[-30:]))
                else :
                    self.logger.debug ("write: [%5d] [%5d] (%s)"
                                      % (self.parent_in, len(data), log))

                # attempt to write forever -- until we succeeed
                while data :

                    # check if the pty pipe is ready for data
>                   _, wlist, _ = select.select ([], [self.parent_in], [],
                                                 _POLLDELAY)
E                   ValueError: filedescriptor out of range in select()
```
Full logs are over here: https://github.com/Parsl/parsl/actions/runs/7818279731/job/21330235701#step:8:2525
Hi @yadudoc - can you please add `psij-python` as a dependency for the test environment? If PSIJ is installed, it should be picked up and used as the default launcher, thus avoiding problems with the SAGA PTY layer, which is known to be rather fickle...
A corresponding PR is open now to fix this issue and others: https://github.com/Parsl/parsl/pull/3060
I dug into this issue quite a bit to try to understand why this was specifically being triggered in our CI.
It looks like in the radical.saga PTY code, `fork()` can sometimes return file descriptors that cannot be used with `select()`, because they are greater than glibc's `FD_SETSIZE`. In this Parsl test case, that is because a bunch of earlier tests have already run (in randomised order, which is why this doesn't always happen), leaving a lot of fds open in the main test process: not enough to exhaust the available fds, but enough to move into the fd space where `select()` doesn't work.
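That failure mode is easy to reproduce outside of radical.saga. The sketch below (assuming the process's hard fd limit permits roughly 1100 open descriptors) opens enough fds to push a fresh pipe past `FD_SETSIZE` (1024 on glibc), at which point `select.select()` raises exactly the `ValueError` from the CI log:

```python
import os
import resource
import select

# Raise the soft fd limit so we can open enough descriptors to get past
# FD_SETSIZE, mimicking a test process with many fds left open.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = hard if hard != resource.RLIM_INFINITY else 4096
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, target), hard))

anchor  = os.open(os.devnull, os.O_RDONLY)
holders = [os.dup(anchor) for _ in range(1100)]   # occupy fds up to ~1100
r, w    = os.pipe()                               # these fds are now > 1023

err = None
try:
    # CPython rejects any fd >= FD_SETSIZE before calling select(2)
    select.select([r], [], [], 0)
except ValueError as e:
    err = e

for fd in holders + [anchor, r, w]:
    os.close(fd)

print(err)
```

Note that `poll()` or `selectors.DefaultSelector` would not have this limitation, since only `select()`-style fd sets are capped at `FD_SETSIZE`.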
I opened a PR to install `psij-python` when installing `parsl[radical-pilot]` - parsl PR #3079. I'll give it a few runs, but if it avoids the above SAGA file descriptor limits and does not bring in new problems, this feels like a more principled approach than PR #3060.
@andre-merzky I added `psij-python` to the environment in parsl PR #3079, but I still see the pty select code running and failing: for example, scroll down near the end of https://github.com/Parsl/parsl/actions/runs/7928758618/job/21647641006

Is there something else that needs to happen to make this work?

> If psij is installed it should be picked up and be used as default launcher
The fix has been released; SAGA is no longer an RP dependency.
@andre-merzky what should I be seeing as a newly released component? Our CI still installs radical-pilot 1.47, and PyPI reports 1.47.0 on Feb 8th as the latest release.
@benclifford : pypi should deliver 1.48 by now. Could you please re-trigger the test pipeline? Thanks!
@andre-merzky that new version seems to completely break radical-pilot+parsl for the usual test run that was working with 1.47: see this run https://github.com/Parsl/parsl/actions/runs/8418695922/job/23054193010?pr=3286 where at the end you can see something (I think radical pilot) sending a ctrl-C to the test process because it is so upset about something. I haven't dug into what's going on there.
Hi @benclifford : the `KeyboardInterrupt` is issued when a pilot fails. In the logs, I find this:
```
$ grep -B 4 Error pmgr_launching.0000.log
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/psij/__init__.py", line 13, in <module>
    from .job_executor import JobExecutor
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/psij/job_executor.py", line 3, in <module>
    from distutils.version import Version
ModuleNotFoundError: No module named 'distutils'
--
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/radical/pilot/pmgr/launching/saga.py", line 35, in __init__
    raise rs_ex
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/radical/pilot/pmgr/launching/saga.py", line 15, in <module>
    import radical.saga
ModuleNotFoundError: No module named 'radical.saga'
--
Traceback (most recent call last):
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/radical/pilot/pmgr/launching/base.py", line 306, in work
    self._start_pilot_bulk(resource, schema, pilots)
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/radical/pilot/pmgr/launching/base.py", line 524, in _start_pilot_bulk
    raise RuntimeError('no launcher found for %s' % pilot['uid'])
RuntimeError: no launcher found for pilot.0000
```
SAGA not being found is intentional - but the `distutils` problem in PSIJ is what we stumble over.
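For context, `distutils` was removed from the standard library in Python 3.12 (PEP 632), so `from distutils.version import Version` fails there. A sketch of the usual replacement for version comparisons, assuming the third-party `packaging` distribution is available:

```python
# distutils.version is gone in Python 3.12; packaging.version is the
# commonly recommended replacement (PEP 440 version semantics).
from packaging.version import Version

newer = Version("0.9.6") > Version("0.9.5")
print(newer)
```

This is presumably the kind of change PSIJ needed to make on its side.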
I am afraid I will be offline for the rest of the week (I am not online today either, tbh :-P). This ticket is re-opened now; I'll tend to it first thing on Monday. Also, we should add a Parsl integration test to our own test suite so that things don't fall over on your end, ideally before we push new releases...
PS: Oh, that was fixed in `psij-python`, but it did not make it into that release yet. Pinging @hategan.
OK. For now, parsl PR 3290 pins radical.pilot to 1.47, which has been passing our testing for the last few months - aside from the problem this issue #885 is about. We can loosen those constraints later.
> PS.: Oh, that was fixed in `psij-python`, but did not make it into that release yet. pinging @hategan

I just made a 0.9.5 release.
@benclifford, can we consider this addressed and close this ticket?
I guess so, if you think it's fixed? I haven't validated it in Parsl because I haven't got back to trying it again, but there's a parsl issue https://github.com/Parsl/parsl/issues/3029 to remind me about that...
@benclifford yes, I think it should be fixed and released now.
Got some problems integrating this with Parsl on Python 3.12 with missing `distutils` - more info on the relevant parsl PR: https://github.com/Parsl/parsl/pull/3286#issuecomment-2042598808
Actually, I think this looks the same as what @andre-merzky pasted above: https://github.com/radical-cybertools/radical.saga/issues/885#issuecomment-2018074589
I can see that this version of `psij-python` is being installed in our build logs: psij-python-0.9.5
@benclifford Ben, it seems that there is a leftover of `distutils` in PSI/J; I've created a ticket regarding it: https://github.com/ExaWorks/psij-python/issues/462
@benclifford, our colleague just released a fixed version of PSI/J (0.9.6) - would you mind restarting those tests? Thanks!
I consider it fixed based on @AymenFJA's feedback. If this is still a problem, please reopen the ticket.
This is reported here: https://github.com/Parsl/parsl/issues/3029