radical-cybertools / radical.entk

The RADICAL Ensemble Toolkit
https://radical-cybertools.github.io/entk/index.html

radical incompatible with setuptools 69.5.1 #664

Closed JMGilbert closed 6 months ago

JMGilbert commented 6 months ago

I've been running FACTS for a while on the same environment and I recently cleared my radical pilot sandbox including the virtual environment. When I ran FACTS again, it crashed and the radical session bootstrap_0.out contained an error that it had failed to install any of the radical packages.

I managed to track this down to setuptools by creating a fresh virtual environment, then executing pip install setuptools==69.0.2 and pip install radical.entk. This worked. Then I created another fresh virtual environment, executed pip install setuptools --upgrade and pip install radical.entk which returned an error.

I think because radical installs the most recent version of setuptools in the virtual environment, it then fails to install the radical packages and the session crashes.
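The two experiments above bracket the break: setuptools 69.0.2 installed cleanly, while the latest release at the time (69.5.1, per the issue title) did not. Below is a minimal Python sketch for checking where an environment falls relative to those two data points. The helper names are hypothetical, and versions between the two were not tested in this thread, so the function reports them as unknown rather than guessing.

```python
import re
from importlib.metadata import PackageNotFoundError, version

KNOWN_GOOD = "69.0.2"  # reported working in this thread
KNOWN_BAD = "69.5.1"   # reported failing (issue title)


def as_tuple(v):
    """Turn '69.5.1' into (69, 5, 1); extra components are ignored."""
    return tuple(int(n) for n in re.findall(r"\d+", v)[:3])


def classify(v):
    """Classify a setuptools version against the two reported data points.

    Assumption: anything at or below KNOWN_GOOD behaves like it, and anything
    at or above KNOWN_BAD behaves like it (for RCT releases before 1.52.0).
    """
    t = as_tuple(v)
    if t <= as_tuple(KNOWN_GOOD):
        return "ok"
    if t >= as_tuple(KNOWN_BAD):
        return "affected"
    return "unknown"


def installed_setuptools():
    """Return the installed setuptools version, or None if it is absent."""
    try:
        return version("setuptools")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    v = installed_setuptools()
    print(v, classify(v) if v else "setuptools not installed")
```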

andre-merzky commented 6 months ago

Thanks Jonah - we got surprised by that break also. Setuptools changed the naming schema for sdist packages, and that broke our setup. I just pushed out new releases (1.52.0) for the whole RCT stack which should resolve the problem, could you please give it a spin and report back? Thanks!
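The naming change can be illustrated in a few lines. This is a sketch that assumes the change in question is the PEP 625 style normalization of the project name in sdist filenames (runs of `-`, `_` and `.` collapsed into a single underscore, lowercased). The function names are hypothetical, and this is not how setuptools or the RCT setup scripts are actually implemented.

```python
import re


def legacy_sdist_filename(project, ver):
    """Older behavior: the project name is used as written."""
    return f"{project}-{ver}.tar.gz"


def normalized_sdist_filename(project, ver):
    """Assumed newer behavior: name normalized, separators become '_'."""
    name = re.sub(r"[-_.]+", "_", project).lower()
    return f"{name}-{ver}.tar.gz"


if __name__ == "__main__":
    print(legacy_sdist_filename("radical.entk", "1.52.0"))
    print(normalized_sdist_filename("radical.entk", "1.52.0"))
```

Under that assumption, any deployment step that looks for the sdist under the dotted name would no longer find the file once it is written with an underscore.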

JMGilbert commented 6 months ago

Thanks for the quick response! The installation is now working, but I'm having a separate issue with the newest versions of the radical packages. When I try to run FACTS, it fails out instantly with the error:

Traceback (most recent call last):
  File "/home/jonahmgilbert/miniconda3/envs/facts/lib/python3.11/site-packages/radical/entk/appman/appmanager.py", line 459, in run
    self._rmgr.submit_resource_request()
  File "/home/jonahmgilbert/miniconda3/envs/facts/lib/python3.11/site-packages/radical/entk/execman/rp/resource_manager.py", line 225, in submit_resource_request
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
  File "/home/jonahmgilbert/miniconda3/envs/facts/lib/python3.11/site-packages/radical/pilot/pilot.py", line 589, in wait
    time.sleep(0.1)
KeyboardInterrupt

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/c/Users/jonahmgilbert/Documents/GitHub/facts/runFACTS.py", line 193, in <module>
    run_experiment(args.edir, args.debug, args.alt_id, resourcedir=args.resourcedir, makeshellscript = args.shellscript, globalopts = args.global_options)
  File "/mnt/c/Users/jonahmgilbert/Documents/GitHub/facts/runFACTS.py", line 86, in run_experiment
    amgr.run()
  File "/home/jonahmgilbert/miniconda3/envs/facts/lib/python3.11/site-packages/radical/entk/appman/appmanager.py", line 485, in run
    raise EnTKError(ex) from ex
radical.entk.exceptions.EnTKError

When I go to check my re.session, re.session.BFI-33300.jonahmgilbert.019828.0001.zip, it's pretty much empty. I tried running the bootstrap_0.sh and I see:

bash ~/radical.pilot.sandbox/re.session.BFI-33300.jonahmgilbert.019828.0001/pilot.0000/bootstrap_0.sh
DeprecationWarning: 'source deactivate' is deprecated. Use 'conda deactivate'.
/etc/profile.d/conda.sh: No such file or directorytivate: line 6: C:/Users/jonahmgilbert/Miniconda3
: numeric argument requiredMiniconda3/Scripts/deactivate: line 6: return: 1
# -------------------------------------------------------------------
bootstrap_0 running on host: BFI-33300.ad.uchicago.edu.
bootstrap_0 started as     : '/home/jonahmgilbert/radical.pilot.sandbox/re.session.BFI-33300.jonahmgilbert.019828.0001/pilot.0000/bootstrap_0.sh '
safe environment of bootstrap_0
bootstrap_0 stderr redirected to stdout
https://files.pythonhosted.org/packages/1c/c2/7516ea983fc37cec2128e7cb0b2b516125a478f8fc633b8f5dfa849f13f7/virtualenv-16.7.12.tar.gz
# -------------------------------------------------------------------
# untar sandbox
# -------------------------------------------------------------------
tar (child): ../: Cannot read: Is a directory
tar (child): At beginning of tape, quitting now
tar (child): Error is not recoverable: exiting now

gzip: stdin: unexpected end of file
tar: Child returned status 2
tar: Error is not recoverable: exiting now
# -------------------------------------------------------------------
create gtod, prof
1713205080.531938,sync_abs,bootstrap_0,MainThread,,PMGR_ACTIVE_PENDING,BFI-33300:172.26.75.24:1713205080.531938:1713205080.531938:1713205080.531938
VIRTENV           :
mkdir: cannot create directory ‘’: No such file or directory
VIRTENV normalized: /home/jonahmgilbert
missing RUNTIME

I've tried this from both a conda environment and a default python environment with the same error. Not sure if this is related or not.
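Below is a small sketch for pulling the failure lines out of a captured bootstrap log like the one above. The marker strings are taken from this output only; they are not an official list of RADICAL-Pilot error messages, and `find_bootstrap_problems` is a hypothetical helper.

```python
# Substrings copied from the log in this comment (assumption: not exhaustive).
MARKERS = (
    "Error is not recoverable",
    "cannot create directory",
    "missing RUNTIME",
    "No such file or directory",
)


def find_bootstrap_problems(log_text):
    """Return (line_number, line) pairs for lines containing a known marker."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        if any(marker in line for marker in MARKERS):
            hits.append((lineno, line.strip()))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python scan_bootstrap.py bootstrap_0.out
    with open(sys.argv[1], errors="replace") as fh:
        for lineno, line in find_bootstrap_problems(fh.read()):
            print(f"{lineno}: {line}")
```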

andre-merzky commented 6 months ago

Yes, indeed - our last release solves the pip install problem, but some second-order deployment issues (from the same setuptools update) keep popping up. The RP branch hotfix/deployment tries to address these issues. The PR radical-cybertools/radical.pilot/pull/3169 is still work in progress, but I hope it will converge in the next 24 hours and we can release again.

I am really sorry for the problems - the setuptools upgrade hit us from nowhere ... :-(

andre-merzky commented 6 months ago

That PR should be in a working state now.

AlexReedy commented 6 months ago

Hey @JMGilbert, I have been working with @andre-merzky and @mturilli on this for the past few days as well. Sorry, I was unaware of the most recent update. While it's good we discovered it, the flip side of this issue is that whatever setup we have in FACTS is not passing the information the pilot needs to recognize that it should use the ve that is active at launch instead of creating a new one. This is also being worked on.

@andre-merzky I have sent @mturilli the new docker file we use, and will test the new versions today.... This explains why, when I checked the RCT versions earlier, they were all 1.52 hahah

AlexReedy commented 6 months ago

Hey @andre-merzky and @mturilli, just tried with the new stack versions and am still getting the error:

EnTK session: re.session.fce5772a-fc1b-11ee-81a7-0242ac110002
Creating AppManager
Setting up ZMQ queues ok
AppManager initialized ok
Validating and assigning resource manager ok
** STEP: climate_step **
Setting up ZMQ queues n/a
All components terminated
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/radical/entk/appman/appmanager.py", line 459, in run
    self._rmgr.submit_resource_request()
  File "/usr/local/lib/python3.8/dist-packages/radical/entk/execman/rp/resource_manager.py", line 225, in submit_resource_request
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
  File "/usr/local/lib/python3.8/dist-packages/radical/pilot/pilot.py", line 589, in wait
    time.sleep(0.1)
KeyboardInterrupt

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "runFACTS.py", line 193, in <module>
    run_experiment(args.edir, args.debug, args.alt_id, resourcedir=args.resourcedir, makeshellscript = args.shellscript, globalopts = args.global_options)
  File "runFACTS.py", line 86, in run_experiment
    amgr.run()
  File "/usr/local/lib/python3.8/dist-packages/radical/entk/appman/appmanager.py", line 485, in run
    raise EnTKError(ex) from ex
radical.entk.exceptions.EnTKError

It also appears that bootstrap_0.out does not get created either, so I can't check to see if it's the same warning/error internally:

(factsVe) jovyan@08654797330c:~/radical.pilot.sandbox$ find re.session.fce5772a-fc1b-11ee-81a7-0242ac110002/pilot.0000/
re.session.fce5772a-fc1b-11ee-81a7-0242ac110002/pilot.0000/
re.session.fce5772a-fc1b-11ee-81a7-0242ac110002/pilot.0000/agent_0.cfg
re.session.fce5772a-fc1b-11ee-81a7-0242ac110002/pilot.0000/radical-utils-env.sh
re.session.fce5772a-fc1b-11ee-81a7-0242ac110002/pilot.0000/location.lst
re.session.fce5772a-fc1b-11ee-81a7-0242ac110002/pilot.0000/bootstrap_0.sh
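The same check can be scripted. Below is a minimal sketch that reports which files are absent from a pilot sandbox directory; the default list of expected files is an assumption based only on the names mentioned in this thread, and `missing_pilot_files` is a hypothetical helper.

```python
from pathlib import Path


def missing_pilot_files(pilot_dir, expected=("bootstrap_0.sh", "bootstrap_0.out")):
    """Return the names from `expected` that are not regular files in pilot_dir."""
    pilot_dir = Path(pilot_dir)
    return [name for name in expected if not (pilot_dir / name).is_file()]


if __name__ == "__main__":
    import sys

    # Usage: python check_pilot.py ~/radical.pilot.sandbox/<session>/pilot.0000
    for name in missing_pilot_files(sys.argv[1]):
        print(f"missing: {name}")
```

A missing bootstrap_0.out suggests the bootstrapper never started or never wrote output, which is a different failure from a bootstrapper that ran and logged an error.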

andre-merzky commented 6 months ago

"with the new stack versions"

@AlexReedy : does that mean the 1.52 release or the RP branch (hotfix/deployment) mentioned in this thread?

AlexReedy commented 6 months ago

@andre-merzky ah sorry, I thought they had been grouped together now. That was 1.52 (at least in docker it was still failing). I'll try the branch now.
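To avoid this kind of mix-up, the installed versions can be printed before each test run. Below is a sketch using only the standard library; the package list is an assumption (the thread refers to "the whole RCT stack" without enumerating it), and `stack_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed subset of the RCT stack; extend as needed.
RCT_PACKAGES = ("radical.utils", "radical.pilot", "radical.entk")


def stack_versions(packages=RCT_PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in stack_versions().items():
        print(f"{name:15s} {ver or 'not installed'}")
```

This reports the version string recorded in the package metadata. An install from a branch may carry a similar version string, so this alone may not distinguish the 1.52 release from the hotfix branch.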

AlexReedy commented 6 months ago

@andre-merzky @mturilli @JMGilbert looks like running with the rp hotfix branch is working!

AlexReedy commented 6 months ago

Note: This is where rp just creates ve.localhost not forcing the launch ve, which still seems to not want to work for me

andre-merzky commented 6 months ago

Thanks for checking, @AlexReedy !

"Note: This is where rp just creates ve.localhost not forcing the launch ve, which still seems to not want to work for me"

Can you please expand on the above? Am I interpreting correctly that for your case it works if the pilot agent is running in its own VE which the pilot bootstrapper creates, but fails when the pilot tries to use the client side VE?

AlexReedy commented 6 months ago

No, it runs through using the pilot-created ve every time. I can't seem to get it to run the client side ve, but this is not quite related to this git issue. I will do some more testing. For this issue with setuptools, everything seems to be working fine!
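For the follow-up testing, which VE the client is actually running in can be checked from Python before launching. This is a sketch with a hypothetical helper name. It only inspects the client interpreter; how RADICAL-Pilot is told to reuse that VE is not covered in this thread and is not shown here.

```python
import os
import sys


def active_environment():
    """Describe the Python environment of the current (client) interpreter."""
    # venv and recent virtualenv set sys.base_prefix; legacy virtualenv
    # (such as the 16.x series) sets sys.real_prefix instead.
    in_venv = (
        getattr(sys, "real_prefix", None) is not None
        or sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    )
    return {
        "python": sys.executable,
        "prefix": sys.prefix,
        "in_venv": in_venv,
        "conda_prefix": os.environ.get("CONDA_PREFIX"),  # None outside conda
    }


if __name__ == "__main__":
    for key, value in active_environment().items():
        print(f"{key:13s} {value}")
```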

mturilli commented 6 months ago

This seems to be successfully completed and released. Pls reopen if you have an issue with setuptools.