radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html
Other
54 stars 23 forks source link

less saga #3131

Closed andre-merzky closed 6 months ago

andre-merzky commented 7 months ago

remove rs.Session as base class for rp.Session. This closes #2674

codecov[bot] commented 7 months ago

Codecov Report

Attention: Patch coverage is 37.35409% with 161 lines in your changes are missing coverage. Please review.

Project coverage is 45.02%. Comparing base (74a8188) to head (bd13797).

Files Patch % Lines
src/radical/pilot/utils/staging_helper.py 58.47% 49 Missing :warning:
src/radical/pilot/utils/session.py 14.63% 35 Missing :warning:
src/radical/pilot/session.py 40.00% 12 Missing :warning:
src/radical/pilot/staging_directives.py 20.00% 12 Missing :warning:
src/radical/pilot/tmgr/staging_input/default.py 14.28% 12 Missing :warning:
src/radical/pilot/tmgr/staging_output/default.py 0.00% 9 Missing :warning:
src/radical/pilot/pilot.py 0.00% 8 Missing :warning:
src/radical/pilot/pilot_manager.py 0.00% 7 Missing :warning:
src/radical/pilot/pmgr/launching/base.py 0.00% 6 Missing :warning:
src/radical/pilot/agent/staging_input/default.py 20.00% 4 Missing :warning:
... and 4 more
Additional details and impacted files ```diff @@ Coverage Diff @@ ## devel #3131 +/- ## ========================================== + Coverage 44.27% 45.02% +0.74% ========================================== Files 97 94 -3 Lines 10515 10261 -254 ========================================== - Hits 4656 4620 -36 + Misses 5859 5641 -218 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

mtitov commented 7 months ago

@andre-merzky RS Session's "context" was mentioned in one of the tutorials, which also could be removed

# notebook docs/source/tutorials/submission.ipynb

nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
------------------
import radical.pilot as rp

session = rp.Session()

context = rp.Context('ssh')
context.user_id = "user_id"

session.add_context(context)
------------------

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_3499/127513295.py in <module>
      6 context.user_id = "user_id"
      7 
----> 8 session.add_context(context)

AttributeError: 'Session' object has no attribute 'add_context'
review-notebook-app[bot] commented 7 months ago

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

andre-merzky commented 7 months ago

thanks @mtitov, done as suggested.

AymenFJA commented 6 months ago

@andre-merzky:

Traceback (most recent call last):
File \"/home/aymen/ve/test_rpex_final/lib/python3.8/site-packages/radical/pilot/tmgr/staging_input/default.py\",
line 285, in work\n    self._handle_task(task, actionables)\n  
File "/home/aymen/ve/test_rpex_final/lib/python3.8/site-packages/radical/pilot/tmgr/staging_input/default.py\", line 334, 
in _handle_task\n    self._fs_cache[key] = rsfs.Directory(tmp)\nNameError: name 'rsfs' is not defined",

maybe:

import radical.saga.filesystem as rsfs
andre-merzky commented 6 months ago

Thanks for spotting that, @AymenFJA - not sure how that crept in again. Either way, removed again now :-)

maybe:

import radical.saga.filesystem as rsfs

Rather not - the idea was to remove saga ;-)

andre-merzky commented 6 months ago

Given the urgency for this PR, I would suggest to review it now. The failing tests are related but not critical: pilot log- and profiles are not staged back just yet. I would suggest to address that in a separate ticket.

AymenFJA commented 6 months ago

@andre-merzky I am trying to run the following example: ~/RADICAL/radical.pilot/examples$ python 05_task_input_data.py and I am getting the following errors:

aymen@surfacebook:~/radical.pilot.sandbox/rp.session.surfacebook.aymen.019795.0001/pilot.0000/task.000028$ cat *.err
/usr/bin/wc: input.dat: No such file or directory
submit tasks

create 128 task description(s)
        ........................................................................
        ........................................................              ok

--------------------------------------------------------------------------------
gather results

  * task.000000: FAIL, exit:   1, out:
  * task.000001: FAIL, exit:   1, out:
  * task.000002: FAIL, exit:   1, out:
  * task.000003: FAIL, exit:   1, out:
  * task.000004: FAIL, exit:   1, out:
  * task.000005: FAIL, exit:   1, out:
  * task.000006: FAIL, exit:   1, out:
  * task.000007: FAIL, exit:   1, out:
  * task.000008: FAIL, exit:   1, out:
  * task.000009: FAIL, exit:   1, out:
  * task.000010: FAIL, exit:   1, out:
  * task.000011: FAIL, exit:   1, out:
  * task.000012: FAIL, exit:   1, out:
  * task.000013: FAIL, exit:   1, out:
  * task.000014: FAIL, exit:   1, out:
  * task.000015: FAIL, exit:   1, out:
  * task.000016: FAIL, exit:   1, out:
  * task.000017: FAIL, exit:   1, out:
  * task.000018: FAIL, exit:   1, out:
  * task.000019: FAIL, exit:   1, out:
  * task.000020: FAIL, exit:   1, out:
  * task.000021: FAIL, exit:   1, out:
tmgr.0000.json:            "exception": "RuntimeError(\"task failed\")",
tmgr.0000.json:            "exception": "RuntimeError(\"task failed\")",
tmgr.0000.json:            "exception": "RuntimeError(\"task failed\")",
tmgr.0000.json:            "exception": "RuntimeError(\"task failed\")",
tmgr.0000.json:            "exception": "RuntimeError(\"task failed\")",
tmgr.0000.json:            "exception": "RuntimeError(\"task failed\")",
tmgr.0000.json:            "exception": "RuntimeError(\"task failed\")",
tmgr.0000.json:            "exception": "RuntimeError(\"task failed\")",
tmgr.0000.json:            "exception": "RuntimeError(\"task failed\")",
tmgr.0000.json:            "exception": "RuntimeError(\"task failed\")",
tmgr.0000.json:            "exception": "RuntimeError(\"task failed\")",
andre-merzky commented 6 months ago

Ah, great catch - schema expansion seems to be off. Fix incoming. Thanks for your diligence and patience with this!

AymenFJA commented 6 months ago

Thanks, @andre-merzky So far, this is the update:

andre-merzky commented 6 months ago

Here several comments, and the rest are inline

* in `bin/radical-pilot-create-static-ve` need to edit default modules (L59-61) - at least one module (e.g., `apache-libcloud`) is SAGA's dependency and is not needed to be installed (not sure about `bootstrap_0.sh`)

Thanks, done (bootstrap_0.sh was clean already)

* need to update `requirements-ci.txt` (remove `saga` and add `psij`)

Thanks, fixed.

* update PilotDescription docstring?

Done.

mtitov commented 6 months ago

one leftover:

~/work/radical.pilot/radical.pilot/testenv/lib/python3.7/site-packages/radical/pilot/pilot_manager.py in cancel_pilots(self, uids, _timeout)
    744                 if uid not in self._pilots:
    745                     raise ValueError('pilot %s not known' % uid)
--> 746                 self._pilots[uid]._finalize()
    747 
    748 

AttributeError: 'Pilot' object has no attribute '_finalize'

(1) https://github.com/radical-cybertools/radical.pilot/blob/8cfe429c83d290efb2c083bf9d677201e537595f/src/radical/pilot/pilot.py#L599

(2) https://github.com/radical-cybertools/radical.pilot/blob/8cfe429c83d290efb2c083bf9d677201e537595f/src/radical/pilot/pilot_manager.py#L746

andre-merzky commented 6 months ago

one leftover:

Oops - sorry for that - fixed!

mtitov commented 6 months ago

(1)

https://github.com/radical-cybertools/radical.pilot/blob/8cfe429c83d290efb2c083bf9d677201e537595f/src/radical/pilot/pilot.py#L597-L599

@andre-merzky sorry, one more (call of the same method, but it appeared in two places) - another left in pilot's cancel method

andre-merzky commented 6 months ago

(1) https://github.com/radical-cybertools/radical.pilot/blob/8cfe429c83d290efb2c083bf9d677201e537595f/src/radical/pilot/pilot.py#L597-L599

@andre-merzky sorry, one more (call of the same method, but it appeared in two places) - another left in pilot's cancel method

Thanks - fixed!

AymenFJA commented 6 months ago

Thanks, @andre-merzky So far, this is the update:

  • [x] Local machine test: passed
  • [x] HPC test: passed

@andre-merzky the tests passed locally and on HPC interactivley (Bridges2).

andre-merzky commented 6 months ago

Thanks, @andre-merzky So far, this is the update:

  • [x] Local machine test: passed
  • [x] HPC test: passed

@andre-merzky the tests passed locally and on HPC interactivley (Bridges2).

Great, thanks @AymenFJA ! Let's wait for @mtitov to confirm on Frontier before merging...

mtitov commented 6 months ago

@andre-merzky with these updates there is no collection of logs and profs files

mtitov commented 6 months ago

Worked correctly for both runs from the interactive job and from the login node on Frontier

Though, to run from login node the following should be fixed:

Also, while running from the login node, we didn't have possibility to control SMT and CoreSpec (as well options exclusive and export)

``` $ scontrol show job 1777165 JobId=1777165 JobName=b38c0cd4-e41e-4c41-b894-c990ab11a00d.job UserId=matitov(16142) GroupId=matitov(28035) MCS_label=N/A Priority=43286400 Nice=0 Account=chm155_003 QOS=normal JobState=CANCELLED Reason=None Dependency=(null) Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:15 RunTime=00:07:40 TimeLimit=00:30:00 TimeMin=N/A SubmitTime=2024-03-20T19:37:27 EligibleTime=2024-03-20T19:37:27 AccrueTime=2024-03-20T19:37:27 StartTime=2024-03-20T19:38:08 EndTime=2024-03-20T19:45:48 Deadline=N/A SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-03-20T19:38:08 Scheduler=Backfill Partition=batch AllocNode:Sid=login12:50002 ReqNodeList=(null) ExcNodeList=(null) NodeList=frontier06913 BatchHost=frontier06913 NumNodes=1 NumCPUs=112 NumTasks=56 CPUs/Task=1 ReqB:S:C:T=0:0:*:1 ReqTRES=cpu=56,mem=1M,node=1,billing=56 AllocTRES=cpu=112,node=1,billing=112 Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=* MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0 Features=(null) DelayBoot=00:00:00 OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null) Command=/ccs/home/matitov/.psij/work/slurm/b38c0cd4-e41e-4c41-b894-c990ab11a00d.job WorkDir=/lustre/orion/scratch/matitov/chm155/flux/radical.pilot.sandbox/rp.session.login12.matitov.019802.0006/pilot.0000 AdminComment=GPU_SRANGE_SRC=default,GPU_POWER_CAP=560,GPU_POWER_CAP_SRC=default,GPU_SRANGE=500-1700 StdErr=/dev/null StdIn=/dev/null StdOut=/dev/null Power= ```

p.s. my test was with the single task

andre-merzky commented 6 months ago

@andre-merzky with these updates there is no collection of logs and profs files

* in my example I've set `session.close(download=True)`, but no files on the client side in `<sessionID>/pilot.0000`

Thanks, this should now work again.

mturilli commented 6 months ago

9083f19 breaks the test of the submission tutorial. We assume (maybe rightfully or wrongly) that each tutorial creates a session. That commit removes the creation of the session from the tutorial. In turn that breaks the github workflow. I am going to commit a fix and we can leave for another time discussing whether we should assume that each tutorial creates a session.

andre-merzky commented 6 months ago

9083f19 breaks the test of the submission tutorial. We assume (maybe rightfully or wrongly) that each tutorial creates a session. That commit removes the creation of the session from the tutorial. In turn that breaks the github workflow. I am going to commit a fix and we can leave for another time discussing whether we should assume that each tutorial creates a session.

Right - the PR removes the Context as that was part of SAGA's security model - which got removed along with SAGA. Incidentally that was the piece of code which created an (empty) session indeed...