radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

Feature/service startup flag #3026

Closed mtitov closed 1 year ago

mtitov commented 1 year ago

@dyokelson hi Dewi, can you please try this branch to test the service startup?

(1) Install the required branches into your virtual environment (uninstall the previously installed RADICAL tools first):

pip install git+https://github.com/radical-cybertools/radical.utils.git@devel_nodb_2
pip install git+https://github.com/radical-cybertools/radical.pilot.git@feature/service_startup_flag
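
Once installed, it can help to confirm that these feature branches are the ones actually being imported, e.g. by running radical-stack or, assuming the usual version_detail attributes of the RADICAL packages, with a quick check like:

# sanity check: the installed packages should report the expected branches
# (version_detail normally carries the git-describe string incl. the branch)
import radical.utils as ru
import radical.pilot as rp

print('radical.utils :', ru.version_detail)
print('radical.pilot :', rp.version_detail)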

(2) Service task description - once you have described the service task, set its metadata attribute as follows:


import radical.pilot as rp

service_task = rp.TaskDescription()
...
service_task.metadata = {
    'name': 'soma_00',  # in RP Registry its URL will be accessed with path: 
                        #   "service.soma_00.<idx>.url"
                        # where <idx> refers to the instance id
    'startup_file': <full_path>  # for now try with the full path outside of the pilot sandbox
}
...
pd = rp.PilotDescription()
...
pd.services = [service_task]
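
For context, here is a minimal end-to-end sketch of where the service description fits; apart from the metadata handling everything below uses the standard RP calls, and the resource label, paths, and runtime are placeholder values:

# minimal sketch: attach the service task to a pilot and submit it
import radical.pilot as rp

session = rp.Session()
try:
    pmgr = rp.PilotManager(session=session)
    tmgr = rp.TaskManager(session=session)

    # the service the pilot agent brings up before regular tasks run
    service_task = rp.TaskDescription()
    service_task.executable = '/path/to/service_executable'        # placeholder
    service_task.metadata   = {'name'        : 'soma_00',
                               'startup_file': '/path/to/startup_file'}

    pd = rp.PilotDescription({'resource': 'csc.mahti',   # placeholder values
                              'runtime' : 10,
                              'cores'   : 1})
    pd.services = [service_task]

    pilot = pmgr.submit_pilots(pd)
    tmgr.add_pilots(pilot)

    # ... describe and submit regular tasks via tmgr.submit_tasks() ...
finally:
    session.close(download=True)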

dyokelson commented 1 year ago

@mtitov I installed a new Python environment with the following radical-stack:

python               : /CSC_CONTAINER/miniconda/envs/env1/bin/python3
  pythonpath           : 
  version              : 3.11.4
  virtualenv           : env1

  radical.gtod         : 1.20.1
  radical.pilot        : 1.37.0-v1.36.0-621-gfdf0e8e9e@feature-service_startup_flag
  radical.saga         : 1.36.0
  radical.utils        : 1.40.0-v1.33.0-32-g9eb1b32@devel_nodb_2

Then I updated the Python script as suggested, I think. However, I'm getting an error before it reaches any of the new service metadata code. I'm wondering whether the new install isn't working, or whether I missed something I need to change in the script:

Traceback (most recent call last):
  File "/projappl/project_2006549/radical-pilot/rp_soma.py", line 9, in <module>
    session = rp.Session()
              ^^^^^^^^^^^^
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.11/site-packages/radical/pilot/session.py", line 195, in __init__
    self._init_primary()
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.11/site-packages/radical/pilot/session.py", line 249, in _init_primary
    self._init_cfg_from_scratch()
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.11/site-packages/radical/pilot/session.py", line 395, in _init_cfg_from_scratch
    while isinstance(rcfg['schemas'][schema], str):
                     ~~~~~~~~~~~~~~~^^^^^^^^
TypeError: list indices must be integers or slices, not str

Or is it the resource config? I still have the local one for mahti that we updated; I don't think I am using the one that was checked in, unless that happens automatically now with these new branches, but they should amount to the same config either way.

mtitov commented 1 year ago

> Or is it the resource config? I still have the local one for mahti that we updated; I don't think I am using the one that was checked in, unless that happens automatically now with these new branches, but they should amount to the same config either way.

Oh, right, that one is read automatically, so you can delete your local one. With the new no-MongoDB branch the resource config structure changed a little, and I adjusted the one that you had merged, so this branch already includes csc.mahti with the correct structure.
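
For reference, user-level resource configs normally live under $HOME/.radical/pilot/configs/ and shadow the packaged ones. A quick way to check for and drop a stale override (the filename below follows the usual resource_<domain>.json convention and is an assumption, adjust if yours differs):

# look for a user-level resource config that would shadow the packaged
# csc.mahti entry; the exact filename is an assumption - adjust if needed
import os

cfg = os.path.expanduser('~/.radical/pilot/configs/resource_csc.json')
if os.path.isfile(cfg):
    print('removing local override:', cfg)
    os.remove(cfg)
else:
    print('no local override found')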

dyokelson commented 1 year ago

@mtitov Here is the new error we are getting and the task description information:

================================================================================
 Getting Started (RP version 1.37.0)                                            
================================================================================

new session: [rp.session.c1102.mahti.csc.fi.dewiy.019608.0003]                 \
zmq proxy  : [tcp://10.141.32.52:10001]                                       ok
create pilot manager                                                          ok

--------------------------------------------------------------------------------
submit pilot                                                                    

submit 1 pilot(s)
        pilot.0000   csc.mahti                 1 cores       0 gpus           ok
create task manager
Traceback (most recent call last):
  File "/projappl/project_2006549/radical-pilot/rp_soma.py", line 56, in <module>
    tmgr = rp.TaskManager(session=session)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.11/site-packages/radical/pilot/task_manager.py", line 160, in __init__
    self._cmgr.start_components(self._cfg.components)
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.11/site-packages/radical/pilot/utils/component_manager.py", line 191, in start_components
    out, err, ret = ru.sh_callout(cmd, cwd=self._cfg.path)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.11/site-packages/radical/utils/shell.py", line 64, in sh_callout
    stdout, stderr = p.communicate()
                     ^^^^^^^^^^^^^^^
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.11/subprocess.py", line 1209, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.11/subprocess.py", line 2108, in _communicate
    ready = selector.select(timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/CSC_CONTAINER/miniconda/envs/env1/lib/python3.11/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)

session = rp.Session()

pmgr = rp.PilotManager(session=session)

td = rp.TaskDescription()

td.pre_exec = ['module load gcc/9.4.0 openmpi/4.1.2-cuda cuda cmake',
               'export SOMA_SERVER_ADDR_FILE=/projappl/project_2006549/radical-pilot/server.add',
               'export SOMA_NODE_ADDR_FILE=/projappl/project_2006549/radical-pilot/node.add',
               'export SOMA_NUM_SERVER_INSTANCES=1',
               'export SOMA_NUM_SERVERS_PER_INSTANCE=1',
               'export SOMA_SERVER_INSTANCE_ID=0']
td.executable = '/projappl/project_2006549/soma-collector/build/examples/example-server'
td.arguments      = ['-a', 'ofi+verbs://']
td.ranks          =  1
td.cores_per_rank =  1

td.metadata = {
    'name': 'soma_00',  # in RP Registry its URL will be accessed with path: 
                        #   "service.soma_00.<idx>.url"
                        # where <idx> refers to the instance id
    'startup_file': '/projappl/project_2006549/radical-pilot/server.add'  # for now try with the full path outside of the pilot sandbox
}

pd_init = {'resource'     : 'csc.mahti',
           'runtime'      : 5,  # pilot runtime minutes
           'exit_on_error': True,
           'project'      : 'project_2006549',
           'queue'        : 'test',
           'cores'        : 1,
           'gpus'         : 0,
           'access_schema': 'interactive',
           'services'     : [td]}

pdesc = rp.PilotDescription(pd_init)
pdesc.services = [td]

dyokelson commented 1 year ago

Hi, I have been able to launch SOMA as a service on one node and LULESH as a task on another, and they have been able to connect and send/receive data. Next I will try it with the TAU performance data layer and with different resource allocations (multiple nodes/tasks, etc.).
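
For anyone following along, a sketch of what the client-side task description might look like; it reuses the address-file environment variables from the service description above, while the executable path and the surrounding setup are placeholders rather than the exact script used here:

# sketch: a LULESH-side client task that finds the SOMA service through the
# shared address files (executable path is a placeholder)
import radical.pilot as rp

session = rp.Session()
pmgr    = rp.PilotManager(session=session)
tmgr    = rp.TaskManager(session=session)
# ... pilot with the SOMA service submitted and added to tmgr, as above ...

client = rp.TaskDescription()
client.pre_exec = [
    'module load gcc/9.4.0 openmpi/4.1.2-cuda cuda cmake',
    'export SOMA_SERVER_ADDR_FILE=/projappl/project_2006549/radical-pilot/server.add',
    'export SOMA_NODE_ADDR_FILE=/projappl/project_2006549/radical-pilot/node.add',
]
client.executable     = '/path/to/lulesh_client'                    # placeholder
client.ranks          = 1
client.cores_per_rank = 1

tmgr.submit_tasks(client)
tmgr.wait_tasks()
session.close(download=True)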

codecov[bot] commented 1 year ago

Codecov Report

Merging #3026 (57bfad7) into devel (7b1d9d9) will increase coverage by 0.34%. The diff coverage is 50.00%.

@@            Coverage Diff             @@
##            devel    #3026      +/-   ##
==========================================
+ Coverage   42.41%   42.76%   +0.34%     
==========================================
  Files          99       99              
  Lines       10824    10858      +34     
==========================================
+ Hits         4591     4643      +52     
+ Misses       6233     6215      -18     
Files                                  Coverage Δ
src/radical/pilot/agent/agent_0.py     40.54% <50.00%> (-1.97%) :arrow_down:

... and 2 files with indirect coverage changes
