radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

example cleanup #2114

Closed andre-merzky closed 4 years ago

andre-merzky commented 4 years ago

@aydinsaribudak and @AymenFJA began to clean up the RP examples - this ticket is to discuss those changes where needed.

This work lives in the fix/travis branch.

andre-merzky commented 4 years ago

From @aydinsaribudak :

/misc/hello_synapse.py (DONE)
1. The declaration of cud.cores is removed since it is not in the unitmanager schema.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018359.0002
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0009

/misc/ordered_pipelines.py (DONE) 
1. The config is now read with config = ru.read_json('%s/../config.json' % pwd),
so that the config file is found and read successfully.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018359.0009
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0010
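
The path fix above resolves config.json relative to the directory of the example script instead of the current working directory. Below is a minimal stdlib-only sketch of that idea. `load_config` is a hypothetical helper, `json.load` stands in for `ru.read_json` so the sketch runs without radical.utils, and the config content is made up for the demonstration.

```python
import json
import os
import tempfile


def load_config(script_path, rel_path='../config.json'):
    """Load a JSON config located relative to the script's directory.

    Hypothetical helper: the examples themselves call
    ru.read_json('%s/../config.json' % pwd) instead.
    """
    pwd = os.path.dirname(os.path.abspath(script_path))
    cfg_path = os.path.normpath(os.path.join(pwd, rel_path))
    with open(cfg_path) as fin:
        return json.load(fin)


# Throwaway layout: <tmp>/config.json and <tmp>/misc/example.py
tmp_root = tempfile.mkdtemp()
os.mkdir(os.path.join(tmp_root, 'misc'))
with open(os.path.join(tmp_root, 'config.json'), 'w') as fout:
    # made-up content, not the real examples/config.json
    json.dump({'local.localhost': {'cores': 4}}, fout)

# The script file itself does not need to exist for the path arithmetic.
script = os.path.join(tmp_root, 'misc', 'example.py')
config = load_config(script)
print(config['local.localhost']['cores'])
```

In an actual example script, `__file__` would be passed as `script_path`, which keeps the script runnable from any working directory.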

/misc/profile_analysis.py is removed.
1. This is because radical.pilot.utils is missing these attributes: 'prof2frame' and 'combine_profiles'

/misc/rp_app_comm.py (DONE)
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0003
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0013

/misc/rp_app_master.py 
/misc/rp_app_worker.py are removed because
the following variables are not available in os.environ:
'RP_UNIT_ID', 'RP_WORK_QUEUE_IN', 'RP_RESULT_QUEUE_OUT'
For example, this line raises an error: uid = os.environ['RP_UNIT_ID']
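
A lookup like os.environ['RP_UNIT_ID'] raises KeyError when the variable is unset. Should the scripts be revived, a check along the following lines would report all missing variables at once. This is only a sketch: `missing_env` is a hypothetical helper, and a plain dict stands in for the real environment.

```python
import os

REQUIRED = ['RP_UNIT_ID', 'RP_WORK_QUEUE_IN', 'RP_RESULT_QUEUE_OUT']


def missing_env(names, environ=None):
    """Return the names from `names` that are not set in `environ`."""
    if environ is None:
        environ = os.environ
    return [name for name in names if name not in environ]


# An explicit mapping takes the place of os.environ for the demonstration.
fake_env = {'RP_UNIT_ID': 'unit.000000'}
missing = missing_env(REQUIRED, fake_env)
if missing:
    print('missing environment variables: %s' % ', '.join(missing))
```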

/misc/task_overlay.py
/misc/task_overlay_master.py
/misc/task_overlay_worker.py
1. resource is fixed.
2. import statement is fixed.
However, the script requires a second argument (i.e. worker) when run.
It may work with the correct arguments.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0060

/misc/wl_shape_02.py (DONE)
1. pilot.stage_in source path is fixed, and an empty gromacs folder is created 
(maybe this folder should contain some specific scripts similar to those saved 
under ../data/gromacs_mdrun_0/ folder)
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0025

/misc/dynamic_ec2_pilot.py is removed because running this script requires
the following environment variables to be set:
your Amazon EC2 ID, your Amazon EC2 KEY, name of ssh keypair within EC2,
your ssh keypair to use to access the VM.

/misc/colocated.py (DONE)
1. The config is now read with config = ru.read_json('%s/../config.json' % pwd),
so that the config file is found and read successfully.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0031

/misc/benchmark_driver.py
1. rp_host is updated to "local.localhost"
2. rp_project "TG-MCB090174" is removed
3. The declarations of cud.cores and cud.mpi are removed since they are not in the unitmanager schema.
4. unit.execution_locations, unit.start_time, unit.stop_time are excluded from the print
statement since unit has no such attributes.
5. The rp.UnitManager constructor is called with the session only (the scheduler is excluded).
However, the ComputePilots end up in state CANCELED. In addition, the stats plotter comments are removed
since bin/radicalpilot-stats is not working.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0056

/misc/backfilling.py
1. pdesc.resource is updated to "local.localhost"
2. pdesc.project is removed
3. wait(state=rp.ACTIVE) is updated to wait(state=rp.PMGR_ACTIVE)
4. cu.executable='SLEEP' is updated to '/bin/sleep'
However, a callback error is reported in the pmgr log file. A further update is needed.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0053

/misc/backfilling_recovery.py 
1. The rp.Session argument 'name' is updated to 'uid'
2. The resource in the pilot description, "localhost", is updated to "local.localhost"
3. The state argument 'rp.ACTIVE' is updated to 'rp.PMGR_ACTIVE'
4. The declaration of cud.cores is removed since it is not in the unitmanager schema
5. unit.execution_locations, unit.start_time, unit.stop_time are excluded from the print
statement since unit has no such attributes.
However, a callback error is reported in the umgr log file. A further update is needed.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0052

/misc/gpu_pilot.py
The run completes, but the units fail. Confirmation is needed before removing the script.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0061

/misc/running_mpi_executables.py
1. rp_host is updated to "local.localhost"
2. rp_project "TG-MCB090174" is removed
3. The declarations of cud.cores and cud.mpi are removed since they are not in the unitmanager schema.
4. unit.execution_locations, unit.start_time, unit.stop_time are excluded from the print
statement since unit has no such attributes.
However, a callback error is reported in the umgr log file. A further update is needed.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0062

radical-stack
  python               : 3.7.6
  pythonpath           :
  virtualenv           : /home/aydins/env_rptest
  radical.entk         : 1.0.1-v1.0.1-13-gca6384b@devel
  radical.pilot        : 1.2.1
  radical.saga         : 1.2.0
  radical.utils        : 1.2.2

andre-merzky commented 4 years ago

A couple of comments:

aydinsaribudak commented 4 years ago

Hi Andre, regarding the issues we discussed on Slack:

  1. For task_overlay.py: the current version of the code references task_overlay.Master as below:
    import radical.pilot as rp
    rp.task_overlay.Master

    task_overlay.py completes successfully, but task_overlay_master.py generates this error:

    Traceback (most recent call last):
    File "task_overlay_master.py", line 107, in <module>
    class MyMaster(rp.task_overlay.Master):
    AttributeError: module 'radical.pilot' has no attribute 'task_overlay'

If we instead reach task_overlay.Master through the import below:

import radical.pilot.task_overlay as rpt
rpt.Master

task_overlay.py still completes successfully, and task_overlay_master.py no longer generates this error (it gets past this point but fails with a different error).

I prefer to stick with this edit (we import radical.pilot.task_overlay as rpt in the code).
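
The AttributeError is consistent with general Python behavior: a submodule is not an attribute of its parent package unless the package's `__init__` imports it or something imports the submodule explicitly. The sketch below reproduces this with a throwaway package. `demo_pkg` and `overlay` are made-up names that mirror radical.pilot and radical.pilot.task_overlay in structure only, and whether radical.pilot's `__init__` imports task_overlay in a given release is not something this sketch verifies.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package `demo_pkg` with a submodule `overlay`.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, 'demo_pkg')
os.mkdir(pkg_dir)
with open(os.path.join(pkg_dir, '__init__.py'), 'w') as fout:
    fout.write('')                      # __init__ does not import overlay
with open(os.path.join(pkg_dir, 'overlay.py'), 'w') as fout:
    fout.write('class Master(object):\n    pass\n')

sys.path.insert(0, root)
importlib.invalidate_caches()

import demo_pkg

# The submodule is not an attribute of the package until it is imported.
before = hasattr(demo_pkg, 'overlay')

import demo_pkg.overlay as overlay

# The explicit import also binds the submodule on the parent package.
after = hasattr(demo_pkg, 'overlay')

print(before, after, overlay.Master.__name__)
```

The explicit `import ... as` form works regardless of what the package's `__init__` does, which is an argument for keeping it in the examples.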

  2. For backfilling.py: no error message is printed on the console, but the following error is reported in the pmgr.0000.log file.
    local.localhost', 'FAILED']>
    1586550428.311 : pmgr.0000            : 32110 : 140709411669760 : DEBUG    : pilot.0000 calls cb <bound method ComputePilot._default_state_cb of ['pilot.0000', 'local.localhost', 'FAILED']>
    1586550428.311 : pmgr.0000            : 32110 : 140709411669760 : INFO     : [Callback]: pilot pilot.0000 state: FAILED.
    1586550428.311 : pmgr.0000            : 32110 : 140709411669760 : ERROR    : [Callback]: pilot 'pilot.0000' failed (exit)
    1586550428.311 : pmgr.0000            : 32110 : 140709411669760 : ERROR    : listener died
    Traceback (most recent call last):
    File "/home/aydins/env_rptest/lib/python3.7/site-packages/radical/utils/zmq/pubsub.py", line 315, in _listener
    cb(t, m)
    File "/home/aydins/env_rptest/lib/python3.7/site-packages/radical/pilot/pilot_manager.py", line 320, in _state_sub_cb
    if not self._update_pilot(thing, publish=False):
    File "/home/aydins/env_rptest/lib/python3.7/site-packages/radical/pilot/pilot_manager.py", line 361, in _update_pilot
    self._pilots[pid]._update(pilot_dict)
    File "/home/aydins/env_rptest/lib/python3.7/site-packages/radical/pilot/compute_pilot.py", line 206, in _update
    else      : cb([self])
    File "/home/aydins/env_rptest/lib/python3.7/site-packages/radical/pilot/compute_pilot.py", line 152, in _default_state_cb
    ru.cancel_main_thread('int')
    File "/home/aydins/env_rptest/lib/python3.7/site-packages/radical/utils/threads.py", line 164, in cancel_main_thread
    sys.exit()
    SystemExit
  3. For backfilling_recovery.py: "Task unit.000002 state: FAILED, exit code: 2" is printed on the console. In addition, the following error is reported in the umgr.0000.log file.
    1586550943.279 : umgr.0000            : 16038 : 140235505657600 : ERROR    : cb error (unit_state_cb)
    Traceback (most recent call last):
    File "/home/aydins/env_rptest/lib/python3.7/site-packages/radical/pilot/unit_manager.py", line 539, in _unit_cb
    if cb_data: cb(unit, state, cb_data)
    TypeError: unit_state_cb() takes 2 positional arguments but 3 were given
    1586550943.279 : umgr.0000            : 16038 : 140235505657600 : ERROR    : cb error (wait_queue_size_cb)
    Traceback (most recent call last):
    File "/home/aydins/env_rptest/lib/python3.7/site-packages/radical/pilot/unit_manager.py", line 539, in _unit_cb
    if cb_data: cb(unit, state, cb_data)
    TypeError: wait_queue_size_cb() takes 2 positional arguments but 3 were given
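
The two TypeErrors point at a signature mismatch: the traceback shows the manager calling cb(unit, state, cb_data), while the example callbacks accept only two positional arguments. A minimal sketch of the mismatch and one possible fix follows. `dispatch` is a hypothetical stand-in modelled on the single line visible in the traceback; its two-argument fallback branch is an assumption, not taken from the RP source.

```python
def dispatch(cb, unit, state, cb_data=None):
    """Hypothetical stand-in for the callback dispatch seen in the traceback."""
    if cb_data:
        return cb(unit, state, cb_data)
    return cb(unit, state)              # assumed fallback, not from the log


def old_cb(unit, state):
    """Two-argument callback, as in the failing examples."""
    return '%s: %s' % (unit, state)


def new_cb(unit, state, cb_data=None):
    """Callback that also accepts the optional callback data."""
    return '%s: %s (%s)' % (unit, state, cb_data)


try:
    dispatch(old_cb, 'unit.000002', 'FAILED', cb_data={'queue': 1})
    old_ok = True
except TypeError as e:
    old_ok = False
    print('old_cb failed: %s' % e)

result = dispatch(new_cb, 'unit.000002', 'FAILED', cb_data='extra')
print(result)
```

Giving the third parameter a default keeps the callback usable whether or not callback data was registered with it.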

AymenFJA commented 4 years ago

@andre-merzky I propose to delete the following examples.

andre-merzky commented 4 years ago

I don't mind removing them. Please make sure to also remove the data files which are included for some of them, and please make sure they are not referenced in the documentation. Thanks!

andre-merzky commented 4 years ago

@aydinsaribudak : sorry, I missed your last message I think - is that still open?

aydinsaribudak commented 4 years ago

@andre-merzky Yes, we had action items for a couple of scripts:

backfilling.py and backfilling_recovery.py: briefly, I got callback failure messages from these scripts. The logs are reported above.

Another item concerns the task_overlay scripts. I committed the update below for importing the task_overlay module.

import radical.pilot.task_overlay as rpt
rpt.Master

Thanks, Aydin

andre-merzky commented 4 years ago

@aydinsaribudak: I committed some changes to the backfilling examples which make them work as expected (for me). Can you give them a try again, please?

aydinsaribudak commented 4 years ago

@andre-merzky For backfilling.py: the run completes successfully, but I found the following error messages in the session folder:

(env_rptest) aydins@js-17-94:~/radical.pilot/examples/misc/rp.session.js-17-94.jetstream-cloud.org.aydins.018373.0000$ grep -R ERROR *.*
pmgr.0000.log:1587487467.945 : pmgr.0000            : 28928 : 140508110255872 : ERROR    : [Callback]: pilot 'pilot.0000' failed (exit)
pmgr.0000.log:1587487467.945 : pmgr.0000            : 28928 : 140508110255872 : ERROR    : listener died

andre-merzky commented 4 years ago

Is there an exception stack logged near that error message?
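
One way to check this without reading the whole log is to print the lines that follow each ERROR entry, since grep for ERROR alone drops the traceback. A minimal sketch is below. `errors_with_context` and `has_trace` are hypothetical helpers, and the sample lines are abridged (whitespace and frames trimmed) from the pmgr log quoted earlier in this ticket.

```python
def errors_with_context(lines, after=5):
    """Yield (lineno, error_line, following_lines) for each ERROR log entry."""
    lines = list(lines)
    for idx, line in enumerate(lines):
        if ': ERROR ' in line:
            yield idx + 1, line, lines[idx + 1: idx + 1 + after]


def has_trace(following):
    """True if a Python traceback header appears in the following lines."""
    return any(l.strip().startswith('Traceback') for l in following)


# Abridged from the pmgr.0000.log excerpt quoted earlier in this ticket.
sample = [
    "1586550428.311 : pmgr.0000 : 32110 : 140709411669760 : INFO     : [Callback]: pilot pilot.0000 state: FAILED.",
    "1586550428.311 : pmgr.0000 : 32110 : 140709411669760 : ERROR    : [Callback]: pilot 'pilot.0000' failed (exit)",
    "1586550428.311 : pmgr.0000 : 32110 : 140709411669760 : ERROR    : listener died",
    "Traceback (most recent call last):",
    "    cb(t, m)",
    "SystemExit",
]

found = list(errors_with_context(sample, after=3))
for lineno, line, following in found:
    print(lineno, 'traceback follows' if has_trace(following) else 'no traceback')
```

For a real session, the lines of each log file (for example pmgr.0000.log) would be passed in place of `sample`.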

aydinsaribudak commented 4 years ago

I couldn't see any exception stack near the error message. P.S. Here is the radical-stack info for my test environment:

  python               : 3.7.6
  pythonpath           :
  virtualenv           : /home/aydins/env_rptest

  radical.entk         : 1.0.1-v1.0.1-13-gca6384b@devel
  radical.pilot        : 1.2.1
  radical.saga         : 1.2.0
  radical.utils        : 1.2.2

Aymen will give it a try as well and will report whether the error is reproduced in his environment.

aydinsaribudak commented 4 years ago

Hi @andre-merzky For backfilling.py: the run completes successfully and no error message is reported in the session folder. However, the console reports that the pilots are canceled.

[Callback]: ComputePilot 'pilot.0000' state: CANCELED.
[Callback]: ComputePilot 'pilot.0001' state: CANCELED.
[Callback]: ComputePilot 'pilot.0002' state: CANCELED.

There is no pilot folder created under the session folder.

For backfilling_recovery.py: the run completes successfully, but the session folder contains the following error messages:

umgr.0000.log:1587616306.373 : umgr.0000            : 18871 : 140660749350656 : ERROR    : cb error (unit_state_cb)
umgr.0000.log:1587616306.399 : umgr.0000            : 18871 : 140660749350656 : ERROR    : cb error (wait_queue_size_cb)

andre-merzky commented 4 years ago

Cancelled: yes, pilots are usually canceled during termination.

cb error: any exception traces near the error logs? @AymenFJA : can you have a look, please?

AymenFJA commented 4 years ago

I started trying to replicate Aydin's work. I will update the ticket soon.

AymenFJA commented 4 years ago

Examples cleanup progress update:

mtitov commented 4 years ago

do we have a status update on this ticket?

AymenFJA commented 4 years ago

Regarding "do we have a status update on this ticket?":

With molssi and probing removed, I can confirm that the example cleanup is done.