Closed andre-merzky closed 4 years ago
From @aydinsaribudak :
/misc/hello_synapse.py (DONE)
1. declaration of cud.cores is removed since it is not in unitmanager schema.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018359.0002
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0009
/misc/ordered_pipelines.py (DONE)
1. config = ru.read_json('%s/../config.json' %pwd) is the update.
This is to read the config file successfully.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018359.0009
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0010
/misc/profile_analysis.py is removed.
1. This is because radical.pilot.utils is missing these attributes: 'prof2frame' and 'combine_profiles'
/misc/rp_app_comm.py (DONE)
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0003
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0013
/misc/rp_app_master.py
/misc/rp_app_worker.py are removed because
the following are not available in os.environ:
'RP_UNIT_ID', 'RP_WORK_QUEUE_IN', 'RP_RESULT_QUEUE_OUT'
For example, we got an error with this line: uid = os.environ['RP_UNIT_ID']
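The os.environ failure described above can be made easier to diagnose by checking for the variables before they are used. The following is only a minimal sketch: the three variable names come from the report above, while the helper `missing_rp_vars` is a hypothetical name and not part of RADICAL-Pilot.

```python
import os

# Variable names taken from the report above; the helper itself is hypothetical.
REQUIRED_VARS = ('RP_UNIT_ID', 'RP_WORK_QUEUE_IN', 'RP_RESULT_QUEUE_OUT')


def missing_rp_vars(env=None):
    """Return the required variable names that are absent from `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if name not in env]


if __name__ == '__main__':
    missing = missing_rp_vars()
    if missing:
        # Report all missing names at once instead of failing on the first lookup.
        print('missing environment variables: %s' % ', '.join(missing))
    else:
        print('unit id: %s' % os.environ['RP_UNIT_ID'])
```

Run directly from a login shell, a script using this check would list the missing names rather than stop with a KeyError on the first lookup.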
/misc/task_overlay.py
/misc/task_overlay_master.py
/misc/task_overlay_worker.py
1. resource is fixed.
2. import statement is fixed.
However, the script required a second argument (i.e. worker) when I ran it.
It may work with the correct arguments.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0060
/misc/wl_shape_02.py (DONE)
1. pilot.stage_in source path is fixed, and an empty gromacs folder is created
(maybe this folder should contain some specific scripts similar to those saved
under ../data/gromacs_mdrun_0/ folder)
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0025
/misc/dynamic_ec2_pilot.py is removed because running this script requires
the following environment variables to be set:
your Amazon EC2 ID, your Amazon EC2 KEY, the name of the ssh keypair within EC2,
and the ssh keypair to use to access the VM.
/misc/colocated.py (DONE)
1. config = ru.read_json('%s/../config.json' %pwd) is the update.
This is to read the config file successfully.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0031
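The config.json fix used for ordered_pipelines.py and colocated.py resolves the file relative to the script's own directory rather than the current working directory. A rough standard-library equivalent is sketched below. It assumes `pwd` in the examples holds the script directory, uses `json` instead of `ru.read_json`, and the helper name `read_config` and the config contents are placeholders.

```python
import json
import os
import tempfile


def read_config(script_dir):
    """Load config.json from the parent directory of `script_dir`."""
    path = os.path.normpath(os.path.join(script_dir, '..', 'config.json'))
    with open(path) as fin:
        return json.load(fin)


# Self-contained demonstration with a throwaway directory layout:
#   <root>/config.json
#   <root>/misc/          <- stands in for the example's own directory
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, 'misc'))
with open(os.path.join(root, 'config.json'), 'w') as fout:
    json.dump({'local.localhost': {'cores': 2}}, fout)   # placeholder contents

config = read_config(os.path.join(root, 'misc'))
print(config)
```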
/misc/benchmark_driver.py
1. rp_host is updated as "local.localhost"
2. rp_project "TG-MCB090174" is removed
3. declarations of cud.cores and cud.mpi are removed since they are not in unitmanager schema.
4. unit.execution_locations, unit.start_time, unit.stop_time are excluded from the print
statement since unit has no such attributes.
5. rp.UnitManager constructor is called only with session (scheduler is excluded).
However, the state of the ComputePilots is CANCELED. In addition, the stats plotter comments are removed
since bin/radicalpilot-stats is not working.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0056
/misc/backfilling.py
1. pdesc.resource is updated as "local.localhost"
2. pdesc.project is removed
3. wait(state=rp.ACTIVE) is updated to wait(state=rp.PMGR_ACTIVE)
4. cu.executable='SLEEP' is updated to '/bin/sleep'
However, a callback error is reported in the pmgr log file. Further update is needed.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0053
/misc/backfilling_recovery.py
1. rp.Session argument 'name' is updated to 'uid'
2. resource in pilot description "localhost" is updated to "local.localhost"
3. state argument 'rp.ACTIVE' is updated to 'rp.PMGR_ACTIVE'
4. declaration of cud.cores is removed since it is not in unitmanager schema
5. unit.execution_locations, unit.start_time, unit.stop_time are excluded from the print
statement since unit has no such attributes.
However, a callback error is reported in the umgr log file. Further update is needed.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0052
/misc/gpu_pilot.py
The run completed but the units failed. Confirmation is needed before removing the script.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0061
/misc/running_mpi_executables.py
1. rp_host is updated as "local.localhost"
2. rp_project "TG-MCB090174" is removed
3. declarations of cud.cores and cud.mpi are removed since they are not in unitmanager schema.
4. unit.execution_locations, unit.start_time, unit.stop_time are excluded from the print
statement since unit has no such attributes.
However, a callback error is reported in the umgr log file. Further update is needed.
session id: rp.session.js-17-94.jetstream-cloud.org.aydins.018360.0062
radical-stack
python : 3.7.6
pythonpath :
virtualenv : /home/aydins/env_rptest
radical.entk : 1.0.1-v1.0.1-13-gca6384b@devel
radical.pilot : 1.2.1
radical.saga : 1.2.0
radical.utils : 1.2.2
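Several of the changes above drop cud.cores because the key is "not in unitmanager schema", and later in this thread it is noted that cud.cores is now cud.cpu_processes. The sketch below imitates that kind of schema check with a plain dictionary so the rename can be tried without a running session. The key set `SCHEMA` and both helper names are hypothetical stand-ins; the actual schema lives in RADICAL-Pilot and is not reproduced here.

```python
# Hypothetical stand-in for a description schema; the real key set lives in
# RADICAL-Pilot and is not reproduced here.
SCHEMA = {'executable', 'arguments', 'cpu_processes'}


def check_keys(description, schema=SCHEMA):
    """Raise TypeError for the first key that is not part of `schema`."""
    for k in description:
        if k not in schema:
            raise TypeError('key %s not in schema' % k)
    return description


def rename_cores(description):
    """Return a copy with the old `cores` key renamed to `cpu_processes`."""
    updated = dict(description)
    if 'cores' in updated:
        updated['cpu_processes'] = updated.pop('cores')
    return updated


old = {'executable': '/bin/sleep', 'arguments': ['1'], 'cores': 1}
try:
    check_keys(old)
except TypeError as e:
    print(e)                       # the old key is rejected

print(check_keys(rename_cores(old)))   # the renamed key passes the check
```

Renaming rather than deleting the key keeps the requested core count in the description, which may matter for examples that need more than one core.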
A couple of comments:
cud.cores are now cud.cpu_processes.
rp_app_master.py and rp_app_worker.py are used by rp_app_comm.py (they are the workload) - so if we remove them, that example won't work anymore.
'RP_UNIT_ID', 'RP_WORK_QUEUE_IN', 'RP_RESULT_QUEUE_OUT' are all set in the unit environment for the app comm workload.
As with the app_comm example, the task_overlay example has master and worker which form the workload executed on the target resource.
misc/gpu_pilot.py: yes, this can go.
running_mpi_executables.py - this can go, too.
Hi Andre, regarding the issues we discussed on Slack:
task_overlay.py
The current version of the code calls task_overlay.Master as below:
import radical.pilot as rp
rp.task_overlay.Master
task_overlay.py is successfully completed.
task_overlay_master.py generates this error:
Traceback (most recent call last):
File "task_overlay_master.py", line 107, in <module>
class MyMaster(rp.task_overlay.Master):
AttributeError: module 'radical.pilot' has no attribute 'task_overlay'
To call task_overlay.Master, if we make the edit below:
import radical.pilot.task_overlay as rpt
rpt.Master
task_overlay.py is successfully completed.
task_overlay_master.py does not generate the same error (it gets past this one but gives another error).
I prefer to stick with this edit (we import rp.task_overlay as rpt in the code).
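The AttributeError above is consistent with general Python import behaviour: a submodule only becomes an attribute of its parent package once something has imported it. The sketch below reproduces this with a throwaway package on disk. The package name `overlay_demo_pkg` and its `overlay` submodule are made up for the demonstration, and the sketch does not show how radical.pilot itself is laid out.

```python
import importlib
import os
import sys
import tempfile


def build_demo_package(root):
    """Create overlay_demo_pkg/ with an empty __init__.py and overlay.py."""
    pkg = os.path.join(root, 'overlay_demo_pkg')
    os.makedirs(pkg)
    with open(os.path.join(pkg, '__init__.py'), 'w') as fout:
        fout.write('')                       # does NOT import the submodule
    with open(os.path.join(pkg, 'overlay.py'), 'w') as fout:
        fout.write('class Master(object):\n    pass\n')


root = tempfile.mkdtemp()
build_demo_package(root)
sys.path.insert(0, root)
importlib.invalidate_caches()

import overlay_demo_pkg
# Importing only the parent package does not load the submodule.
before = hasattr(overlay_demo_pkg, 'overlay')

import overlay_demo_pkg.overlay as overlay
# The explicit submodule import loads it and binds it on the parent.
after = hasattr(overlay_demo_pkg, 'overlay')
master_cls = overlay.Master

print(before, after, master_cls.__name__)
```

The explicit `import package.submodule as alias` form works whether or not the parent package imports the submodule itself, which is one reason it is the more robust choice here.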
backfilling.py
No error message is printed on the console, but the following error message is reported in the pmgr.0000.log file.
local.localhost', 'FAILED']>
1586550428.311 : pmgr.0000 : 32110 : 140709411669760 : DEBUG : pilot.0000 calls cb <bound method ComputePilot._default_state_cb of ['pilot.0000', 'local.localhost', 'FAILED']>
1586550428.311 : pmgr.0000 : 32110 : 140709411669760 : INFO : [Callback]: pilot pilot.0000 state: FAILED.
1586550428.311 : pmgr.0000 : 32110 : 140709411669760 : ERROR : [Callback]: pilot 'pilot.0000' failed (exit)
1586550428.311 : pmgr.0000 : 32110 : 140709411669760 : ERROR : listener died
Traceback (most recent call last):
File "/home/aydins/env_rptest/lib/python3.7/site-packages/radical/utils/zmq/pubsub.py", line 315, in _listener
cb(t, m)
File "/home/aydins/env_rptest/lib/python3.7/site-packages/radical/pilot/pilot_manager.py", line 320, in _state_sub_cb
if not self._update_pilot(thing, publish=False):
File "/home/aydins/env_rptest/lib/python3.7/site-packages/radical/pilot/pilot_manager.py", line 361, in _update_pilot
self._pilots[pid]._update(pilot_dict)
File "/home/aydins/env_rptest/lib/python3.7/site-packages/radical/pilot/compute_pilot.py", line 206, in _update
else : cb([self])
File "/home/aydins/env_rptest/lib/python3.7/site-packages/radical/pilot/compute_pilot.py", line 152, in _default_state_cb
ru.cancel_main_thread('int')
File "/home/aydins/env_rptest/lib/python3.7/site-packages/radical/utils/threads.py", line 164, in cancel_main_thread
sys.exit()
SystemExit
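The traceback above ends with sys.exit() being reached from the listener's callback chain. In plain Python, a SystemExit raised in a non-main thread ends only that thread, which may be why the log reports "listener died" while nothing is printed on the console. The sketch below shows that general behaviour with the standard threading module. It does not use or reproduce the radical.utils code, and the function name `listener` is only illustrative.

```python
import sys
import threading

events = []


def listener():
    """Stand-in for a listener thread whose callback calls sys.exit()."""
    events.append('callback ran')
    sys.exit()                       # raises SystemExit in this thread only
    events.append('unreachable')     # never executed


thread = threading.Thread(target=listener)
thread.start()
thread.join()

# The main thread continues after the listener thread has ended.
events.append('main thread still running')
print(events)
```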
backfilling_recovery.py
The message 'Task unit.000002 state: FAILED, exit code: 2' is printed on the console. In addition, the following error message is reported in the umgr.0000.log file.
1586550943.279 : umgr.0000 : 16038 : 140235505657600 : ERROR : cb error (unit_state_cb)
Traceback (most recent call last):
File "/home/aydins/env_rptest/lib/python3.7/site-packages/radical/pilot/unit_manager.py", line 539, in _unit_cb
if cb_data: cb(unit, state, cb_data)
TypeError: unit_state_cb() takes 2 positional arguments but 3 were given
1586550943.279 : umgr.0000 : 16038 : 140235505657600 : ERROR : cb error (wait_queue_size_cb)
Traceback (most recent call last):
File "/home/aydins/env_rptest/lib/python3.7/site-packages/radical/pilot/unit_manager.py", line 539, in _unit_cb
if cb_data: cb(unit, state, cb_data)
TypeError: wait_queue_size_cb() takes 2 positional arguments but 3 were given
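Both TypeErrors above come from the line `if cb_data: cb(unit, state, cb_data)`, so callbacks registered with callback data are invoked with three positional arguments. The sketch below reproduces the mismatch with a small stand-in dispatcher and shows a callback signature that accepts both call forms. `dispatch` is hypothetical, and the two-argument call in its fallback branch is an assumption about what the unit manager does when no callback data is set.

```python
def dispatch(cb, unit, state, cb_data=None):
    """Stand-in for the unit manager's callback invocation."""
    if cb_data:
        return cb(unit, state, cb_data)   # form shown in the traceback
    return cb(unit, state)                # assumed form without callback data


def old_unit_state_cb(unit, state):
    """Two-argument callback, as in the failing examples."""
    return (unit, state)


def unit_state_cb(unit, state, cb_data=None):
    """Callback that tolerates an optional third argument."""
    return (unit, state, cb_data)


try:
    dispatch(old_unit_state_cb, 'unit.000002', 'FAILED', cb_data={'queue': []})
except TypeError as e:
    print('old callback failed: %s' % e)

print(dispatch(unit_state_cb, 'unit.000002', 'FAILED', cb_data={'queue': []}))
print(dispatch(unit_state_cb, 'unit.000002', 'DONE'))
```

Giving the third parameter a default keeps the callback usable whether or not it is registered with callback data.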
@andre-merzky I propose to delete the following examples.
[x] kmean
[x] cecam_example_SAL
[ ] gromacs
[ ] mandelbrot
These examples were written by former RADICAL students and are no longer needed. If this proposal gets approved, I will push the changes.
I don't mind removing them. Please make sure to also remove the data files which are included for some of them, and please make sure they are not referenced in the documentation. Thanks!
@aydinsaribudak : sorry, I missed your last message I think - is that still open?
@andre-merzky Yes, we had action items for a couple of scripts:
backfilling.py
backfilling_recovery.py
Briefly, I got callback failure messages from these scripts. The logs are reported above.
One other thing is about the task_overlay scripts.
I committed the update below for importing the task_overlay folder.
import radical.pilot.task_overlay as rpt
rpt.Master
Thanks, Aydin
@aydinsaribudak: I committed some changes to the backfilling examples which make them work as expected (for me). Can you give them a try again, please?
@andre-merzky
For backfilling.py:
The run is successfully completed, but I got the following error messages when I went to the session folder:
(env_rptest) aydins@js-17-94:~/radical.pilot/examples/misc/rp.session.js-17-94.jetstream-cloud.org.aydins.018373.0000$ grep -R ERROR *.*
pmgr.0000.log:1587487467.945 : pmgr.0000 : 28928 : 140508110255872 : ERROR : [Callback]: pilot 'pilot.0000' failed (exit)
pmgr.0000.log:1587487467.945 : pmgr.0000 : 28928 : 140508110255872 : ERROR : listener died
Is there an exception stack logged near that error message?
I couldn't see any exception stack near the error message. P.S. here is the radical-stack info that I have in my test environment:
python : 3.7.6
pythonpath :
virtualenv : /home/aydins/env_rptest
radical.entk : 1.0.1-v1.0.1-13-gca6384b@devel
radical.pilot : 1.2.1
radical.saga : 1.2.0
radical.utils : 1.2.2
Aymen will give it a try as well and will report whether the error is replicated in his environment.
Hi @andre-merzky
For backfilling.py:
The run is successfully completed and no error message is reported in the session folder. However, the console reports that the pilots are canceled.
/[Callback]: ComputePilot 'pilot.0000' state: CANCELED.
\[Callback]: ComputePilot 'pilot.0001' state: CANCELED.
|[Callback]: ComputePilot 'pilot.0002' state: CANCELED.
There is no pilot folder created under the session folder.
For backfilling_recovery.py:
The run is completed successfully, but we have the following error messages under the session folder:
umgr.0000.log:1587616306.373 : umgr.0000 : 18871 : 140660749350656 : ERROR : cb error (unit_state_cb)
umgr.0000.log:1587616306.399 : umgr.0000 : 18871 : 140660749350656 : ERROR : cb error (wait_queue_size_cb)
Cancelled: yes, pilots are usually canceled during termination.
cb error: any exception traces near the error logs? @AymenFJA : can you have a look, please?
I started trying to replicate Aydin's work. I will update the ticket soon.
Examples cleanup progress update:
/data_staging (Passed).
error_handling.py (Passed).
getting_started_osg_2.py (Passed).
getting_started_osg.py (Andre confirmed it will not work):
BadParameter: 'JobDescription.CandidateHosts' (['!FIU_HPCOSG_CE']) not supported by radical.saga.adaptors.shell_job
molssi.py (Will be deleted).
probing.py (Will be deleted):
if k not in schema: raise TypeError('key %s not in schema' % k)
TypeError: key argument not in schema
rp_analytics.py (in progress).
Do we have a status update on this ticket?
Since
do we have a status update on this ticket?
By removing molssi and probing, I can confirm that the example cleanup is done.
@aydinsaribudak and @AymenFJA began to clean up the RP examples - this ticket is to discuss those changes where needed.
This work lives in the fix/travis branch.