Open lee212 opened 4 years ago
The utilization plots were only tested with plain RP sessions. In the error above, the code seems to stumble over non-RP entities (such as RE tasks), whose UID cannot be correctly parsed. I think that RA's `bin/rp_inspect/plot_util.py should, after creation of the session, filter for all RP entities (and thus filter out all other entities (around line 107)
Thanks Andre for the comment. I find <session>_cur.png
, <session>_rate.png
and <session>_conc.png
were generated as an empty though.
I interpret the filtering is like this (with radical.pilot
parameter in the plot_util.py
):
(analytics) [hrlee@login2.summit radical.analytics]$ bin/rp_inspect/plot_util.py re.session.login1.hrlee.018273.0000 radical.pilot
Traceback (most recent call last):
File "bin/rp_inspect/plot_util.py", line 118, in <module>
provided, consumed, stats_abs, stats_rel, info = session.utilization(metrics)
File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/analytics/session.py", line 961, in utilization
provided = rp.utils.get_provided_resources(self)
File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/pilot/utils/prof_utils.py", line 479, in get_provided_resources
data = _get_pilot_provision(session, p)
File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/pilot/utils/prof_utils.py", line 437, in _get_pilot_provision
cpn = pilot.cfg['resource_details']['rm_info']['cores_per_node']
TypeError: 'NoneType' object is not subscriptable
I will try to replicate the same steps with rp only for plotting purpose but I have a question. Does gpu resource affect analytics? I believe it wouldn't but core_per_node
error makes me think gpus_per_node
might create another issue.
I see. Hyungro, can you provide me the session so that I can try to reproduce?
re.session.login1.hrlee.018273.0000.tar.gz pilot profiles added
re.session.login1.hrlee.018273.0000.agent.tar.gz remote sandbox
This problem should be gone when using radical-cybertools/radical.pilot/pull/2032. @lee212 implied on slack that this is not the case: could you please send the stack you are using? This stumbles still over the resource_details
query for you, right?
ValueError
in the original post can be easily fixed by int()
cast condition but regarding to the failing entk session, the error is still there:
$ bin/radical-analytics-inspect re.session.login5.hrlee.018282.0005
re.session.login5.hrlee.018282.0005 ....Traceback (most recent call last):
File "/autofs/nccs-svm1_home1/hrlee/git/radical.analytics/bin/rp_inspect/plot_util.py", line 116, in <module>
prov, cons, stats_abs, stats_rel, info = session.utilization(metrics)
File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/analytics/session.py", line 962, in utilization
consumed = rp.utils.get_consumed_resources(self)
File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/pilot/utils/prof_utils.py", line 563, in get_consumed_resources
p_max = pt(event=PILOT_DURATIONS['consume']['ignore'][1])[-1]
IndexError: list index out of range
My test runs show that entk is missing for cmd
, for example:
>>> pilot_rp[0].events
[[15.303919553756714, 'state', 'pmgr.0000', 'MainThread', 'pilot.0000', 'NEW', '', 'pilot'],
[15.421573162078857, 'state', 'pmgr.0000', 'MainThread', 'pilot.0000', 'PMGR_LAUNCHING_PENDING', '', 'pilot'], [15.422030925750732, 'put', 'pmgr.0000', 'MainThread', 'pilot.0000',
'PMGR_LAUNCHING_PENDING', 'pmgr_launching_queue.put.0000', 'pilot'],
[15.423201084136963, 'get', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_LAUNCHING_PENDING', '', 'pilot'],
[15.486279726028442, 'state', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_LAUNCHING', '', 'pilot'],
[16.787227392196655, 'staging_in_start', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'],
[21.319350242614746, 'staging_in_stop', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'],
[21.53650426864624, 'submission_start', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'],
[24.297909021377563, 'submission_stop', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'],
[24.297931909561157, 'state', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'],
[25.630955934524536, 'bootstrap_0_start', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'],
[25.654655933380127, 've_setup_start', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'],
[27.444655895233154, 've_activate_start', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'],
[28.221256017684937, 've_activate_stop', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'],
[28.229156017303467, 'rp_install_start', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'],
[43.8096559047699, 'rp_install_stop', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'],
[43.81755590438843, 've_setup_stop', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'],
[43.825356006622314, 've_activate_start', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'],
[49.64625597000122, 'sync_rel', 'agent.0', 'MainThread', 'pilot.0000', '', 'agent.0', 'pilot'],
[49.64625597000122, 'sync_rel', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', 'agent.0', 'pilot'], [49.64628076553345, 'hostname', 'agent.0', 'MainThread', 'pilot.0000', '', 'login5.summit.olcf.ornl.gov', 'pilot'],
[72.51676154136658, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', "pilot_heartbeat : {'pmgr': 'pmgr.0000'}", 'pilot'],
[72.51686191558838, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', "pilot_heartbeat : {'pmgr': 'pmgr.0000'}", 'pilot'],
[72.51692461967468, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', "pilot_heartbeat : {'pmgr': 'pmgr.0000'}", 'pilot'],
[72.51698350906372, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', "pilot_heartbeat : {'pmgr': 'pmgr.0000'}", 'pilot'],
[72.64947366714478, 'state', 'agent.0', 'Thread-5', 'pilot.0000', 'PMGR_ACTIVE', '', 'pilot'], [73.33910584449768, 'get', 'agent.0', 'Thread-8', 'pilot.0000', '', 'bulk: 128', 'pilot'],
[78.87498164176941, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', "pilot_heartbeat : {'pmgr': 'pmgr.0000'}", 'pilot'],
[83.1139395236969, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', 'cancel_pilot : []', 'pilot'], [263.38332962989807, 'state', 'pmgr_launching.0000', 'Thread-4', 'pilot.0000', 'CANCELED', '', 'pilot']]
>>> pilot_entk[0].events
[[18.046828508377075, 'state', 'pmgr.0000', 'MainThread', 'pilot.0000', 'NEW', '', 'pilot'],
[18.164909601211548, 'state', 'pmgr.0000', 'MainThread', 'pilot.0000', 'PMGR_LAUNCHING_PENDING', '', 'pilot'],
[18.165395736694336, 'put', 'pmgr.0000', 'MainThread', 'pilot.0000', 'PMGR_LAUNCHING_PENDING', 'pmgr_launching_queue.put.0000', 'pilot'],
[18.175760507583618, 'get', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_LAUNCHING_PENDING', '', 'pilot'],
[18.175876140594482, 'state', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_LAUNCHING', '', 'pilot'],
[19.4575514793396, 'staging_in_start', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'],
[24.4504554271698, 'staging_in_stop', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'],
[24.450796842575073, 'submission_start', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'],
[27.079242706298828, 'submission_stop', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'],
[27.07926607131958, 'state', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'],
[66.70450067520142, 'state', 'pmgr.0000', 'Thread-9', 'pilot.0000', 'PMGR_ACTIVE', '', 'pilot'],
[267.71360754966736, 'state', 'pmgr_launching.0000', 'Thread-3', 'pilot.0000', 'CANCELED', '', 'pilot']]
For the trace purpose, the pt()
tried to collect PILOT_DURATIONS['consume']['ignore'][1]
which is cmd
(see code here):
https://github.com/radical-cybertools/radical.pilot/blob/fix/ra_resource_details/src/radical/pilot/utils/prof_utils.py#L563
but the values look like:
>>> pt_rp(event=PILOT_DURATIONS['consume']['ignore'][1])
[72.51676154136658, 72.51686191558838, 72.51692461967468, 72.51698350906372, 78.87498164176941, 83.1139395236969]
>>> pt_rp(event=PILOT_DURATIONS['consume']['ignore'][1])[-1]
83.1139395236969
this looks correct, which is radical.pilot, but when I see radical.entk:
>>> pt_entk(event=PILOT_DURATIONS['consume']['ignore'][1])
[]
Why I don't see any of the events in the entk profiling? For example, rp_inspect/plot_state.py
filters 7 events e.g. AGENT_STAGING_INPUT but these 'AGENT' events are not in the entk event list. What it has
states: ['NEW', 'UMGR_SCHEDULING_PENDING', 'UMGR_SCHEDULING',
'UMGR_STAGING_INPUT_PENDING', 'UMGR_STAGING_INPUT', 'AGENT_STAGING_INPUT_PENDING',
'UMGR_STAGING_OUTPUT', 'DONE']
whereas rp shows:
states: ['NEW', 'UMGR_SCHEDULING_PENDING', 'UMGR_SCHEDULING',
'UMGR_STAGING_INPUT_PENDING', 'UMGR_STAGING_INPUT', 'AGENT_STAGING_INPUT_PENDING',
'AGENT_STAGING_INPUT', 'AGENT_SCHEDULING_PENDING', 'AGENT_SCHEDULING',
'AGENT_EXECUTING_PENDING', 'AGENT_EXECUTING', 'AGENT_STAGING_OUTPUT_PENDING',
'AGENT_STAGING_OUTPUT', 'UMGR_STAGING_OUTPUT_PENDING', 'UMGR_STAGING_OUTPUT',
'DONE']
As long as RADICAL_PILOT_PROFILE=TRUE
is set, session.close(download=True)
will be called. And the events by profiling are the same on entk like rp, and additional profiling for wfprocessor, wmgr, and tmgr are included in the json.
Maybe this is related to _pilot.cancel()
? if profiling data is recorded before termination?
I am not sure I understand well the issue, apologies. AGENT*
events are not in EnTK as it does not have an agent and, more in general, I do not think EnTK has an event model. The states you listed for EnTK seem correct to me and I do not think you can get any other state from EnTK because those are all the states it has. When you set profiling to true for both RP and EnTK, you should get EnTK states and RP states/events. Is this not happening?
analytics try to filter these states and events for plots, at least for rp. If entk does not have, analytics need to use different logics or look for other states/events that entk provides. I assume that (from your comment "you should get EnTK states ..."), my session provides possible EnTK states according to this: https://github.com/radical-cybertools/radical.entk/blob/devel/src/radical/entk/states.py.
However, I don't see them although RADICAL_ENTK_PROFILE=True
is set. Let me dig into this with that.
Why I don't see any of the events in the entk profiling? For example, rp_inspect/plot_state.py filters 7 events e.g. AGENT_STAGING_INPUT but these 'AGENT' events are not in the entk event list. What it has [...]
@lee212 : can you please send me a small reproducer so that I can see how you produced the event lists?
import radical.analytics as ra
session = ra.Session.create("re.session.login1.hrlee.018285.0000", "radical.entk")
pilot_entk = session.get(etype="pilot")
pilot_entk[0].events
That is not one of the sessions attached here I think? Do you mind adding it to the ticket? Thanks!
states like AGENT_EXECUTING
are no pilot states, so I am not sure where that comes from. I don't see that in other sessions from this ticket. For example, when I run this code:
#!/usr/bin/env python3
import radical.analytics as ra
sid = 're.session.login1.hrlee.018273.0000'
session = ra.Session.create(sid, 'radical.entk')
pilots = session.get(etype='pilot')
for e in pilots[0].events:
print('event: ', e[1])
for e in pilots[0].events:
print('state: ', e[5])
with the session re.session.login1.hrlee.018273.0000
you attached above (including the agent profiles from the separate tarball), I get this output:
$ ./t.py | sort | uniq -c
1 event: bootstrap_0_start
484 event: cmd
5 event: get
1 event: hostname
1 event: put
1 event: rp_install_start
1 event: rp_install_stop
1 event: staging_in_start
1 event: staging_in_stop
6 event: state
1 event: submission_start
1 event: submission_stop
2 event: sync_rel
2 event: ve_activate_start
1 event: ve_activate_stop
1 event: ve_setup_start
1 event: ve_setup_stop
494 state:
1 state: CANCELED
1 state: NEW
1 state: PMGR_ACTIVE
10 state: PMGR_ACTIVE_PENDING
1 state: PMGR_LAUNCHING
3 state: PMGR_LAUNCHING_PENDING
which seems like the expected set of states and events for a pilot. I guess I would need to look at your specific session to understand what's happening...
This is interesting, same script I ran and the result looks like:
(analytics) [hrlee@login4.summit radical.analytics]$ python t.py|sort|uniq -c
1 event: get
1 event: put
1 event: staging_in_start
1 event: staging_in_stop
6 event: state
1 event: submission_start
1 event: submission_stop
4 state:
1 state: CANCELED
1 state: NEW
1 state: PMGR_ACTIVE
1 state: PMGR_ACTIVE_PENDING
1 state: PMGR_LAUNCHING
3 state: PMGR_LAUNCHING_PENDING
Any of the plots seems valid by radical-pilot-inspect
command but I was looking at utilization first which runs bin/rp_inspect/plot_util.py
. Specifically, one error is generated when pilot.cfg['resource_details']['rm_info']['cores_per_node']
is None
and the other is when pt(event=PILOT_DURATIONS['consume']['ignore'][1])[-1]
is None
, if you look at the early comments in this ticket. These don't explain much but both errors happened in the ra.session.utilization
method which starts with stype == radical.pilot
assert. It might be okay if entk contains same states and events to calculate durations of cmds and others, otherwise, this pilot only has to be removed to call a utilization method from entk. If that is the case, the profiling methods of entk need some work (looking at radical.entk/utils/prof_utils.py
) because necessary methods do not exist on entk, such as get_provided_resources
, get_consumed_resources
or can it be duplicate from RP?
RP seeks duration of tasks between PMGR_ACTIVE and termination but does entk have same state and event ?
e.g. {ru.STATE: rps.PMGR_ACTIVE }, {ru.EVENT: 'cmd' }]
Are you really sure that the session dir contains the agent profiles under 'pilot.0000/'? If so, can you please attach an tarball of the dir again? Thanks.
No, I don't think I had it, so I guess *.prof
and *.json
are not all for profiling. Let me try again.
Two updates:
empty plots (conc, rate, state):
radical-pilot-fetch-db <session uid>
downloads profiles in the session dir but the pilot.0000/
was nested under the uid again e.g. re.session.login4.hrlee.018309.0001/re.session.login4.hrlee.018309.0001/pilot.0000
. After manually moving 1 level up, analytics were able to read *.prof
. This change resolves empty plots except utilization.
failed to generate resource utilization due to missing resource_details
*.json
indeed does not contain resource_details, it shows None
although it would be like:
{'lm_detail': None,
'lm_info': {'FORK': {'version': '0.42', 'version_detail': 'There is no spoon'}},
'rm_info': {'agent_nodes': {}, 'cores_per_node': 8, 'gpus_per_node': 1,
'lfs_per_node': {'path': '/tmp', 'size': 1024},
'lm_info': {'version_info': {'FORK': {'version': '0.42', 'version_detail': 'There is no spoon'}}},
'mem_per_node': 0, 'name': 'Fork',
'node_list': [['localhost', 'localhost_0']]}}
I keep searching where this difference is created.
Can this be the one removing resource_details
? https://github.com/radical-cybertools/radical.pilot/blob/690b5486126850740e21a801da6b4844f95bb8ee/src/radical/pilot/pmgr/launching/default.py#L498
Because I find resource_details exist at agent.0.log
@lee212 , @andre-merzky can this be closed now? Did we find a solution for resource_details
?
resource_details
may still be missing if the pilot terminates early. I am not sure if we still have an issue otherwise. Either way, since parts of that code have changed in RP, we should at least try to reconfirm that this still fails. There is no doubt that this should be working out of the box (for complete sessions).
The TypeError
message was seen in the recent sessions with the latest stacks, and yes the resource_details
is missing with the early termination. There is another case that wasn't terminated early but broken i.e., re.session.login5.hrlee.018425.0020 and so I can't tell whether this is fixed or not.
This might or might not be related to py3 transition with entk but the session data to plot analytics is getting errors.