radical-cybertools / radical.analytics

Analytics for RADICAL-Cybertools
Other
1 stars 1 forks source link

failed processing entk session #108

Open lee212 opened 4 years ago

lee212 commented 4 years ago

This might or might not be related to py3 transition with entk but the session data to plot analytics is getting errors.

(analytics) [hrlee@login1.summit radical.analytics]$ bin/radical-analytics-inspect re.session.login1.hrlee.018273.0000
re.session.login1.hrlee.018273.0000 Traceback (most recent call last):
  File "bin/rp_inspect/plot_state.py", line 77, in <module>
    in thing.description.get('name', '').split()]
  File "bin/rp_inspect/plot_state.py", line 76, in <listcomp>
    pipe[thing.uid] = [int(x) for x
ValueError: invalid literal for int() with base 10: 'task.0000,None,stage.0000,MD,pipeline.0000,MD_ML'
...Traceback (most recent call last):
  File "bin/rp_inspect/plot_util.py", line 118, in <module>
    provided, consumed, stats_abs, stats_rel, info = session.utilization(metrics)
  File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/analytics/session.py", line 961, in utilization
    provided  = rp.utils.get_provided_resources(self)
  File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/pilot/utils/prof_utils.py", line 479, in get_provided_resources
    data = _get_pilot_provision(session, p)
  File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/pilot/utils/prof_utils.py", line 437, in _get_pilot_provision
    cpn   = pilot.cfg['resource_details']['rm_info']['cores_per_node']
TypeError: 'NoneType' object is not subscriptable
 done
andre-merzky commented 4 years ago

The utilization plots were only tested with plain RP sessions. In the error above, the code seems to stumble over non-RP entities (such as RE tasks), whose UID cannot be correctly parsed. I think that RA's `bin/rp_inspect/plot_util.py should, after creation of the session, filter for all RP entities (and thus filter out all other entities (around line 107)

lee212 commented 4 years ago

Thanks Andre for the comment. I find <session>_cur.png, <session>_rate.png and <session>_conc.png were generated as an empty though.

I interpret the filtering is like this (with radical.pilot parameter in the plot_util.py):

(analytics) [hrlee@login2.summit radical.analytics]$ bin/rp_inspect/plot_util.py re.session.login1.hrlee.018273.0000 radical.pilot
Traceback (most recent call last):
  File "bin/rp_inspect/plot_util.py", line 118, in <module>
    provided, consumed, stats_abs, stats_rel, info = session.utilization(metrics)
  File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/analytics/session.py", line 961, in utilization
    provided  = rp.utils.get_provided_resources(self)
  File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/pilot/utils/prof_utils.py", line 479, in get_provided_resources
    data = _get_pilot_provision(session, p)
  File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/pilot/utils/prof_utils.py", line 437, in _get_pilot_provision
    cpn   = pilot.cfg['resource_details']['rm_info']['cores_per_node']
TypeError: 'NoneType' object is not subscriptable

I will try to replicate the same steps with rp only for plotting purpose but I have a question. Does gpu resource affect analytics? I believe it wouldn't but core_per_node error makes me think gpus_per_node might create another issue.

andre-merzky commented 4 years ago

I see. Hyungro, can you provide me the session so that I can try to reproduce?

lee212 commented 4 years ago

re.session.login1.hrlee.018273.0000.tar.gz

lee212 commented 4 years ago

re.session.login1.hrlee.018273.0000.tar.gz pilot profiles added

lee212 commented 4 years ago

re.session.login1.hrlee.018273.0000.agent.tar.gz remote sandbox

andre-merzky commented 4 years ago

This problem should be gone when using radical-cybertools/radical.pilot/pull/2032. @lee212 implied on slack that this is not the case: could you please send the stack you are using? This stumbles still over the resource_details query for you, right?

lee212 commented 4 years ago

ValueError in the original post can be easily fixed by int() cast condition but regarding to the failing entk session, the error is still there:

$ bin/radical-analytics-inspect re.session.login5.hrlee.018282.0005
re.session.login5.hrlee.018282.0005 ....Traceback (most recent call last):
  File "/autofs/nccs-svm1_home1/hrlee/git/radical.analytics/bin/rp_inspect/plot_util.py", line 116, in <module>
    prov, cons, stats_abs, stats_rel, info = session.utilization(metrics)
  File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/analytics/session.py", line 962, in utilization
    consumed  = rp.utils.get_consumed_resources(self)
  File "/ccs/home/hrlee/.conda/envs/analytics/lib/python3.7/site-packages/radical/pilot/utils/prof_utils.py", line 563, in get_consumed_resources
    p_max = pt(event=PILOT_DURATIONS['consume']['ignore'][1])[-1]
IndexError: list index out of range

My test runs show that entk is missing for cmd, for example:

>>> pilot_rp[0].events
[[15.303919553756714, 'state', 'pmgr.0000', 'MainThread', 'pilot.0000', 'NEW', '', 'pilot'], 
[15.421573162078857, 'state', 'pmgr.0000', 'MainThread', 'pilot.0000', 'PMGR_LAUNCHING_PENDING', '', 'pilot'], [15.422030925750732, 'put', 'pmgr.0000', 'MainThread', 'pilot.0000', 
'PMGR_LAUNCHING_PENDING', 'pmgr_launching_queue.put.0000', 'pilot'], 
[15.423201084136963, 'get', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_LAUNCHING_PENDING', '', 'pilot'], 
[15.486279726028442, 'state', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_LAUNCHING', '', 'pilot'], 
[16.787227392196655, 'staging_in_start', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'], 
[21.319350242614746, 'staging_in_stop', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'], 
[21.53650426864624, 'submission_start', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'], 
[24.297909021377563, 'submission_stop', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'], 
[24.297931909561157, 'state', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[25.630955934524536, 'bootstrap_0_start', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[25.654655933380127, 've_setup_start', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[27.444655895233154, 've_activate_start', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[28.221256017684937, 've_activate_stop', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[28.229156017303467, 'rp_install_start', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[43.8096559047699, 'rp_install_stop', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[43.81755590438843, 've_setup_stop', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[43.825356006622314, 've_activate_start', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'], 
[49.64625597000122, 'sync_rel', 'agent.0', 'MainThread', 'pilot.0000', '', 'agent.0', 'pilot'], 
[49.64625597000122, 'sync_rel', 'bootstrap_0', 'MainThread', 'pilot.0000', 'PMGR_ACTIVE_PENDING', 'agent.0', 'pilot'], [49.64628076553345, 'hostname', 'agent.0', 'MainThread', 'pilot.0000', '', 'login5.summit.olcf.ornl.gov', 'pilot'],
[72.51676154136658, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', "pilot_heartbeat : {'pmgr': 'pmgr.0000'}", 'pilot'], 
[72.51686191558838, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', "pilot_heartbeat : {'pmgr': 'pmgr.0000'}", 'pilot'], 
[72.51692461967468, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', "pilot_heartbeat : {'pmgr': 'pmgr.0000'}", 'pilot'], 
[72.51698350906372, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', "pilot_heartbeat : {'pmgr': 'pmgr.0000'}", 'pilot'], 
[72.64947366714478, 'state', 'agent.0', 'Thread-5', 'pilot.0000', 'PMGR_ACTIVE', '', 'pilot'], [73.33910584449768, 'get', 'agent.0', 'Thread-8', 'pilot.0000', '', 'bulk: 128', 'pilot'], 
[78.87498164176941, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', "pilot_heartbeat : {'pmgr': 'pmgr.0000'}", 'pilot'], 
[83.1139395236969, 'cmd', 'agent.0', 'Thread-7', 'pilot.0000', '', 'cancel_pilot : []', 'pilot'], [263.38332962989807, 'state', 'pmgr_launching.0000', 'Thread-4', 'pilot.0000', 'CANCELED', '', 'pilot']]

>>> pilot_entk[0].events
[[18.046828508377075, 'state', 'pmgr.0000', 'MainThread', 'pilot.0000', 'NEW', '', 'pilot'], 
[18.164909601211548, 'state', 'pmgr.0000', 'MainThread', 'pilot.0000', 'PMGR_LAUNCHING_PENDING', '', 'pilot'], 
[18.165395736694336, 'put', 'pmgr.0000', 'MainThread', 'pilot.0000', 'PMGR_LAUNCHING_PENDING', 'pmgr_launching_queue.put.0000', 'pilot'], 
[18.175760507583618, 'get', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_LAUNCHING_PENDING', '', 'pilot'],
 [18.175876140594482, 'state', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_LAUNCHING', '', 'pilot'], 
[19.4575514793396, 'staging_in_start', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'], 
[24.4504554271698, 'staging_in_stop', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'], 
[24.450796842575073, 'submission_start', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'], 
[27.079242706298828, 'submission_stop', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', '', '', 'pilot'], 
[27.07926607131958, 'state', 'pmgr_launching.0000', 'Thread-1', 'pilot.0000', 'PMGR_ACTIVE_PENDING', '', 'pilot'],
 [66.70450067520142, 'state', 'pmgr.0000', 'Thread-9', 'pilot.0000', 'PMGR_ACTIVE', '', 'pilot'], 
[267.71360754966736, 'state', 'pmgr_launching.0000', 'Thread-3', 'pilot.0000', 'CANCELED', '', 'pilot']]
lee212 commented 4 years ago

For the trace purpose, the pt() tried to collect PILOT_DURATIONS['consume']['ignore'][1] which is cmd (see code here): https://github.com/radical-cybertools/radical.pilot/blob/fix/ra_resource_details/src/radical/pilot/utils/prof_utils.py#L563

but the values look like:

>>> pt_rp(event=PILOT_DURATIONS['consume']['ignore'][1])
[72.51676154136658, 72.51686191558838, 72.51692461967468, 72.51698350906372, 78.87498164176941, 83.1139395236969]
>>> pt_rp(event=PILOT_DURATIONS['consume']['ignore'][1])[-1]
83.1139395236969

this looks correct, which is radical.pilot, but when I see radical.entk:

>>> pt_entk(event=PILOT_DURATIONS['consume']['ignore'][1])
[]
lee212 commented 4 years ago

Why I don't see any of the events in the entk profiling? For example, rp_inspect/plot_state.py filters 7 events e.g. AGENT_STAGING_INPUT but these 'AGENT' events are not in the entk event list. What it has

states: ['NEW', 'UMGR_SCHEDULING_PENDING', 'UMGR_SCHEDULING', 
'UMGR_STAGING_INPUT_PENDING', 'UMGR_STAGING_INPUT', 'AGENT_STAGING_INPUT_PENDING', 
'UMGR_STAGING_OUTPUT', 'DONE']

whereas rp shows:

states: ['NEW', 'UMGR_SCHEDULING_PENDING', 'UMGR_SCHEDULING', 
'UMGR_STAGING_INPUT_PENDING', 'UMGR_STAGING_INPUT', 'AGENT_STAGING_INPUT_PENDING', 
'AGENT_STAGING_INPUT', 'AGENT_SCHEDULING_PENDING', 'AGENT_SCHEDULING', 
'AGENT_EXECUTING_PENDING', 'AGENT_EXECUTING', 'AGENT_STAGING_OUTPUT_PENDING', 
'AGENT_STAGING_OUTPUT', 'UMGR_STAGING_OUTPUT_PENDING', 'UMGR_STAGING_OUTPUT', 
'DONE']

As long as RADICAL_PILOT_PROFILE=TRUE is set, session.close(download=True) will be called. And the events by profiling are the same on entk like rp, and additional profiling for wfprocessor, wmgr, and tmgr are included in the json.

Maybe this is related to _pilot.cancel()? if profiling data is recorded before termination?

mturilli commented 4 years ago

I am not sure I understand well the issue, apologies. AGENT* events are not in EnTK as it does not have an agent and, more in general, I do not think EnTK has an event model. The states you listed for EnTK seem correct to me and I do not think you can get any other state from EnTK because those are all the states it has. When you set profiling to true for both RP and EnTK, you should get EnTK states and RP states/events. Is this not happening?

lee212 commented 4 years ago

analytics try to filter these states and events for plots, at least for rp. If entk does not have, analytics need to use different logics or look for other states/events that entk provides. I assume that (from your comment "you should get EnTK states ..."), my session provides possible EnTK states according to this: https://github.com/radical-cybertools/radical.entk/blob/devel/src/radical/entk/states.py. However, I don't see them although RADICAL_ENTK_PROFILE=True is set. Let me dig into this with that.

andre-merzky commented 4 years ago

Why I don't see any of the events in the entk profiling? For example, rp_inspect/plot_state.py filters 7 events e.g. AGENT_STAGING_INPUT but these 'AGENT' events are not in the entk event list. What it has [...]

@lee212 : can you please send me a small reproducer so that I can see how you produced the event lists?

lee212 commented 4 years ago
import radical.analytics as ra
session = ra.Session.create("re.session.login1.hrlee.018285.0000", "radical.entk")
pilot_entk = session.get(etype="pilot")
pilot_entk[0].events
andre-merzky commented 4 years ago

That is not one of the sessions attached here I think? Do you mind adding it to the ticket? Thanks!

andre-merzky commented 4 years ago

states like AGENT_EXECUTING are no pilot states, so I am not sure where that comes from. I don't see that in other sessions from this ticket. For example, when I run this code:

#!/usr/bin/env python3

import radical.analytics as ra

sid     = 're.session.login1.hrlee.018273.0000'
session = ra.Session.create(sid, 'radical.entk')
pilots  = session.get(etype='pilot')

for e in pilots[0].events:
    print('event: ', e[1])

for e in pilots[0].events:
    print('state: ', e[5])

with the session re.session.login1.hrlee.018273.0000 you attached above (including the agent profiles from the separate tarball), I get this output:

$ ./t.py  | sort | uniq -c
      1 event:  bootstrap_0_start
    484 event:  cmd
      5 event:  get
      1 event:  hostname
      1 event:  put
      1 event:  rp_install_start
      1 event:  rp_install_stop
      1 event:  staging_in_start
      1 event:  staging_in_stop
      6 event:  state
      1 event:  submission_start
      1 event:  submission_stop
      2 event:  sync_rel
      2 event:  ve_activate_start
      1 event:  ve_activate_stop
      1 event:  ve_setup_start
      1 event:  ve_setup_stop
    494 state:
      1 state:  CANCELED
      1 state:  NEW
      1 state:  PMGR_ACTIVE
     10 state:  PMGR_ACTIVE_PENDING
      1 state:  PMGR_LAUNCHING
      3 state:  PMGR_LAUNCHING_PENDING

which seems like the expected set of states and events for a pilot. I guess I would need to look at your specific session to understand what's happening...

lee212 commented 4 years ago

This is interesting, same script I ran and the result looks like:

(analytics) [hrlee@login4.summit radical.analytics]$ python t.py|sort|uniq -c
      1 event:  get
      1 event:  put
      1 event:  staging_in_start
      1 event:  staging_in_stop
      6 event:  state
      1 event:  submission_start
      1 event:  submission_stop
      4 state:
      1 state:  CANCELED
      1 state:  NEW
      1 state:  PMGR_ACTIVE
      1 state:  PMGR_ACTIVE_PENDING
      1 state:  PMGR_LAUNCHING
      3 state:  PMGR_LAUNCHING_PENDING
lee212 commented 4 years ago

Any of the plots seems valid by radical-pilot-inspect command but I was looking at utilization first which runs bin/rp_inspect/plot_util.py. Specifically, one error is generated when pilot.cfg['resource_details']['rm_info']['cores_per_node'] is None and the other is when pt(event=PILOT_DURATIONS['consume']['ignore'][1])[-1] is None, if you look at the early comments in this ticket. These don't explain much but both errors happened in the ra.session.utilization method which starts with stype == radical.pilot assert. It might be okay if entk contains same states and events to calculate durations of cmds and others, otherwise, this pilot only has to be removed to call a utilization method from entk. If that is the case, the profiling methods of entk need some work (looking at radical.entk/utils/prof_utils.py) because necessary methods do not exist on entk, such as get_provided_resources, get_consumed_resources or can it be duplicate from RP? RP seeks duration of tasks between PMGR_ACTIVE and termination but does entk have same state and event ? e.g. {ru.STATE: rps.PMGR_ACTIVE }, {ru.EVENT: 'cmd' }]

andre-merzky commented 4 years ago

Are you really sure that the session dir contains the agent profiles under 'pilot.0000/'? If so, can you please attach an tarball of the dir again? Thanks.

lee212 commented 4 years ago

No, I don't think I had it, so I guess *.prof and *.json are not all for profiling. Let me try again.

lee212 commented 4 years ago

Two updates:

I keep searching where this difference is created.

lee212 commented 4 years ago

Can this be the one removing resource_details? https://github.com/radical-cybertools/radical.pilot/blob/690b5486126850740e21a801da6b4844f95bb8ee/src/radical/pilot/pmgr/launching/default.py#L498

Because I find resource_details exist at agent.0.log

mturilli commented 4 years ago

@lee212 , @andre-merzky can this be closed now? Did we find a solution for resource_details?

andre-merzky commented 4 years ago

resource_details may still be missing if the pilot terminates early. I am not sure if we still have an issue otherwise. Either way, since parts of that code have changed in RP, we should at least try to reconfirm that this still fails. There is no doubt that this should be working out of the box (for complete sessions).

lee212 commented 4 years ago

The TypeError message was seen in the recent sessions with the latest stacks, and yes the resource_details is missing with the early termination. There is another case that wasn't terminated early but broken i.e., re.session.login5.hrlee.018425.0020 and so I can't tell whether this is fixed or not.