radical-cybertools / ExTASY


Amber/CoCo - polling interval query #96

Closed CharlieLaughton closed 9 years ago

CharlieLaughton commented 9 years ago

I notice when running the Amber/CoCo test jobs on Stampede that amongst the voluminous debug messages are periods when the same message is repeated many times per second. Can I just check that this doesn't reflect some actual network traffic? A snippet from the log file is shown below:

[...]
2014:11:05 10:49:28 radical.pilot.MainProcess: [INFO ] RUN ComputeUnit '545a007e02529a0abd7ff9ec' state changed from 'Scheduling' to 'Executing'.
2014:11:05 10:49:28 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:28 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:28 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:28 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:28 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:28 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:29 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:29 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:29 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:29 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:29 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:29 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:29 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:29 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:29 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:29 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:30 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:30 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:30 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:30 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:30 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:30 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:30 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:30 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:30 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:49:30 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing

[many more lines like this...]

2014:11:05 10:50:16 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:16 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:16 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:16 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:17 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:17 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:17 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:17 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:17 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:17 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:17 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:17 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:17 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:17 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:18 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:18 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:18 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:18 radical.pilot.MainProcess: [INFO ] Performing periodical health check for 5459fff902529a0abd7ff9da (SAGA job id [slurm+ssh://stampede.tacc.utexas.edu/]-[4394942])
2014:11:05 10:50:18 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:18 2749 PilotLauncherWorker-1 saga.SLURMJobService : [DEBUG ] run_sync: scontrol show job 4394942
2014:11:05 10:50:18 radical.pilot.MainProcess: [DEBUG ] write: [ 29] [ 26](scontrol show job 4394942n)
2014:11:05 10:50:18 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:18 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:18 radical.pilot.MainProcess: [DEBUG ] read : [ 29] [ 971](JobId=4394942 Name=SAGAPythonS ... 02529a0abd7ff9dannPROMPT-0-)
2014:11:05 10:50:18 2749 PilotLauncherWorker-1 saga.SLURMJobService : [DEBUG ] run_sync: scontrol show job 4394942
2014:11:05 10:50:18 radical.pilot.MainProcess: [DEBUG ] write: [ 29] [ 26](scontrol show job 4394942n)
2014:11:05 10:50:18 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:18 radical.pilot.MainProcess: [DEBUG ] Compute unit 545a007e02529a0abd7ff9ec in state Executing
2014:11:05 10:50:18 radical.pilot.MainProcess: [DEBUG ] read : [ 29] [ 971](JobId=4394942 Name=SAGAPythonS ... 02529a0abd7ff9dannPROMPT-0-)
2014:11:05 10:50:18 radical.pilot.MainProcess: [INFO ] pilot 5459fff902529a0abd7ff9da seems alive and well

[..etc...]

oleweidner commented 9 years ago

These are actually callbacks being triggered: when the Pilot becomes 'active', the CUs change their state to 'executing' all at once. There's no polling involved, so it doesn't cause any performance problems.

Generally, none of these log messages will show up unless you set the "VERBOSE" environment variables. At the moment we do this for testing purposes, but in production it shouldn't be necessary.
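For readers unfamiliar with this pattern, here is a minimal sketch of how an environment variable can gate log verbosity, using only Python's standard `logging` module. It is not RADICAL-Pilot's actual code: the variable name `EXAMPLE_VERBOSE`, the logger name `example.pilot`, and the function `configure_logging` are all hypothetical and chosen for illustration only.

```python
import logging
import os

# Hypothetical mapping from the variable's value to a logging level.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def configure_logging(env=None, var_name="EXAMPLE_VERBOSE"):
    """Return a logger whose level is taken from an environment variable.

    `var_name` is an illustrative name, not a real RADICAL-Pilot setting.
    If the variable is unset or unrecognised, the level falls back to
    WARNING, so DEBUG and INFO messages are suppressed.
    """
    env = os.environ if env is None else env
    level = LEVELS.get(env.get(var_name, "").upper(), logging.WARNING)
    logger = logging.getLogger("example.pilot")
    logger.setLevel(level)
    return logger


# Variable unset: DEBUG messages are filtered out.
quiet = configure_logging(env={})
quiet_debug_enabled = quiet.isEnabledFor(logging.DEBUG)

# Variable set to "debug": DEBUG messages pass through.
verbose = configure_logging(env={"EXAMPLE_VERBOSE": "debug"})
verbose_debug_enabled = verbose.isEnabledFor(logging.DEBUG)
```

With the variable unset, the repeated DEBUG lines from the report above would never reach the log, which matches the behaviour described for production runs.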

I am working on the ExTASY "load profile" (CPU utilization etc.) and I should have something ready by tomorrow.

andre-merzky commented 9 years ago

Ole, those callbacks are all from the same CU. I don't think a state change for one CU should be reported more than once. Or are those states actually recorded repeatedly in the state history of the CU? That would also be wrong, I think.
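To make the expectation concrete, here is a small sketch of a unit that notifies its callbacks only on a real state transition and records each state once in its history. This is an illustration under assumptions, not the RADICAL-Pilot implementation: the `Unit` class, `register_callback`, and `update_state` are hypothetical names.

```python
class Unit:
    """Hypothetical compute unit that reports each state change once."""

    def __init__(self, uid, state="Scheduling"):
        self.uid = uid
        self.state = state
        self.history = [state]   # one entry per distinct transition
        self._callbacks = []

    def register_callback(self, func):
        """Register func(unit, new_state) to be called on state changes."""
        self._callbacks.append(func)

    def update_state(self, new_state):
        """Apply new_state; notify callbacks only if the state changed.

        Returns True if a transition happened, False if new_state was
        the same as the current state (in which case nothing is logged,
        recorded, or reported).
        """
        if new_state == self.state:
            return False
        self.state = new_state
        self.history.append(new_state)
        for callback in self._callbacks:
            callback(self, new_state)
        return True


# Repeated updates with an unchanged state trigger exactly one callback.
unit = Unit("545a007e02529a0abd7ff9ec")
seen = []
unit.register_callback(lambda u, s: seen.append((u.uid, s)))
for _ in range(5):
    unit.update_state("Executing")
```

After the loop, `seen` holds a single entry and `unit.history` contains 'Scheduling' followed by 'Executing', regardless of how often the same state is reported upstream.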

oleweidner commented 9 years ago

It's in the code like that: https://github.com/radical-cybertools/radical.pilot/blob/master/src/radical/pilot/compute_unit.py#L400

oleweidner commented 9 years ago

I have removed this high-frequency debug output in the RADICAL-Pilot master branch. This should disappear with the next release ;-)