sched-ext / scx

sched_ext schedulers and tools
https://bit.ly/scx_slack
GNU General Public License v2.0
691 stars 48 forks source link

scx_utils: Add log_recorder module for metrics-rs #384

Closed jfernandez closed 1 week ago

jfernandez commented 1 week ago

This change adds a new module to the scx_utils crate that provides a log recorder for metrics-rs. The log recorder will log all metrics to the console at a configurable interval in an easy to read format. Each metric type will be displayed in a separate section. Indentation will be used to show the hierarchy of the metrics. This results in a more verbose output, but it is easier to read and understand.

scx_rusty was updated to use the log recorder and all explicit metric logging was removed.

Counters will show the total count and the rate of change per second. Counters with an additional label, like type in dispatched_tasks_total in rusty, will show the count, rate, and percentage of the total count.

Counters:
  dispatched_tasks_total: 65559 [1344.8/s]
    prev_idle: 44963 (68.6%) [966.5/s]
    wsync_prev_idle: 15696 (23.9%) [317.3/s]
    direct_dispatch: 2833 (4.3%) [35.3/s]
    dsq: 1804 (2.8%) [21.3/s]
    wsync: 262 (0.4%) [4.3/s]
    direct_greedy: 1 (0.0%) [0.0/s]
    pinned: 0 (0.0%) [0.0/s]
    greedy_idle: 0 (0.0%) [0.0/s]
    greedy_xnuma: 0 (0.0%) [0.0/s]
    direct_greedy_far: 0 (0.0%) [0.0/s]
    greedy_local: 0 (0.0%) [0.0/s]
  dl_clamped_total: 1290 [20.3/s]
  dl_preset_total: 514 [1.0/s]
  kick_greedy_total: 6 [0.3/s]
  lb_data_errors_total: 0 [0.0/s]
  load_balance_total: 0 [0.0/s]
  repatriate_total: 0 [0.0/s]
  task_errors_total: 0 [0.0/s]

Gauges will show the last set value:

Gauges:
  slice_length_us: 20000.00

Histograms will show the average, min, and max. The histogram will be reset after each log interval to avoid memory leaks, since the data structure that holds the samples is unbounded.

Histograms:
  cpu_busy_pct: avg=1.66 min=1.16 max=2.16
  load_avg node=0: avg=0.31 min=0.23 max=0.39
  load_avg node=0 dom=0: avg=0.31 min=0.23 max=0.39
  processing_duration_us: avg=297.50 min=296.00 max=299.00

Screencast from 2024-06-23 16-52-37.webm

jfernandez commented 1 week ago

The failure in CI appears to be unrelated to this change:

23:56:35 [INFO] Histograms:
23:56:35 [INFO]   cpu_busy_pct: avg=NaN min=0.00 max=0.00
23:56:35 [INFO]   processing_duration_us: avg=NaN min=0.00 max=0.00
23:56:35 [INFO] ---
thread 'main' panicked at src/load_balance.rs:689:25:
attempt to subtract with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I will fix the NaN in the print output though.