powa-team / powa-web

PoWA user interface
http://powa.readthedocs.io/

Sum of CPU usage may exceed 100% #164

Closed pgiraud closed 1 year ago

pgiraud commented 1 year ago

The chart for the CPU time repartition looks a bit weird.

Screenshot from 2022-11-23 21-27-14

In the /server/0/metrics/database/powa/query/{queryid}/ response:

    {
      "ts": 1669234819.7106,
      "rows": 82.66141215623394,
      "calls": 0.7999491498990381,
      "hit_ratio": 99.98556998556998,
      "shared_blks_read": 546.0986196644,
      "shared_blks_hit": 3783917.335654697,
      "shared_blks_dirtied": 0.0,
      "shared_blks_written": 0.0,
      "local_blks_read": 0.0,
      "local_blks_hit": 0.0,
      "local_blks_dirtied": 0.0,
      "local_blks_written": 0.0,
      "temp_blks_read": 0.0,
      "temp_blks_written": 0.0,
      "blk_read_time": 0.0,
      "blk_write_time": 0.0,
      "avg_runtime": 19.8117000000002,
      "avg_plantime": 0.0,
      "wal_records": 0.0,
      "wal_fpi": 0.0,
      "wal_bytes": 0.0,
      "reads": 0.0,
      "writes": 0.0,
      "minflts": 552.131569503232,
      "majflts": 0.0,
      "nvcsws": 2.1331977330641014,
      "nivcsws": 1.5998982997980762,
      "user_time": 92.38606480009229,
      "system_time": 23.44027350841523,
      "other_time": 0.0,
      "disk_hit_ratio": 0.0,
      "sys_hit_ratio": 0.01443001443001443
    }
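In this sample the issue is already visible in the raw numbers: `user_time` and `system_time` are expressed as percentages of wall-clock time, and their sum exceeds 100. A minimal sketch (the field names come from the response above; the summation itself is just an illustration, not powa-web code):

```python
# Illustrative check using the two CPU-time fields from the sample response.
# Both values are percentages of elapsed wall-clock time for the interval.
sample = {
    "user_time": 92.38606480009229,
    "system_time": 23.44027350841523,
}

total_cpu = sample["user_time"] + sample["system_time"]
print(f"total CPU: {total_cpu:.2f}%")  # ~115.83%, i.e. above 100%
```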
rjuju commented 1 year ago

I think this is somewhat expected, unfortunately. This graph is based on the getrusage() system call, and the clock precision isn't infinite, so there is always some time where CPU usage is attributed to the wrong query (and, of course, some CPU usage is not added to the correct one). We have a mechanism to try to limit that effect (see https://github.com/powa-team/pg_stat_kcache/blob/master/pg_stat_kcache.c#L392-L400), but a bit of drift is still possible.
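The idea behind that kind of limiting can be sketched as capping the accumulated CPU time at the elapsed wall-clock time. This is a hypothetical Python sketch of the general technique, not the actual C logic in pg_stat_kcache (the function name and scaling approach are assumptions):

```python
def clamp_cpu_times(user_time, system_time, elapsed):
    """Hypothetical sketch: if clock imprecision makes user + system CPU
    time exceed the wall-clock time spent in the query, scale both down
    proportionally so their sum never exceeds the elapsed time."""
    total = user_time + system_time
    if total > elapsed and total > 0:
        scale = elapsed / total
        user_time *= scale
        system_time *= scale
    return user_time, system_time

# Example: 0.9s user + 0.3s system reported against 1.0s elapsed
# gets scaled so the sum is exactly the elapsed time.
u, s = clamp_cpu_times(0.9, 0.3, 1.0)
print(u + s)  # 1.0
```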

Note also that you can get way more than 100% CPU with parallelism. Since you have multiple processes executing the same query, you can accumulate several times the amount of CPU resources, and therefore end up with 200% or 300% CPU if you get 1 or 2 parallel workers and the query's bottleneck is the CPU.
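The arithmetic of that scenario can be made explicit with a hypothetical example (the numbers are assumed, not taken from the issue):

```python
# Hypothetical example: a fully CPU-bound parallel query. The leader and
# each parallel worker can each burn up to 1 second of CPU per second of
# wall-clock time, so the per-query CPU percentage scales with the number
# of processes.
def max_cpu_percent(parallel_workers):
    processes = 1 + parallel_workers  # leader + workers
    return processes * 100

print(max_cpu_percent(0))  # 100
print(max_cpu_percent(1))  # 200
print(max_cpu_percent(2))  # 300
```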

pgiraud commented 1 year ago

Thanks for the explanation.