hubblo-org / scaphandre

⚡ Energy consumption metrology agent. Let "scaph" dive and bring back the metrics that will help you make your systems and applications more sustainable !
Apache License 2.0

Clarification: *process* level power usage stats or *processor* level stats ? #38

Open mrchrisadams opened 3 years ago

mrchrisadams commented 3 years ago

Problem

Hi there, I'm trying to understand what level of detail scaphandre is able to expose if you have it running on a host machine with the necessary RAPL capabilities.

From earlier conversations, I thought that scaphandre was designed to provide power usage stats at a per-process level.

When I refer to processes, I'm talking about a running process, often with a PID, that you might see in output from something like htop, or ps faux. Programs might have multiple processes associated with them, but usually these are represented as a tree like the one below, so we can at least group them together, or tag them:

root         705  0.0  0.0 171024  2008 ?        Ss   Dec01   0:00 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
www-data     707  0.1  0.1 177560 27756 ?        S    Dec01  69:34  \_ nginx: worker process
www-data     708  0.0  0.1 179132 29200 ?        S    Dec01   3:56  \_ nginx: worker process
www-data     709  3.5  0.1 181008 31204 ?        R    Dec01 1425:20  \_ nginx: worker process
www-data     711  0.0  0.1 177776 27764 ?        S    Dec01   1:39  \_ nginx: worker process
www-data     712  0.0  0.0 171256  7740 ?        S    Dec01   7:39  \_ nginx: cache manager process

I haven't worked with Intel RAPL measurements much, but I always assumed they might give similar information, just at a different level of granularity. The image below, which I've taken from Mozilla's archived power profiling overview page, makes me think that RAPL readings might show power at something like a per-core level, or for a group of cores (i.e. a package), but not at a per-process level.

Different machines might provide separate figures, like GPU or DRAM, in different groupings, as the picture below suggests:

[Image: diagram from Mozilla's power profiling overview showing RAPL power domains (package, cores, GPU, DRAM)]

Firefox's own dev tools have a CLI tool, mach power, to provide some figures from RAPL sensors as well as some other stats to group processes together.

It literally reads the output of the RAPL sensors, and on a Mac it reads the output from powermetrics, to give output like the below:

    total W = _pkg_ (cores + _gpu_ + other) + _ram_ W
#01 17.14 W = 14.98 ( 5.50 +  1.19 +  8.29) +  2.16 W

1 sample taken over a period of 30.000 seconds

Name                               ID     CPU ms/s  User%  Deadlines (<2 ms, 2-5 ms)  Wakeups (Intr, Pkg idle)  GPU ms/s
com.google.Chrome                  500    439.64                                      585.35  218.62            19.17
  Google Chrome Helper             67319  284.75    83.03  296.67  0.00               454.05  172.74            0.00
  Google Chrome Helper             67304  55.23     64.83  0.03    0.00               9.43    4.33              19.17
  Google Chrome                    67301  63.77     68.09  29.46   0.13               76.11   22.26             0.00
  Google Chrome Helper             67320  38.30     66.70  17.83   0.00               45.78   19.29             0.00
com.apple.WindowServer             68     102.58                                      112.36  43.15             80.52
  WindowServer                     141    103.03    58.19  60.48   6.40               112.36  43.15             80.53
com.apple.Safari                   499    267.19                                      110.53  46.05             1.69
  com.apple.WebKit.WebContent      67372  190.15    79.34  2.02    0.14               129.28  53.79             2.33
  com.apple.WebKit.Networking      67292  65.23     52.74  0.07    0.00               4.33    1.40              0.00
  Safari                           67290  29.09     77.65  0.23    0.00               7.13    3.37              0.00
  com.apple.Safari.SearchHelper    67371  13.88     91.18  0.00    0.00               0.36    0.05              0.00
  com.apple.WebKit.WebContent      67297  0.81      56.84  0.10    0.00               2.20    1.30              0.00
  com.apple.WebKit.WebContent      67293  0.46      76.40  0.03    0.00               0.57    0.20              0.00
  com.apple.WebKit.WebContent      67295  0.24      67.72  0.00    0.00               0.90    0.37              0.00
  com.apple.WebKit.WebContent      67298  0.17      59.88  0.00    0.00               0.50    0.13              0.00
  com.apple.WebKit.WebContent      67296  0.07      43.51  0.00    0.00               0.10    0.03              0.00
kernel_coalition                   1      111.76                                      724.80  213.09            0.12
  kernel_task                      0      107.06    0.00   5.86    0.00               724.46  212.99            0.12
org.mozilla.firefox                498    92.17                                       212.69  75.67             1.81
  firefox                          63865  61.00     87.18  1.00    0.87               25.79   9.00              1.81
  plugin-container                 67269  31.49     72.46  1.80    0.00               186.90  66.68             0.00
  com.apple.WebKit.Plugin.64       67373  55.55     74.38  0.74    0.00               9.51    3.13              0.02
com.apple.Terminal                 109    6.22                                        0.40    0.23              0.00
  Terminal                         208    6.25      92.99  0.00    0.00               0.33    0.20              0.00 

I can guess how scaphandre might expose stats to VMs or microVMs using the --qemu option, by exposing a readable file covering just the resources allocated to a VM. I think this is what is happening at lines 18-33 below:

https://github.com/hubblo-org/scaphandre/blob/main/src/sensors/powercap_rapl.rs#L18-L33

And based on this, I can understand how a provider of cloud services might be able to expose this to customers paying for the cloud resources available to their own VM, as a kind of "scaphandre inside" differentiator, competing on transparency, as this is something larger providers like Amazon, Google and Microsoft have previously been less keen to disclose.

But I don't think I understand how scaphandre would be able to provide figures at a per-process level, so I could, say, see how much power all my nginx processes were using on a server, and get an idea of how much power running nginx across a bunch of machines might use in total.

Is this possible?

My best guess at how I should use this to get meaningful figures for providing a service on a machine.

The only way I can imagine doing that would be to run nginx in its own VM, Docker container, micro-VM or some other way of allocating resources, and assume that all the power used by it is something I can tag as belonging to nginx.

I'm happy to contribute some suggested usage docs, or add to an FAQ, but I think I'll need some help with the answers until I'm fluent enough in Rust, RAPL and the other moving parts myself.

bpetit commented 3 years ago

Hi, it is possible, by combining the process stats from /proc/stat and /proc/<PID>/stat with the measurements at the CPU socket level. You then need the ratio of the time the CPU socket spent working for that process to the time the socket spent working overall (for anything). You multiply this ratio by the power consumption of the socket. This is how process-level consumption is computed in the Prometheus exporter.
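Roughly, the math looks like this (a simplified sketch of the idea, not the actual exporter code):

```rust
/// Simplified sketch: attribute to a process the share of socket power that
/// matches its share of CPU time (jiffies) over the same measurement interval.
fn process_power_microwatts(
    socket_power_microwatts: f64,
    process_jiffies_delta: u64, // jiffies used by the process between two samples
    socket_jiffies_delta: u64,  // jiffies used by the whole socket between the same samples
) -> f64 {
    if socket_jiffies_delta == 0 {
        return 0.0;
    }
    socket_power_microwatts * (process_jiffies_delta as f64 / socket_jiffies_delta as f64)
}

fn main() {
    // e.g. a 10 W socket where the process accounts for 25 % of the busy CPU time => 2.5 W
    println!("{} µW", process_power_microwatts(10_000_000.0, 250, 1_000));
}
```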

Then, thanks to PromQL, you can filter processes by executable or command line arguments, to isolate the consumption of a piece of software that runs multiple processes. Examples are here: https://metrics.hubblo.org

I'll explain that in more detail in the documentation, as I'm working on a refresh. (If you have time to give me some feedback it would be awesome: https://hubblo-org.github.io/scaphandre/)

It would be great to propose more advanced, embedded filtering features in scaphandre, to be able to isolate the consumption of a group of related processes even when you don't have access to an advanced TSDB like Prometheus. I'd happily discuss that.
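Purely as an illustration of the kind of thing such a built-in filter could do (this is not an existing scaphandre feature, just a sketch of the idea), summing the power attributed to every process whose command line matches a pattern might look like this:

```rust
/// Illustrative sketch: given per-process power figures, sum the power of
/// every process whose command line matches a pattern.
fn power_for_matching_processes(processes: &[(String, f64)], pattern: &str) -> f64 {
    processes
        .iter()
        .filter(|(cmdline, _)| cmdline.contains(pattern))
        .map(|(_, power)| *power)
        .sum()
}

fn main() {
    let processes = vec![
        ("nginx: master process /usr/sbin/nginx".to_string(), 120_000.0),
        ("nginx: worker process".to_string(), 740_000.0),
        ("postgres: checkpointer".to_string(), 310_000.0),
    ];
    // Sums the two nginx entries: 860,000 µW, i.e. 0.86 W
    println!("nginx total: {} µW", power_for_matching_processes(&processes, "nginx"));
}
```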

I'll let you know when there is a clean explanation of this in the "explanations" section.

mrchrisadams commented 3 years ago

OK, that's pretty smart.

Let me see if I can understand it, after reading a bit more about /proc/stat and Rust's procfs crate.

So this bit here, where the hubblo dashboards are showing that python3 inside awx is using 0.739W:

[Screenshot: the hubblo demo dashboard showing the python3 process inside awx using 0.739 W]

You're doing this by keeping a running register of the CPU time spent on each process, alongside the power measurements for the whole socket over the same period.

From this, I think I now have an idea of how you might work out how much power is being allocated to that python3 process over a set period of time, as a percentage of the total power being used by that CPU.

I couldn't see any process id in the output from /proc/stat, but this code here makes me think you are able to get the process id for that python3 process another way.

Once you have a list of processes and their ids, I can see how get_process_power_consumption_microwatts might let you add up the time spent on each one.

And this code here, inside refresh_procs, is how I'm guessing you get this list of process ids:

```rust
/// Gets currently running processes (as procfs::Process instances) and stores
/// them in self.proc_tracker
fn refresh_procs(&mut self) {
    //! current_procs is the up to date list of processus running on the host
    let current_procs = process::all_processes().unwrap();

    for p in current_procs {
        let pid = p.pid;
        let res = self.proc_tracker.add_process_record(p);
        match res {
            Ok(_) => {}
            Err(msg) => panic!("Failed to track process with pid {} !\nGot: {}", pid, msg),
        }
    }
}
```

How do you identify the process being worked on at any given moment? I get that inside /proc/ you have a bunch of processes identified numerically, like /proc/3015616/ and /proc/306/ and so on, but the link between this, and the CPU 'working on' that process isn't clear to me yet.

I'm happy to draw up a bunch of diagrams to explain it and contribute it to the docs, once I understand this last bit.

mrchrisadams commented 3 years ago

Oh hang on, I think I'm looking at it the wrong way - after a bit more googling, and reading the code, I'm assuming each process listed in process::all_processes() has some information listing how many jiffies (Linux's unit of process time, I think) it has been using.

If you know how many jiffies have been spent on a process over a set amount of time, and you know how many jiffies an individual CPU has worked in total over the same time, you could get an idea of how much of a CPU's time is spent on that process, right?

This blog post was useful when reading up how procfs listed running processes, and what info is provided per process: https://www.anshulpatel.in/post/linux_cpu_percentage/

Without checking against the power sensors you can still get an idea of total CPU usage as a percentage, but the RAPL sensor lets you convert that into an absolute figure, in watts.
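To check I've understood the mechanics, here's a rough sketch of the bookkeeping I have in mind, reading /proc directly with the standard library rather than the procfs crate (the paths and field positions are my assumptions from the proc(5) man page, not scaphandre's actual code; in practice you'd take the difference between two samples of each figure):

```rust
use std::fs;

/// Jiffies the CPUs have spent doing work (everything except idle and iowait),
/// summed from the aggregated "cpu" line of /proc/stat.
fn busy_jiffies() -> u64 {
    let stat = fs::read_to_string("/proc/stat").expect("failed to read /proc/stat");
    let cpu_line = stat.lines().next().expect("/proc/stat is empty");
    // Fields after the "cpu" label: user nice system idle iowait irq softirq steal ...
    let values: Vec<u64> = cpu_line
        .split_whitespace()
        .skip(1)
        .filter_map(|v| v.parse().ok())
        .collect();
    values.iter().sum::<u64>() - values[3] - values[4]
}

/// Jiffies one process has used so far: utime + stime, fields 14 and 15 of /proc/<pid>/stat.
fn process_jiffies(pid: u32) -> u64 {
    let stat = fs::read_to_string(format!("/proc/{}/stat", pid)).expect("failed to read process stat");
    // The command name (field 2) is wrapped in parentheses and can contain spaces,
    // so parse from just after the closing parenthesis.
    let after_comm = &stat[stat.rfind(')').unwrap() + 1..];
    let fields: Vec<&str> = after_comm.split_whitespace().collect();
    let utime: u64 = fields[11].parse().unwrap(); // field 14 overall
    let stime: u64 = fields[12].parse().unwrap(); // field 15 overall
    utime + stime
}

fn main() {
    let pid = std::process::id();
    println!("busy jiffies so far: {}", busy_jiffies());
    println!("jiffies used by pid {}: {}", pid, process_jiffies(pid));
}
```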

Is that closer to the truth?

bpetit commented 3 years ago

You got it right. /proc/stat contains the statistics of the CPU usage. /proc/<PID>/stat contains the CPU usage statistics for a given PID. The procfs crate gives access to those metrics. process::all_processes() does return the list of running processes, with their statistics, iirc.

So what scaphandre does, in the current implementation of the powercap_rapl sensor, is collect those statistics every time the exporter requires them and store them in a buffer (you can look at the CPUStat structure in the code), for the Topology (the whole CPU package), the CPUSockets and the Processes. It does the same for the RAPL-based power consumption measurements (Record in the code), again for both the CPU-related entities and the processes. The exporter can then look at both the CPU stats buffers and the power measurement buffers and do the math to get the power consumption.
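To make the "do the math" part a bit more concrete, here is a simplified sketch of the power side (the struct and function names below are illustrative, not the actual Record/CPUStat types from the code): RAPL exposes a cumulative energy counter, so the power over an interval is the energy delta between two buffered samples divided by the elapsed time, and the exporter then multiplies that by the jiffies ratio of each process over the same interval.

```rust
use std::time::Duration;

/// Illustrative stand-in for a buffered power measurement
/// (not the actual Record struct from the code).
struct EnergySample {
    microjoules: u64,    // cumulative energy counter read via powercap/RAPL
    timestamp: Duration, // when the sample was taken
}

/// Average socket power between two samples, in microwatts.
fn socket_power_microwatts(previous: &EnergySample, current: &EnergySample) -> f64 {
    let energy_delta = (current.microjoules - previous.microjoules) as f64;
    let seconds = (current.timestamp - previous.timestamp).as_secs_f64();
    energy_delta / seconds
}

fn main() {
    let before = EnergySample { microjoules: 10_000_000, timestamp: Duration::from_secs(0) };
    let after = EnergySample { microjoules: 25_000_000, timestamp: Duration::from_secs(1) };
    // 15 J over 1 s => 15 W => 15,000,000 µW
    println!("{} µW", socket_power_microwatts(&before, &after));
}
```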

mrchrisadams commented 3 years ago

OK, I've made a few diagrams to help check my understanding of this and represent some of the key ideas.

I think after some more polish, and once we've added the correct names of the domain objects above, they might be helpful for others to get their head around the key concepts.

If you can add some comments or check that the general ideas are correct here, I can turn this into a page for the new docs - maybe in the explanation section.

The link below should be visible, and if I know the address to invite, I can grant access so these can be edited by others, as well as export the PNGs to add to any docs.

I've used Figma as it's cross-platform, works in modern browsers so is easy for sharing, and can export as SVG, PNG and so on.

https://www.figma.com/file/MnmCxOZlUWgUiFD0JDN692/understanding-scaphandre?node-id=0%3A1

If you don't get to this before the end of 2020, happy new year :)

bpetit commented 3 years ago

What you have done already to explain the principles is pretty amazing, thanks a lot for sharing it! This is really good work.

I'm trying to figure out the best way to articulate those pieces of explanation in the documentation. I think it's great that you started at the simplest level, explaining a bit about process scheduling. What I'm not sure about is how many "levels" of explanation we should go through before explaining in detail how scaph computes the metrics.

There are also several topics covered here that may need some clarification. Explaining virtualization in the same discussion as scheduling seems a bit tricky (even if you have succeeded in getting the big ideas across pretty clearly so far).

Maybe an "introduction" to explain the big principles (matching more or less the first 4 diagrams you have made), then one "real world" section to explain how it's more complicated in practice with multiple sockets/cores + hyperthreading (diag 5 may help here), then a "virtualization" section (matching diagrams 6,7,8,9) plus maybe a "cloud" one to talk about the opacity problem and then a last section to explain the solution scaph can bring to the table and how ?

As often, I'm thinking out loud as I write, so feel free to tell me if my thoughts and questions are not clear. I'd love to hear what you think about it, how you'd see such explanations, and what you think is important that I didn't mention.

mrchrisadams commented 3 years ago

Glad you like it :)

Hmm… now that I think it over, I think the key idea that attracted me to Scaphandre was being able to get process-level stats for power usage and, by extension, to understand the carbon emissions from compute in a way that technologists can act on directly.

This is a level of detail I didn't know was possible, but had been looking for, for a couple of years at least - I think communicating that is the most important idea.

I also agree that once you've got your head around that, you could move the rest into separate sections to describe common deployment/usage scenarios, and how to use the different features of Scaphandre to address the problems you typically run into.

I won't be able to get to this today, but in the coming days I can have a go at a first draft of the initial 'key concepts' section, as markdown.

Can you share a link to the directory you'd prefer I add the markdown and images to?

bpetit commented 3 years ago

> Hmm… now that I think it over, I think the key idea that attracted me to Scaphandre was being able to get process-level stats for power usage and, by extension, to understand the carbon emissions from compute in a way that technologists can act on directly.

We have kind of the same idea about that :)

> This is a level of detail I didn't know was possible, but had been looking for, for a couple of years at least - I think communicating that is the most important idea.

I agree, there is a lot of communication work to do on top of that to make it impactful. Your help on this is very much appreciated.

> I won't be able to get to this today, but in the coming days I can have a go at a first draft of the initial 'key concepts' section, as markdown.

No worries, you choose if and when you want to engage with the project, of course. Thanks for saying that you are interested in contributing, it means a lot.

> Can you share a link to the directory you'd prefer I add the markdown and images to?

I'm currently working on a branch for the new documentation basis: https://github.com/hubblo-org/scaphandre/tree/feature/%2333-Bootstrap-an-accurate-documentation-structure. Opening PRs against that branch would be very practical for me, to integrate your work. The folder to work in is docs_src, and here is a short procedure to build the docs with your changes and check that they render as you imagined:

```sh
# install mdbook
cargo install mdbook
# move into the scaphandre repo (on the same branch I mentioned; a git pull might be needed first)
cd scaphandre && git checkout feature/#33-Bootstrap-an-accurate-documentation-structure
# launch the development server of mdbook, then every change will be visible at http://localhost:3000 in your browser
mdbook serve
```

If you need more information about mdbook, here it is.

kheersagar commented 1 year ago

Hey, what do you think about the cache line? Does it play an essential role in power consumption? If yes, how is it taken into account here?