hubblo-org / scaphandre

⚡ Energy consumption metrology agent. Let "scaph" dive and bring back the metrics that will help you make your systems and applications more sustainable !
Apache License 2.0

Integrate power consumption estimation on public cloud virtual machines as a degraded mode #25

Open bpetit opened 3 years ago

bpetit commented 3 years ago

Problem

Until cloud providers install scaphandre on their hypervisors, we should enable cloud customers to estimate their power usage and thus their climate impact.

Solution

Integrate statistical models like https://github.com/etsy/cloud-jewels for GCP (look for other models) as sensors, so that scaphandre can be used on cloud providers even if they haven't (yet) implemented a scaphandre-like solution at the hypervisor layer.

Alternatives

Implement a ratio-based approach like powertop (CPU time consumed by a process / CPU time consumed globally) in this context (VM). Mix it with the provider's public information about the hypervisor hosts' hardware?
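A minimal sketch of that ratio, assuming we already have some estimate of the power drawn by the VM as a whole (`vm_power_watts` is a hypothetical input here, since a guest cannot read it directly). The function name and parameters are illustrative only, not part of scaphandre:

```python
def attribute_power(process_cpu_time: float,
                    total_cpu_time: float,
                    vm_power_watts: float) -> float:
    """Attribute a share of the VM's estimated power to one process.

    The share is the ratio described above:
    CPU time consumed by the process / CPU time consumed globally.
    """
    if total_cpu_time <= 0:
        # Nothing ran during the sampling window, so nothing to attribute.
        return 0.0
    return vm_power_watts * (process_cpu_time / total_cpu_time)


# Hypothetical sample: the process used 2 s of CPU out of 8 s consumed
# globally, on a VM estimated to draw 40 W.
print(attribute_power(2.0, 8.0, 40.0))  # 10.0
```

The hard part is the `vm_power_watts` input itself, which is what the rest of this thread is about.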

Additional context

This is more than a feature request; it is a starting point for a wider study and discussion.

mrchrisadams commented 3 years ago

@bpetit did you imagine this as a kind of synthetic, 'modelled' sensor, that provides numbers based on the detected number of CPUs, RAM and so on?

mrchrisadams commented 3 years ago

There's some work likely to be released by the German Ministry of the Environment in December that will provide some numbers that could inform this, as it's based on actual recorded energy usage figures for a known set of machines:

https://www.umweltbundesamt.de/en/press/pressinformation/video-streaming-data-transmission-technology

However, I haven't seen the underlying data yet.

That said, it may be worth speaking to the folks at https://datacenterlight.ch - I know they pretty much see openness and transparency as one of their differentiators, and they might be able to share some numbers for a basic version of this.

bpetit commented 3 years ago

Hi,

First, thanks for getting involved in this project!

> @bpetit did you imagine this as a kind of synthetic, 'modelled' sensor, that provides numbers based on the detected number of CPUs, RAM and so on?

I think so. I imagine a sensor that gathers metrics about resource consumption on the machine, plus data/characteristics about bare metal machines running the VM, that the cloud provider accepts to disclose.

> There's some work likely to be released by the German Ministry of the Environment in December that will provide some numbers that could inform this, as it's based on actual recorded energy usage figures for a known set of machines

I'll have a look at that, thanks !

> That said, it may be worth speaking to the folks at https://datacenterlight.ch - I know they pretty much see openness and transparency as one of their differentiators, and they might be able to share some numbers for a basic version of this.

I'm definitely interested in talking with people from ungleich/datacenterlight to imagine a proof of concept. (I love their work <3) If you know someone there we could talk to, I'd be very interested!

bpetit commented 3 years ago

I'm investigating the feasibility of building a centralized database of processor models per cloud provider, which would both feed scaphandre and benefit from it. If you have access to an instance on any public cloud provider, could you please post the content of /proc/cpuinfo in this thread, along with the name of the cloud provider and the model of the instance? I'd like to check that this file contains enough data for most cloud providers. @mrchrisadams @florimondmanca @uggla @jdrouet

florimondmanca commented 3 years ago

@bpetit Here's what I've got on my personal VM:

$ cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 85
model name  : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
stepping    : 4
microcode   : 0x1
cpu MHz     : 2294.608
cache size  : 25344 KB
physical id : 0
siblings    : 1
core id     : 0
cpu cores   : 1
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : ... REDACTED
bugs        : ... REDACTED
bogomips    : 4589.21
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
jdrouet commented 3 years ago

@bpetit my cloud provider is Ikoula

cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6134 CPU @ 3.20GHz
stepping        : 4
microcode       : 0x2000043
cpu MHz         : 3200.078
cache size      : 25344 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush acpi mmx fxsr sse sse2 syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti intel_ppin ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt clwb xsaveopt xsavec xgetbv1 xsaves pku ospke
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips        : 6400.06
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
florimondmanca commented 3 years ago

Ikoula

https://www.ikoula.com/fr/cloud-public

Oh, their Cloud offerings seem super nice. VM Micro has basically the same specs and price as my DO droplet. Except Ikoula is French, has servers in the EU, is low-carbon, etc… I've been looking for a bog-standard public Cloud like this at a competitive price for a while! Sweet. :-)

pydubreucq commented 3 years ago

Hi, I'm VP Bare Metal at Scaleway

I'll give you cpuinfo of our Bare Metal offers

First is our Ultimate Performance Range:

Offer: UP-BM2-XL CPU: 8 x Intel® Xeon® Platinum 8280 - 224 cores / 448 threads / 2.7 GHz

scaleway-cpuinfo-up-bm2-xl.txt

Offer: UP-BM2-M CPU: 2 x AMD EPYC™ 7742 - Zen 2 - 128 cores / 256 threads / 2.25 GHz

scaleway-cpuinfo-up-bm2-m.txt

pydubreucq commented 3 years ago

The second one is Bare Metal General Purpose Range

Offer: GP-BM1-S & GP-BM1-M (same CPU) CPU: 1 × Intel® Xeon E3 1240v6 - 4 cores / 8 threads / 3.7 GHz

scaleway-cpuinfo-gp-bm1-s_and_m.txt

Offer: GP-BM1-L CPU: 1 × AMD EPYC 7281 - 16 cores / 32 threads / 2.1 GHz

scaleway-cpuinfo-gp-bm1-l.txt

bpetit commented 3 years ago

Thanks a lot @jdrouet and @florimondmanca for the data. This seems to confirm that, most of the time, on most cloud providers, an instance's cpuinfo contains enough data to guess the CPU model of the bare metal hypervisor. This is great for moving forward on that feature.
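As an illustration of what "enough data" could mean in practice, here is a small sketch that extracts the `model name` field from /proc/cpuinfo-style text and counts logical processors per model. The function name is hypothetical and this is not scaphandre code:

```python
from collections import Counter


def cpu_models(cpuinfo_text: str) -> Counter:
    """Count logical processors per 'model name' in /proc/cpuinfo-style text."""
    models = Counter()
    for line in cpuinfo_text.splitlines():
        # Each entry looks like "key<padding>: value"; split on the first colon.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            models[value.strip()] += 1
    return models


# Shortened sample in the format posted in this thread.
SAMPLE = """processor   : 0
model name  : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
processor   : 1
model name  : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
"""

print(cpu_models(SAMPLE))
```

On a Linux instance the same function can be fed the real file, for example `cpu_models(open("/proc/cpuinfo").read())`. Note that some providers expose a generic name rather than an exact model (one sample in this thread reports "Intel Core Processor (Haswell, no TSX)"), so the lookup would need a fallback for those cases.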

Hi, and thanks a lot @pydubreucq for the data about Scaleway's bare metal machines. Even if this thread is more about estimating consumption from inside a virtual machine without access to the bare metal, this is valuable data. Would you by any chance have the same data for the machines that run Scaleway's cloud instances?

pydubreucq commented 3 years ago

> Integrate power consumption estimation on public cloud virtual machines as a degraded mode

I missed something :) Sorry for the noise :)

I'll try to get these info for Instances.

pydubreucq commented 3 years ago

Scaleway DEV Range

DEV1-S scaleway-instance-dev1-s.txt

DEV1-M scaleway-instance-dev1-m.txt

DEV1-L scaleway-instance-dev1-l.txt

DEV1-XL scaleway-instance-dev1-xl.txt

pydubreucq commented 3 years ago

Scaleway GP Range (part 1)

GP1-XS scaleway-instance-gp1-xs.txt

GP1-S scaleway-instance-gp1-s.txt

uggla commented 3 years ago

Cloud provider: AWS. Flavor: t2.micro (1 vCPU, 1 GB). Please let me know if you want other flavors.

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 63
model name  : Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
stepping    : 2
microcode   : 0x43
cpu MHz     : 2400.123
cache size  : 30720 KB
physical id : 0
siblings    : 1
core id     : 0
cpu cores   : 1
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips    : 4800.13
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
uggla commented 3 years ago

Cloud provider: Azure. Flavor: Standard D2s v3 (2 vCPUs, 8 GiB RAM). Please let me know if you want other flavors.

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 85
model name  : Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
stepping    : 7
microcode   : 0xffffffff
cpu MHz     : 2593.906
cache size  : 36608 KB
physical id : 0
siblings    : 2
core id     : 0
cpu cores   : 1
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 21
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves md_clear
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips    : 5187.81
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 85
model name  : Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
stepping    : 7
microcode   : 0xffffffff
cpu MHz     : 2593.906
cache size  : 36608 KB
physical id : 0
siblings    : 2
core id     : 0
cpu cores   : 1
apicid      : 1
initial apicid  : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 21
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves md_clear
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips    : 5187.81
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
uggla commented 3 years ago

Cloud provider: OVH. Flavor: B2-7 (7 GB RAM, 2 vCores at 2.3 GHz, 50 GB SSD, 250 Mbit/s). Please let me know if you want other flavors.

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 60
model name  : Intel Core Processor (Haswell, no TSX)
stepping    : 1
microcode   : 0x1
cpu MHz     : 3099.986
cache size  : 16384 KB
physical id : 0
siblings    : 1
core id     : 0
cpu cores   : 1
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear
vmx flags   : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid pml
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips    : 6199.97
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 60
model name  : Intel Core Processor (Haswell, no TSX)
stepping    : 1
microcode   : 0x1
cpu MHz     : 3099.986
cache size  : 16384 KB
physical id : 1
siblings    : 1
core id     : 0
cpu cores   : 1
apicid      : 1
initial apicid  : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear
vmx flags   : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid pml
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips    : 6199.97
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
mrchrisadams commented 3 years ago

Hi @bpetit,

I'm running a CX41 in the Helsinki datacentre with Hetzner. It has 4 vCPUs, 16 GB of RAM, and 160 GB of disk. I've also got a 1 TB block storage device attached to it. See the attached output. You can see more stats on their page for their cloud servers.

You said something interesting here:

> This seems to confirm that, most of the time, on most cloud providers, an instance's cpuinfo contains enough data to guess the CPU model of the bare metal hypervisor. This is great for moving forward on that feature.

Would you elaborate a bit more here?

hetzner.helsinki.txt

bpetit commented 3 years ago

Thanks a lot @pydubreucq, @uggla and @mrchrisadams.

> Would you elaborate a bit more here?

What I imagine so far is feeding a database of some sort with the following data and the relationships between them:

  1. cloud provider
  2. instance type
  3. underlying hardware
  4. power consumption profile/characteristics of the hardware

Based on the data you shared from your providers, items 1, 2 and 3 could easily be retrieved automatically by scaphandre.

Item 4 needs more work, but partial solutions already exist in data made publicly available by vendors. There are also some initiatives working on it, such as one from our friends at the boavizta project (sorry, the post is in French).

With such a database, using both those data and the resources consumption (cpu/ram/io/...) from the virtual machines, we could get to some estimations of the power consumed (4. is really critical here, as power efficiency may vary a lot from one cpu model to another).
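To make the four items and their relationships more concrete, here is one possible shape for such records. Everything in this sketch is an assumption for discussion: the type names, the field names, and above all the wattage figures, which are placeholders and not measurements. Only the provider, flavor and CPU model string come from a sample posted in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PowerProfile:
    """Item 4: power characteristics of a piece of hardware."""
    cpu_model: str
    idle_watts: float
    max_watts: float


@dataclass(frozen=True)
class InstanceRecord:
    """Items 1 to 3: provider, instance type and underlying hardware."""
    provider: str
    instance_type: str
    cpu_model: str


XEON_E5_2676 = "Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz"

INSTANCES = [InstanceRecord("AWS", "t2.micro", XEON_E5_2676)]

# PLACEHOLDER numbers, for illustration only (not measured values).
PROFILES = {XEON_E5_2676: PowerProfile(XEON_E5_2676, idle_watts=50.0, max_watts=120.0)}


def find_profile(provider: str, instance_type: str) -> Optional[PowerProfile]:
    """Follow the relationship provider/instance type -> hardware -> profile."""
    for record in INSTANCES:
        if (record.provider, record.instance_type) == (provider, instance_type):
            return PROFILES.get(record.cpu_model)
    return None


print(find_profile("AWS", "t2.micro"))
```

Keying the power profile on the hardware rather than on the instance type means one profile can be shared by every instance type that runs on the same CPU model.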

Those are just abstract ideas for now, I'd really enjoy having your thoughts on this and even more to work to refine those ideas together :)

uggla commented 3 years ago

Hello @bpetit,

Would it be possible to have a wrap-up on this topic? Can we draft the steps to implement this?

bpetit commented 3 years ago

I've made a quick drawing of what I imagine. We could think about the steps from there.

[Image: cloud_estimation]

To describe it a little: the structure of the data (imagined on the right of the drawing) is a key point here, but I'll keep a macro view to imagine the concepts first. I think it would be interesting to keep that data in a VCS or equivalent, to allow contributions and reviews. Therefore I imagine a local DB, embedded in scaphandre, that would be a serialized/binary version of the structured/collaborative database or repository.

This is very macro and not clear, but it may enable first discussions on the topic.

uggla commented 3 years ago

@bpetit, thanks for this first draft. Here are some notes about it.

bpetit commented 3 years ago

> Regarding the local DB for scaphandre. I think a simple JSON file containing the data will be enough. Keeping all this as text (not binary) will also help to manage it in a VCS. I don't think there will be a huge amount of data or performance reasons that will force us to use a binary file (I could be wrong).

I get your point, but I don't think embedding it locally as a binary in scaphandre would be an issue for versioning: the main version of the data could live in a VCS and be the source of that tiny local DB (we could imagine building a new release of that binary DB every time there are important improvements in the centralized data repository). I think it's interesting to have the data in binary form, as it will still allow scaphandre to be used as a single binary.

> I think we should also think about people whose systems will not be able to connect to the online repo, by providing a mechanism to use a private repo and/or to disable the data update from a remote repo (manual injection of the data).

This is what I meant by having a central repository which holds the real data, and "snapshots" of that data as a local database embedded in scaphandre. This way there is no remote communication and no risk of failing to fetch the data. We just need to inform users of the snapshot version their version of scaphandre is using, so that they can update if needed.
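A rough sketch of the snapshot idea, assuming the text form kept in the VCS is JSON and that each snapshot carries a version string that can be shown to the user. The key names (`snapshot_version`, `profiles`) and all values are hypothetical. In scaphandre itself the snapshot could be compiled into the binary (Rust's `include_str!` or `include_bytes!` macros are one option); a Python string literal stands in for that here:

```python
import json

# Stand-in for a snapshot embedded at build time.
# All keys and values are hypothetical placeholders.
SNAPSHOT_JSON = """
{
  "snapshot_version": "2021-01-01",
  "profiles": {
    "example-cpu-model": {"idle_watts": 10.0, "max_watts": 100.0}
  }
}
"""


def load_snapshot(text: str) -> dict:
    """Parse a snapshot and check that the expected top-level keys exist."""
    data = json.loads(text)
    for key in ("snapshot_version", "profiles"):
        if key not in data:
            raise ValueError(f"snapshot is missing required key: {key}")
    return data


snapshot = load_snapshot(SNAPSHOT_JSON)
# Tell the user which snapshot this build relies on, so they can update.
print(f"using power-profile snapshot {snapshot['snapshot_version']}")
```

With this split, the JSON stays reviewable in the central repository while the released binary needs no network access, which also covers the offline and private-repo cases raised above.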

> I'm interested in the way you calculate the consumption and specifically the benchmark power consumptions. Are there data available from the boavizta project? I looked at the project pages, but I was not able to find any data yet.

There are some data on spec.org. The rest of it is on websites of each server manufacturer. Boavizta has started aggregating metrics here but it wouldn't fit our use case in the current version:

But it can still be an interesting basis, maybe we should join forces on building a more generic and complete database.

There is also the work done by Teads that is of high interest. I'll discuss with the author to see how we could collaborate on building such dataset.

> Regarding your previous post, you explain that we need these data:
>
>   1. cloud provider
>   2. instance type
>   3. underlying hardware
>   4. power consumption profile/characteristics of the hardware
>
> I wonder why we need 1 and 2. We can know the CPU model, so we could deduce the consumption, couldn't we? Do you think there will be differences between providers using the same kind of CPU?

Actually yes. To be accurate regarding the estimation, knowing the CPU model may not be enough, for multiple reasons:

> Do you think we can define a quick and dirty consumption model to experiment with and draft an implementation on that topic? I am thinking about a proportional consumption model per CPU type, where:
>
> PowerCPUtypeX = (CurrentCPUtypeXUsage × MaxPowerCPUtypeX) / (CPUtypeX_nb_of_core × 100)
>
> PowerProcess = (CurrentProcessCPUusage × MaxPowerCPUtypeX) / (CPUtypeX_nb_of_core × 100)

I do think we will need to start small, provide rough estimations, be clear about their lack of accuracy, and then improve. But I think it will require a bit more than just max power consumption to give interesting results. I was thinking of something like idle CPU power consumption, 50% CPU power consumption and 100% CPU power consumption. I also wonder how much the CPU time consumed relative to each core allocation matters for power consumption, but this is one of the potential topics boavizta is working on. So I may have data or interesting models to share at some point (all the work in this group is supposed to be open sourced, but as it is volunteer work from all the members, I cannot say when we could effectively build something of that order).
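To compare the two ideas, here is a small sketch. The first function reads the quoted proportional formula as a product of usage (in percent) and max power, divided by the number of cores times 100. The second is one possible reading of the idle / 50% / 100% suggestion: a piecewise linear interpolation between three measured points. Function names and all wattage values are hypothetical.

```python
def proportional_power(cpu_usage_percent: float,
                       max_power_watts: float,
                       nb_cores: int) -> float:
    """Quoted proportional model: (usage * max power) / (cores * 100)."""
    return (cpu_usage_percent * max_power_watts) / (nb_cores * 100)


def three_point_power(cpu_usage_percent: float,
                      idle_watts: float,
                      half_watts: float,
                      max_watts: float) -> float:
    """Interpolate linearly between the idle (0%), 50% and 100% points."""
    usage = min(max(cpu_usage_percent, 0.0), 100.0)  # clamp to 0..100
    if usage <= 50.0:
        return idle_watts + (half_watts - idle_watts) * (usage / 50.0)
    return half_watts + (max_watts - half_watts) * ((usage - 50.0) / 50.0)


# Hypothetical profile: 20 W idle, 60 W at 50% load, 100 W at 100% load.
print(proportional_power(0.0, 100.0, 4))        # 0.0
print(three_point_power(0.0, 20.0, 60.0, 100.0))   # 20.0
print(three_point_power(75.0, 20.0, 60.0, 100.0))  # 80.0
```

The main practical difference shows up at low load: the proportional model reports zero watts for an idle CPU, while the three-point model keeps the idle floor.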

uggla commented 3 years ago

@bpetit thanks for your answers. I need to read the article from Teads, which looks interesting. Should we have a meeting to discuss ideas on this point with everyone interested?

bpetit commented 3 years ago

I have a meeting next Tuesday with some people interested from boavizta. Would you like to join ?

DarylSaucier commented 3 years ago

Hi,

I just wanted to present a new source of possible error in server consumption estimation (because it was too easy otherwise). A few weeks ago I found this article explaining that hardware consumes power differently depending on how it was manufactured. In fact, this paper shows that "Under the same load, power variation among identical system in the same rack can reach up to 7.8%". I invite you to read these two papers:

They show that random and uncontrollable factors (position of the machine in the rack, cooling, manufacturing...) may have a significant impact on the power consumption of the machine. What I am saying is that the cloud estimation will be based on the activity of emulated physical components. But these components won't consume equally, depending on these factors, so there will always be a relatively significant gap between the consumption estimated in the VM and the real power consumed by the machine...

davidmytton commented 3 years ago

I recently handed over https://github.com/cloud-carbon-footprint/cloud-carbon-coefficients to the Cloud Carbon Footprint project which provides energy consumption coefficients for the various CPU architectures running on AWS, GCP and Azure.

The notebook does the calculations based on the SPECpower database, then groups them by CPU so they can be injected into the main project:

This then calculates the carbon footprint based on real-usage from cloud billing data.
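For readers who have not looked at that project, here is a hedged sketch of how per-CPU coefficients of this kind are typically applied, as I understand the Cloud Carbon Footprint methodology: average watts are interpolated between a minimum and a maximum coefficient according to average utilization, then multiplied by vCPU hours. The function name and the coefficient values below are placeholders, not figures from the repository.

```python
def estimate_energy_kwh(min_watts: float,
                        max_watts: float,
                        avg_utilization: float,
                        vcpu_hours: float) -> float:
    """Estimate compute energy from min/max watt coefficients.

    avg_utilization is a fraction between 0 and 1.
    """
    if not 0.0 <= avg_utilization <= 1.0:
        raise ValueError("avg_utilization must be between 0 and 1")
    avg_watts = min_watts + avg_utilization * (max_watts - min_watts)
    # Watts * hours gives watt-hours; divide by 1000 for kilowatt-hours.
    return avg_watts * vcpu_hours / 1000.0


# Placeholder coefficients: 1 W min and 4 W max per vCPU,
# 50% average utilization over 100 vCPU hours.
print(estimate_energy_kwh(1.0, 4.0, 0.5, 100.0))  # 0.25
```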

bpetit commented 1 year ago

To give a bit of news on this: together with Boavizta, we are about to launch a collaborative database (plus a stress test protocol and an aggregation process) of power consumption profiles per hardware: Energizta

This is more or less the database part of the solution described in this thread. I hope many of you will contribute to this project! :)

bpetit commented 1 year ago

> I recently handed over https://github.com/cloud-carbon-footprint/cloud-carbon-coefficients to the Cloud Carbon Footprint project which provides energy consumption coefficients for the various CPU architectures running on AWS, GCP and Azure.
>
> The notebook does the calculations based on the SPECpower database, then groups them by CPU so they can be injected into the main project:
>
> * [AWS](https://github.com/cloud-carbon-footprint/cloud-carbon-footprint/blob/trunk/packages/aws/src/domain/AwsFootprintEstimationConstants.ts)
> * [GCP](https://github.com/cloud-carbon-footprint/cloud-carbon-footprint/blob/trunk/packages/gcp/src/domain/GcpFootprintEstimationConstants.ts)
> * [Azure](https://github.com/cloud-carbon-footprint/cloud-carbon-footprint/blob/trunk/packages/azure/src/domain/AzureFootprintEstimationConstants.ts)
>
> This then calculates the carbon footprint based on real-usage from cloud billing data.

To add to this, here is how this is estimated today in the BoaviztAPI.

The Energizta project I mentioned before is about providing better data for this kind of power usage modeling.