Open ArneTR opened 6 months ago
Hi Arne, concering the potential issues: We are going to discuss on 30.5., but a little feedback beforehand:
Another problem we faced: On our "playground" Gitlab the CPU is not clearly specified, see output below:
$ cat /proc/cpuinfo
++ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 61
model name : Intel Core Processor (Broadwell)
stepping : 2
microcode : 0x1
cpu MHz : 2099.998
cache size : 16384 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds mmio_unknown bhi
bogomips : 4199.99
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 61
model name : Intel Core Processor (Broadwell)
stepping : 2
microcode : 0x1
cpu MHz : 2099.998
cache size : 16384 KB
physical id : 1
siblings : 1
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds mmio_unknown bhi
bogomips : 4199.99
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 61
model name : Intel Core Processor (Broadwell)
stepping : 2
microcode : 0x1
cpu MHz : 2099.998
cache size : 16384 KB
physical id : 2
siblings : 1
core id : 0
cpu cores : 1
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds mmio_unknown bhi
bogomips : 4199.99
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 61
model name : Intel Core Processor (Broadwell)
stepping : 2
microcode : 0x1
cpu MHz : 2099.998
cache size : 16384 KB
physical id : 3
siblings : 1
core id : 0
cpu cores : 1
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds mmio_unknown bhi
bogomips : 4199.99
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
++ echo '$ lscpu'
++ lscpu
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel Core Processor (Broadwell)
CPU family: 6
Model: 61
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 4
Stepping: 2
BogoMIPS: 4199.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 128 KiB (4 instances)
L1i cache: 128 KiB (4 instances)
L2 cache: 16 MiB (4 instances)
L3 cache: 64 MiB (4 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Unknown: No mitigations
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Vulnerability Srbds: Unknown: Dependent on hypervisor status
Vulnerability Tsx async abort: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Compare output on another machine:
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
CPU family: 6
Model: 61
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
Stepping: 4
CPU(s) scaling MHz: 57%
CPU max MHz: 2700,0000
CPU min MHz: 500,0000
BogoMIPS: 4389,62
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb r
dtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est
tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowpref
etch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep
bmi2 erms invpcid rdseed adx smap intel_pt xsaveopt dtherm ida arat pln pts md_clear flush_l1d
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 64 KiB (2 instances)
L1i: 64 KiB (2 instances)
L2: 512 KiB (2 instances)
L3: 3 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: KVM: Mitigation: Split huge pages
L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Meltdown: Mitigation; PTI
Mmio stale data: Unknown: No mitigations
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Srbds: Mitigation; Microcode
Tsx async abort: Not affected
We think this might be a problem for model accuracy. Will knowing the TDP help that, or is it a general problem? Do you have experience with this problem?
This looks like an Intel Core i7-5650U from the specs. However, since you are operating within a virtualized environment, the exact physical CPU could be different and the hypervisor is abstracting and presenting it as a Broadwell processor. But I would assume it is fair to take the i7 as a reference for the energy usage. This is a fundamental problem with virtualises environments that you never really know what is underneath to hood and can't really be solved IMHO.
So how do we overrule the hyporvisor and present Intel Core i7-5650U to Eco-CI?
This is an example of the variables you need to determine for your box, if the specs are different to the know shared Gitlab / Github default runners: https://github.com/green-coding-solutions/eco-ci-energy-estimation/blob/7333e9a8e8036e5bf69d59dc8bb63d984b0f55c8/scripts/vars.sh#L150C1-L162C54
If you just post the variables here in the chat I will make a demo integration ready how this would be supplied.
Since you are going for a full custom system with also a masked hypervisor CPUID we would add a functionality that you can specify all of these variables in the ci workflow file directly as variables.
For these custom cases this is more suitable I think
Hey @jochen-schuettler ,
we have a new variant https://github.com/green-coding-solutions/eco-ci-energy-estimation/pull/76 which removes all of the dependencies, needs no docker at all and just uses basic linux commandline tools.
Happy for your feedback / opinion
Hi, we tested this new approach and it works on our Gitlab test-system. Moving the installation of packages out of the .eco-ci-gitlab.yml is a good idea to be distro agnostic. The installation of those takes around 12 seconds on this system. Maybe caching helps reducing the needed time. The planned loading mechanism for loading custom machines power data for the future is a desirable feature, because each data file has ~450kb and will bloat the repository otherwise, when lots of them get added.
This looks like an Intel Core i7-5650U from the specs.
We know now that it is a XEON E5-2620 v4. Generating the power data file and using it by replacing the default.sh for quick testing worked so far.
The planned loading mechanism for loading custom machines power data for the future is a desirable feature, because each data file has ~450kb and will bloat the repository otherwise, when lots of them get added.
I agree. The idea is to have all default machines from GitHub and GitLab shipped in repository, which should equate to around 10 files. Everything else would then go through a custom route.
In different discussions off-Github and on-Github a problem with Eco-CI came to light:
github.com/cache
action can be used. On Gitlab it does not work atm. See issueWhat comes to mind is packaging Eco-CI in a docker image and running it from there.
Looping in @anitaschuettler and @jochen-schuettler and @ribalba
Some potential issues need to be discussed before integration:
docker
CLI available on Github? To my knowlege the docker-in-docker that is needed for Gitlab is not the default?/proc/stat
readable? Especially when using a different container runtime this directory might not be readable or it might not contain all CPUs of the system as it might be hidden ... [Example Nestybox which is typically used for stronger isolation]The docker container might solve many issues like the current cost of installation, non-root capabilities but also be more versatile.