Docker container for Eco-CI

ArneTR commented 6 months ago

In several discussions, both off and on GitHub, a few problems with Eco-CI came to light:

- When using a different base image, the installation might fail because the underlying OS / distribution might not support the Ubuntu / Debian packages we want to install.
  - This is even more problematic on GitLab, as some images there might not even have root rights, so no installation is possible at all. See the example of the KDE Okular integration we tried.
- Generally, the installation of the packages is slow, and caching does not necessarily help: it only works where the integrated github.com/cache action can be used. On GitLab it does not work at the moment. See issue.
- Also, some systems might not allow root access, as mentioned before, so we cannot install anyway.

What comes to mind is packaging Eco-CI in a docker image and running it from there.

Looping in @anitaschuettler and @jochen-schuettler and @ribalba

Some potential issues need to be discussed before integration:

1. Can we always expect to have the docker CLI available on GitHub? To my knowledge, the docker-in-docker setup that is needed for GitLab is not the default?
2. Can we always expect /proc/stat to be readable? Especially with a different container runtime, this file might not be readable, or it might not contain all CPUs of the system, as some might be hidden ... [Example: Nestybox, which is typically used for stronger isolation]
3. How much time does it take to download the docker image compared to using the current cached install mechanism?
4. How many different containers do we have to create in the end? The idea is to make a trimmed-down Ubuntu/Python container ... but which architectures do we need to support?

The docker container might solve many issues, such as the current cost of installation and the lack of root rights, while also being more versatile.
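The /proc/stat question above can be checked directly on a runner. The following is a minimal sketch, not part of Eco-CI: the function names `count_cpu_lines` and `check_proc_stat` are made up for illustration. It reports whether the file is readable and whether the number of per-CPU lines matches what the OS reports.

```python
import os
from pathlib import Path


def count_cpu_lines(stat_text: str) -> int:
    """Count per-CPU lines (cpu0, cpu1, ...) in /proc/stat-style text."""
    count = 0
    for line in stat_text.splitlines():
        fields = line.split()
        # The aggregate line is named "cpu"; per-CPU lines are "cpu<N>".
        if fields and fields[0].startswith("cpu") and fields[0][3:].isdigit():
            count += 1
    return count


def check_proc_stat(path: str = "/proc/stat") -> dict:
    """Report whether the file is readable and how many CPUs it lists."""
    try:
        text = Path(path).read_text()
    except OSError as exc:
        # Covers a missing file as well as missing read permission.
        return {"readable": False, "error": str(exc)}
    return {
        "readable": True,
        "cpus_in_stat": count_cpu_lines(text),
        "cpus_reported_by_os": os.cpu_count(),
    }
```

Calling `check_proc_stat()` inside the CI job and printing the result would show whether the container runtime hides the file or some of the CPUs; a mismatch between the two counts is only a hint, not proof, of masking.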

jochen-schuettler commented 6 months ago

Hi Arne, concerning the potential issues: we are going to discuss them on 30.5., but here is a little feedback beforehand:

1. You need to find that out; we can't help there.
2. As you already said, it's possible that it is not readable.
3. We are sure that installing in every job on each pipeline run takes much longer than pulling a deduced image that seldom changes. The install caching is not active across jobs with different images, but pulled images are cached.
4. Let's talk about that.

jochen-schuettler commented 6 months ago

Another problem we faced: On our "playground" Gitlab the CPU is not clearly specified, see output below:

$ cat /proc/cpuinfo
++ cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 61
model name  : Intel Core Processor (Broadwell)
stepping    : 2
microcode   : 0x1
cpu MHz     : 2099.998
cache size  : 16384 KB
physical id : 0
siblings    : 1
core id     : 0
cpu cores   : 1
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds mmio_unknown bhi
bogomips    : 4199.99
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model       : 61
model name  : Intel Core Processor (Broadwell)
stepping    : 2
microcode   : 0x1
cpu MHz     : 2099.998
cache size  : 16384 KB
physical id : 1
siblings    : 1
core id     : 0
cpu cores   : 1
apicid      : 1
initial apicid  : 1
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds mmio_unknown bhi
bogomips    : 4199.99
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
processor   : 2
vendor_id   : GenuineIntel
cpu family  : 6
model       : 61
model name  : Intel Core Processor (Broadwell)
stepping    : 2
microcode   : 0x1
cpu MHz     : 2099.998
cache size  : 16384 KB
physical id : 2
siblings    : 1
core id     : 0
cpu cores   : 1
apicid      : 2
initial apicid  : 2
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds mmio_unknown bhi
bogomips    : 4199.99
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
processor   : 3
vendor_id   : GenuineIntel
cpu family  : 6
model       : 61
model name  : Intel Core Processor (Broadwell)
stepping    : 2
microcode   : 0x1
cpu MHz     : 2099.998
cache size  : 16384 KB
physical id : 3
siblings    : 1
core id     : 0
cpu cores   : 1
apicid      : 3
initial apicid  : 3
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds mmio_unknown bhi
bogomips    : 4199.99
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
++ echo '$ lscpu'
++ lscpu
$ lscpu
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        40 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               4
On-line CPU(s) list:                  0-3
Vendor ID:                            GenuineIntel
Model name:                           Intel Core Processor (Broadwell)
CPU family:                           6
Model:                                61
Thread(s) per core:                   1
Core(s) per socket:                   1
Socket(s):                            4
Stepping:                             2
BogoMIPS:                             4199.99
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            128 KiB (4 instances)
L1i cache:                            128 KiB (4 instances)
L2 cache:                             16 MiB (4 instances)
L3 cache:                             64 MiB (4 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-3
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                   Mitigation; PTE Inversion
Vulnerability Mds:                    Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Meltdown:               Mitigation; PTI
Vulnerability Mmio stale data:        Unknown: No mitigations
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Vulnerable
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Retpoline
Vulnerability Srbds:                  Unknown: Dependent on hypervisor status
Vulnerability Tsx async abort:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown

Compare output on another machine:

lscpu
Architecture:             x86_64
  CPU op-mode(s):         32-bit, 64-bit
  Address sizes:          39 bits physical, 48 bits virtual
  Byte Order:             Little Endian
CPU(s):                   4
  On-line CPU(s) list:    0-3
Vendor ID:                GenuineIntel
  Model name:             Intel(R) Core(TM) i5-5200U CPU @ 2.20GHz
    CPU family:           6
    Model:                61
    Thread(s) per core:   2
    Core(s) per socket:   2
    Socket(s):            1
    Stepping:             4
    CPU(s) scaling MHz:   57%
    CPU max MHz:          2700,0000
    CPU min MHz:          500,0000
    BogoMIPS:             4389,62
    Flags:                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap intel_pt xsaveopt dtherm ida arat pln pts md_clear flush_l1d
Virtualization features:  
  Virtualization:         VT-x
Caches (sum of all):      
  L1d:                    64 KiB (2 instances)
  L1i:                    64 KiB (2 instances)
  L2:                     512 KiB (2 instances)
  L3:                     3 MiB (1 instance)
NUMA:                     
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-3
Vulnerabilities:          
  Gather data sampling:   Not affected
  Itlb multihit:          KVM: Mitigation: Split huge pages
  L1tf:                   Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                    Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:               Mitigation; PTI
  Mmio stale data:        Unknown: No mitigations
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                  Mitigation; Microcode
  Tsx async abort:        Not affected

We think this might be a problem for model accuracy. Would knowing the TDP help with that, or is it a general problem? Do you have experience with this problem?
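The difference between the two outputs above can be detected programmatically: on Linux, the `hypervisor` entry in the CPU flags indicates a virtualized guest, in which case the model name may be whatever the hypervisor chooses to present. A minimal sketch, with the hypothetical helper names `parse_cpuinfo` and `runs_under_hypervisor`:

```python
def parse_cpuinfo(text: str) -> dict:
    """Return the key/value pairs of the first processor block."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            if info:
                break  # a blank line ends the first block
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key in info:
            break  # a repeated key means the next block has started
        info[key] = value.strip()
    return info


def runs_under_hypervisor(info: dict) -> bool:
    """True if the CPU flags contain the 'hypervisor' entry."""
    return "hypervisor" in info.get("flags", "").split()
```

Feeding it the contents of /proc/cpuinfo lets a job warn that the reported model name should not be trusted for selecting a power model. It cannot reveal the physical CPU underneath.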

ribalba commented 6 months ago

This looks like an Intel Core i7-5650U from the specs. However, since you are operating within a virtualized environment, the exact physical CPU could be different, with the hypervisor abstracting it and presenting it as a Broadwell processor. I would still assume it is fair to take the i7 as a reference for the energy usage. This is a fundamental problem with virtualized environments: you never really know what is under the hood, and IMHO it can't really be solved.

jochen-schuettler commented 6 months ago

So how do we overrule the hypervisor and present Intel Core i7-5650U to Eco-CI?

ArneTR commented 6 months ago

This is an example of the variables you need to determine for your box if the specs differ from the known shared GitLab / GitHub default runners: https://github.com/green-coding-solutions/eco-ci-energy-estimation/blob/7333e9a8e8036e5bf69d59dc8bb63d984b0f55c8/scripts/vars.sh#L150C1-L162C54

If you just post the variables here in the chat, I will prepare a demo integration showing how they would be supplied.

Since you are running a fully custom system, with the hypervisor also masking the CPUID, we would add functionality that lets you specify all of these variables directly in the CI workflow file.

For such custom cases I think this is more suitable.
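How such workflow-level overrides could be resolved is sketched below. This only illustrates the idea and is not Eco-CI's implementation: the variable names, the default values, and the `ECO_CI_` prefix are all assumptions made for the example.

```python
import os
from typing import Mapping

# Hypothetical variable names and defaults, chosen for illustration only.
DEFAULT_SPECS = {
    "MODEL_NAME": "default-runner",
    "TDP": "100",
    "CPU_THREADS": "4",
    "CPU_CORES": "2",
}
PREFIX = "ECO_CI_"  # hypothetical prefix for overrides set in the workflow file


def resolve_specs(env: Mapping[str, str] = os.environ) -> dict:
    """Start from the defaults and apply any non-empty override found in env."""
    specs = dict(DEFAULT_SPECS)
    for name in DEFAULT_SPECS:
        override = env.get(PREFIX + name)
        if override:
            specs[name] = override
    return specs
```

A workflow file would then only set the variables that differ from the default runner, for example `ECO_CI_TDP`, and everything left unset keeps its default.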

ArneTR commented 5 months ago

Hey @jochen-schuettler ,

we have a new variant, https://github.com/green-coding-solutions/eco-ci-energy-estimation/pull/76, which removes all of the dependencies, needs no docker at all, and uses only basic Linux command-line tools.

Happy for your feedback / opinion

timoschroeder1213 commented 5 months ago

Hi, we tested this new approach and it works on our GitLab test system. Moving the installation of packages out of .eco-ci-gitlab.yml is a good idea for staying distro-agnostic. Installing them takes around 12 seconds on this system; caching might help reduce that time. The planned mechanism for loading power data for custom machines is a desirable feature, because each data file is ~450 kB and the repository would otherwise bloat once many of them are added.

timoschroeder1213 commented 5 months ago

> This looks like an Intel Core i7-5650U from the specs.

We now know that it is a Xeon E5-2620 v4. Generating the power data file and using it in place of default.sh for quick testing has worked so far.

ArneTR commented 5 months ago

> The planned mechanism for loading power data for custom machines is a desirable feature, because each data file is ~450 kB and the repository would otherwise bloat once many of them are added.

I agree. The idea is to ship the data for all default machines from GitHub and GitLab in the repository, which should amount to around 10 files. Everything else would then go through a custom route.

What would be the next steps for you, and what are the requirements for trying this out with a potential client project, as discussed in the last call? Do you want to work with a forked version of the repo where you patch default.sh according to your needs? Which features do you need to put it to a trial?
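The split between shipped default machines and a custom route could look roughly like the sketch below. The machine identifiers, the directory name, and the function `power_data_file` are hypothetical and only illustrate the lookup order; the actual loading mechanism is not described in this thread.

```python
from pathlib import Path
from typing import Optional

# Hypothetical machine identifiers and directory name, for illustration only.
SHIPPED_MACHINES = {"github-default", "gitlab-default"}
SHIPPED_DIR = "machine-power-data"


def power_data_file(machine: str, repo_dir: Path,
                    custom_file: Optional[Path] = None) -> Path:
    """Pick the shipped data file for known machines, else the custom one."""
    if machine in SHIPPED_MACHINES:
        return repo_dir / SHIPPED_DIR / f"{machine}.sh"
    if custom_file is None:
        raise ValueError(
            f"no shipped power data for {machine!r}; supply a custom file"
        )
    return custom_file
```

With this order, the repository only ever contains the handful of default files, while a custom machine has to bring its own data file from outside the repository.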

green-coding-solutions / eco-ci-energy-estimation

Docker container for Eco-CI #70