operate-first / support

This repo should serve as a central source for users to raise issues/questions/requests for Operate First.

Get hardware with avx2 or avx512 capabilities in smaug instance #419

Closed: pacospace closed this issue 2 years ago

pacospace commented 3 years ago

Is your feature request related to a problem? Please describe.
Some ML models are optimized for certain CPU architectures. It would be nice to get hardware with avx2 or avx512 capabilities in the smaug instance.

Describe the solution you'd like

Describe alternatives you've considered
Deploy on the rick cluster, which has hardware with avx512.

Additional context
Related-To: https://github.com/operate-first/support/issues/409
Related-To: https://github.com/operate-first/support/issues/408
Related-To: https://github.com/AICoE/elyra-aidevsecops-tutorial/issues/297#issuecomment-934217223

From cat /proc/cpuinfo in a pod on the smaug instance I get:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 62
model name      : Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d

The spec sheet for this chip shows that the Intel(R) Xeon(R) CPU E5-2667 v2 supports only avx, not avx2. avx2 is available in the Intel(R) Xeon(R) CPU E5-2667 v3 and later (https://www.cpu-world.com/Compare/422/Intel_Xeon_E5-2667_v2_vs_Intel_Xeon_E5-2667_v3.html), but avx512 is not.
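For anyone who wants to reproduce the check, the avx-related flags can be filtered straight out of /proc/cpuinfo from a terminal in a pod; on smaug this prints only avx, matching the flags line above:

$ grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep '^avx'   # keep only the avx* flags of the first CPU entry
avx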

cc @riekrh @durandom @goern

pacospace commented 3 years ago

It was the internal ocp4 cluster that has avx512; rick has avx2 only. Sorry for the confusion.

ocp4

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities

rick

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d
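For reference, a quick way to compare just the vector-instruction flags of the two dumps above; the file names are placeholders, assuming each flags line has been saved to a file:

$ # ocp4_flags.txt / rick_flags.txt are hypothetical files holding the two flags lines above
$ for f in ocp4_flags.txt rick_flags.txt; do echo "== $f"; tr ' ' '\n' < "$f" | grep '^avx' | sort -u; done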

tumido commented 3 years ago

Hey @rbo,

WDYT, can we request additional machines at Hetzner and plug them into Rick for this? Right now rick can do avx2, but it cannot do avx512. Is it possible to request machines with that CPU flag and add them to the existing cluster?

rbo commented 3 years ago

Let me check next week; we have a limit on the number of nodes because of operate-first/hetzner-baremetal-openshift/issues/8. If we cannot add more nodes, we can replace one. But let me check next week.

rbo commented 3 years ago

Still "next week", but unfortunately it is already Friday :-(

We cannot add more nodes to the Rick cluster because of limitations at Hetzner and/or OpenShift.

The only option I can imagine is to replace the worker nodes one by one with new machines that have the feature. The only risky part is the OCS/ODF storage. A rough sketch of that per-node flow is below.
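Not a definitive procedure, but the step-by-step replacement would roughly follow the usual OpenShift node drain flow, pausing between nodes until the storage is healthy again (node name is a placeholder):

$ oc adm cordon <old-worker>                                              # keep new pods off the node
$ oc adm drain <old-worker> --ignore-daemonsets --delete-emptydir-data    # evict workloads, including OCS/ODF pods
$ oc delete node <old-worker>                                             # remove it before returning the machine

Then provision the replacement machine with the new CPU, join it as a worker, wait for OCS/ODF to report healthy again, and repeat for the next node.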

goern commented 3 years ago

What about going ahead and replacing the workload cluster's nodes with beefier machines? @durandom wdyt?

rbo commented 3 years ago

Current node overview:

Current usage:

$ oc describe no -l node-role.kubernetes.io/worker= |grep -A 7 "Allocated resources:"
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests       Limits
  --------                       --------       ------
  cpu                            11416m (99%)   11 (95%)
  memory                         29060Mi (11%)  26876Mi (10%)
  ephemeral-storage              0 (0%)         0 (0%)
  hugepages-1Gi                  0 (0%)         0 (0%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests       Limits
  --------                       --------       ------
  cpu                            11259m (97%)   11800m (102%)
  memory                         27741Mi (10%)  27798Mi (10%)
  ephemeral-storage              0 (0%)         0 (0%)
  hugepages-1Gi                  0 (0%)         0 (0%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                       Requests           Limits
  --------                       --------           ------
  cpu                            11319m (98%)       14500m (126%)
  memory                         30545018882 (11%)  40982544386 (15%)
  ephemeral-storage              100M (0%)          0 (0%)
  hugepages-1Gi                  0 (0%)             0 (0%)

Potential new node options

At Serverbörse, there are no machines available with the CPU feature avx512.

The only option we have is to choose the PX93 with an Intel® Xeon® W-2295 18-Core CPU. I guess avx512 is available, can anyone confirm? @pacospace?

Pricing, incl VAT:

CPU: Intel® Xeon® W-2295 18-Core

| RAM | Disk | Price per month (excl VAT) | Setup fee (once per order) | Price per month for 3 nodes |
| --- | --- | --- | --- | --- |
| 256 GB | 1x 480 GB SATA SSD, 1x 960 GB NVME SSD | 223.13 | 141.96 | 669.39 |
| 512 GB | 1x 480 GB SATA SSD, 1x 960 GB NVME SSD | 380.21 | 141.96 | 1140.63 |
| 256 GB | 1x 480 GB SATA SSD, 1x 1.92 TB NVME SSD | 233.24 | 141.96 | 699.72 |
| 512 GB | 1x 480 GB SATA SSD, 1x 1.92 TB NVME SSD | 390.32 | 141.96 | 1170.96 |

RAM is expensive, not the disks.

Based on the RAM consumption above, I suggest 256 GB, which means about 14 GB of RAM per core (256/18) and should be enough, together with the 1x 1.92 TB NVME SSD version (233.24 per month per node).
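A quick sanity check of those figures (assuming the table prices are per node per month):

$ awk 'BEGIN { printf "RAM per core : %.1f GB\n3 nodes/month: %.2f\n", 256/18, 3*233.24 }'
RAM per core : 14.2 GB
3 nodes/month: 699.72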

ToDos:

durandom commented 3 years ago

I'm ok with spending more. Would this mean we can move some thoth services from the expensive balrog to rick?

And before we do this, I'd like to understand overall utilization of the clusters better, which is somewhat blocked by the diagrams that @HumairAK is working on :)
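Until those diagrams are ready, a rough live snapshot of actual usage (as opposed to the requests/limits quoted above) is available from the built-in metrics, assuming cluster monitoring is running:

$ oc adm top nodes -l node-role.kubernetes.io/worker=   # actual CPU/memory consumption per worker node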

sesheta commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

rbo commented 2 years ago

Our new Morty cluster has CPUs with the AVX2 feature:

$ oc get nodes -L cpu-feature.node.kubevirt.io/avx2 -L cpu-feature.node.kubevirt.io/avx -L cpu-feature.node.kubevirt.io/avx512
NAME                                               STATUS   ROLES          AGE   VERSION           AVX2   AVX    AVX512
morty-compute-0-private.emea.operate-first.cloud   Ready    worker         14d   v1.22.3+fdba464   true   true
morty-compute-1-private.emea.operate-first.cloud   Ready    worker         14d   v1.22.3+fdba464   true   true
morty-compute-2-private.emea.operate-first.cloud   Ready    worker         14d   v1.22.3+fdba464   true   true
morty-master-0-private.emea.operate-first.cloud    Ready    master         14d   v1.22.3+fdba464
morty-master-1-private.emea.operate-first.cloud    Ready    master         14d   v1.22.3+fdba464
morty-master-2-private.emea.operate-first.cloud    Ready    master         14d   v1.22.3+fdba464
morty-storage-0-private.emea.operate-first.cloud   Ready    infra,worker   14d   v1.22.3+fdba464
morty-storage-1-private.emea.operate-first.cloud   Ready    infra,worker   14d   v1.22.3+fdba464
morty-storage-2-private.emea.operate-first.cloud   Ready    infra,worker   14d   v1.22.3+fdba464
$
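For workloads that need these instructions, the kubevirt feature labels shown above can be used as a node selector; a minimal sketch with a hypothetical deployment name and namespace:

$ oc patch deployment my-ml-workload -n my-namespace --type merge \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"cpu-feature.node.kubevirt.io/avx2":"true"}}}}}'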

rbo commented 2 years ago

@pacospace feel free to create a ticket or PR to onboard onto the morty cluster. I will close this ticket; feel free to reopen it if needed.

pacospace commented 2 years ago

> @pacospace feel free to create a ticket or PR to onboard onto the morty cluster. I will close this ticket; feel free to reopen it if needed.

Great timing! Neural Magic introduced new speedups in version 0.11.0, also for older CPUs with avx2 only: https://github.com/neuralmagic/deepsparse/releases/ Thanks a lot @rbo!