Closed by pacospace 2 years ago.
It was the OCP4 internal cluster that has avx512; Rick has avx2 only. Sorry for the confusion.
ocp4:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat umip pku ospke avx512_vnni md_clear arch_capabilities

rick:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts md_clear flush_l1d
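
A quick way to pull just the SIMD-related flags from `/proc/cpuinfo` on each machine for a side-by-side comparison (a minimal sketch using standard tools, nothing cluster-specific):

```shell
# Print the unique SSE/AVX/FMA flags the kernel reports for this CPU;
# on ocp4 this includes the avx512* flags, on rick it stops at avx2.
grep -m1 '^flags' /proc/cpuinfo \
  | tr ' ' '\n' \
  | grep -E '^(sse|ssse|avx|fma|f16c)' \
  | sort -u
```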
Hey @rbo,
WDYT: can we request additional machines at Hetzner and plug them into Rick for this? Right now Rick can do avx2 but not avx512. Is it possible to request machines with that CPU flag and add them to the existing cluster?
Let me check next week: we have a limit on the number of nodes because of operate-first/hetzner-baremetal-openshift/issues/8. If we cannot add more nodes, we can replace one.
Still "next week" but unfortunately Friday :-(
We cannot add more nodes to the Rick cluster because of limitations at Hetzner and/or OpenShift.
The only option I can imagine is to replace the worker nodes, step by step, with new ones that have the feature. The only risky part is the OCS/ODF storage.
What about going ahead and replacing the workload cluster's nodes with beefier machines? @durandom, wdyt?
Current usage:
$ oc describe no -l node-role.kubernetes.io/worker= | grep -A 7 "Allocated resources:"
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests           Limits
  --------           --------           ------
  cpu                11416m (99%)       11 (95%)
  memory             29060Mi (11%)      26876Mi (10%)
  ephemeral-storage  0 (0%)             0 (0%)
  hugepages-1Gi      0 (0%)             0 (0%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests           Limits
  --------           --------           ------
  cpu                11259m (97%)       11800m (102%)
  memory             27741Mi (10%)      27798Mi (10%)
  ephemeral-storage  0 (0%)             0 (0%)
  hugepages-1Gi      0 (0%)             0 (0%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests           Limits
  --------           --------           ------
  cpu                11319m (98%)       14500m (126%)
  memory             30545018882 (11%)  40982544386 (15%)
  ephemeral-storage  100M (0%)          0 (0%)
  hugepages-1Gi      0 (0%)             0 (0%)
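
Note that these numbers are requests/limits, i.e. reservations. A quick way to cross-check them against live consumption (a sketch; it assumes cluster metrics are available so `oc adm top` returns data):

```shell
# Live CPU/memory usage per worker node.
oc adm top nodes -l node-role.kubernetes.io/worker=

# Only the cpu/memory request lines from the same per-node summary as above.
oc describe no -l node-role.kubernetes.io/worker= \
  | grep -A 7 "Allocated resources:" \
  | grep -E '^[[:space:]]*(cpu|memory)[[:space:]]'
```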
At Serverbörse, there are no machines available with the CPU feature avx512.
The only option we have is to choose a PX93 with an Intel® Xeon® W-2295 18-core CPU. I guess avx512 is available on it, can anyone confirm? @pacospace?
Pricing, incl. VAT:
CPU: Intel® Xeon® W-2295 18-Core

RAM | Disk | Price per month (excl. VAT) | Setup fee (once per order) | Price per month for 3 nodes |
---|---|---|---|---|
256 GB RAM | 1x 480 SATA SSD, 1x 960 GB NVME SSD | 223.13 | 141.96 | 669.39 |
512 GB RAM | 1x 480 SATA SSD, 1x 960 GB NVME SSD | 380.21 | 141.96 | 1140.63 |
256 GB RAM | 1x 480 SATA SSD, 1x 1.92 TB NVME SSD | 233.24 | 141.96 | 699.72 |
512 GB RAM | 1x 480 SATA SSD, 1x 1.92 TB NVME SSD | 390.32 | 141.96 | 1170.96 |
RAM is expensive, not the disks.
Based on the RAM consumption above, I suggest 256 GB, which means ~14 GB RAM per core (256/18); that should be enough. And the 1x 1.92 TB NVMe SSD version (233.24 per month).
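
Just to double-check the arithmetic behind that suggestion (plain shell, using only the numbers from the table above):

```shell
# RAM per core for the 256 GB option on the 18-core W-2295
echo '256 / 18' | bc -l    # ~14.2 GB per core

# Monthly price for three nodes of the 256 GB / 1.92 TB NVMe option
echo '233.24 * 3' | bc -l  # 699.72, matching the "3 Nodes" column
```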
I'm ok with spending more. Would this mean we can move some thoth services from the expensive balrog to rick?
And before we do this, I'd like to understand the overall utilization of the clusters better, which is somewhat blocked by the diagrams that @HumairAK is working on :)
Issues go stale after 90d of inactivity.
Mark the issue as fresh with `/remove-lifecycle stale`.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with `/close`.

/lifecycle stale
Our new Morty cluster has CPUs with the AVX2 feature:
$ oc get nodes -L cpu-feature.node.kubevirt.io/avx2 -L cpu-feature.node.kubevirt.io/avx -L cpu-feature.node.kubevirt.io/avx512
NAME                                               STATUS   ROLES          AGE   VERSION           AVX2   AVX    AVX512
morty-compute-0-private.emea.operate-first.cloud   Ready    worker         14d   v1.22.3+fdba464   true   true
morty-compute-1-private.emea.operate-first.cloud   Ready    worker         14d   v1.22.3+fdba464   true   true
morty-compute-2-private.emea.operate-first.cloud   Ready    worker         14d   v1.22.3+fdba464   true   true
morty-master-0-private.emea.operate-first.cloud    Ready    master         14d   v1.22.3+fdba464
morty-master-1-private.emea.operate-first.cloud    Ready    master         14d   v1.22.3+fdba464
morty-master-2-private.emea.operate-first.cloud    Ready    master         14d   v1.22.3+fdba464
morty-storage-0-private.emea.operate-first.cloud   Ready    infra,worker   14d   v1.22.3+fdba464
morty-storage-1-private.emea.operate-first.cloud   Ready    infra,worker   14d   v1.22.3+fdba464
morty-storage-2-private.emea.operate-first.cloud   Ready    infra,worker   14d   v1.22.3+fdba464
$
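
If a workload strictly needs AVX2, it can be pinned to those compute nodes via the `cpu-feature.node.kubevirt.io/avx2` label shown above. A minimal sketch (the pod name and image are only examples; the label value `"true"` matches the output above):

```shell
# Run a throwaway pod that only schedules onto nodes exposing the AVX2 CPU feature label.
oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: avx2-smoke-test            # example name
spec:
  nodeSelector:
    cpu-feature.node.kubevirt.io/avx2: "true"
  restartPolicy: Never
  containers:
    - name: check
      image: registry.access.redhat.com/ubi8/ubi   # any image with grep works
      command: ["sh", "-c", "grep -om1 avx2 /proc/cpuinfo || echo 'no avx2 here'"]
EOF
```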
@pacospace feel free to create a ticket or pr to onboard on morty cluster. I will close this ticket, feel free to reopen it if needed.
Great timing: with version 0.11.0, NM introduced new speed-ups also on older CPUs with avx2 only: https://github.com/neuralmagic/deepsparse/releases/ Thanks a lot @rbo!
**Is your feature request related to a problem? Please describe.**
Some ML models are optimized for certain architectures. It would be nice to get hardware with `avx2` or `avx512` capabilities in the Smaug instance.

**Describe the solution you'd like**

**Describe alternatives you've considered**
Deploy on the Rick cluster, which has hardware with avx512.

**Additional context**
Related-To: https://github.com/operate-first/support/issues/409
Related-To: https://github.com/operate-first/support/issues/408
Related-To: https://github.com/AICoE/elyra-aidevsecops-tutorial/issues/297#issuecomment-934217223
From `cat /proc/cpuinfo` in a pod on the Smaug instance, the CPU is an Intel(R) Xeon(R) CPU E5-2667 v2; the spec for this chip shows it does not support `avx2`, only `avx`. `avx2` is available on the Intel(R) Xeon(R) CPU E5-2667 from v3 onwards (https://www.cpu-world.com/Compare/422/Intel_Xeon_E5-2667_v2_vs_Intel_Xeon_E5-2667_v3.html), but `avx512` is not.

cc @riekrh @durandom @goern
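
For anyone who wants to reproduce that check, a minimal sketch from inside an already running pod on Smaug (`<namespace>` and `<pod>` are placeholders):

```shell
# Show the CPU model a pod sees and which AVX variants it exposes.
oc exec -n <namespace> <pod> -- sh -c \
  'grep -m1 "model name" /proc/cpuinfo; grep -oE "avx[0-9a-z_]*" /proc/cpuinfo | sort -u'
```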