Setup infranodes & add more resources to ocp4

rbo commented 2 years ago

Idea:

Add two appNN.ocp4... nodes, each with a GPU. (Remove the node gpu, and move the GPU from compute-0.)
Use compute-0, compute-1 and compute-2 as Infra nodes

Infra node documentation: https://access.redhat.com/solutions/5034771
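
A minimal sketch of the labeling step for the three compute nodes, assuming the common convention of marking infra nodes with the node-role.kubernetes.io/infra label. DRY_RUN and label_infra_nodes are hypothetical names used only for illustration. By default the script only prints the commands; taints and moving workloads onto the infra nodes are not covered here.

```shell
#!/usr/bin/env bash
# Sketch: print (or run) the commands that label nodes as infra.
# Node names come from the idea above; DRY_RUN and label_infra_nodes
# are hypothetical names, not part of oc.
DRY_RUN="${DRY_RUN:-1}"

label_infra_nodes() {
  local node
  for node in "$@"; do
    if [ "$DRY_RUN" = "1" ]; then
      # Only show what would be executed.
      echo "oc label node $node node-role.kubernetes.io/infra="
    else
      # Sets the infra role label with an empty value.
      oc label node "$node" node-role.kubernetes.io/infra=
    fi
  done
}

label_infra_nodes compute-0 compute-1 compute-2
```

Run it with DRY_RUN=0 to apply the labels against the cluster you are logged in to.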

github-actions[bot] commented 2 years ago

Heads up @cluster/ocp3-admin - the "cluster/ocp3" label was applied to this issue.

DanielFroehlich commented 2 years ago

Why are we doing this? The current worker nodes have 16 cores / 128G RAM, that's plenty of resources. We have to watch overall HW utilisation; with other clusters coming, we are starting to hit limits.

rbo commented 2 years ago

Nope, it's not:

$ oc describe no -l node-role.kubernetes.io/worker= |grep -A 7 "Allocated resources:"
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                12582m (81%)   18050m (116%)
  memory             48224Mi (37%)  74792Mi (58%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests       Limits
  --------           --------       ------
  cpu                15193m (98%)   19700m (127%)
  memory             64547Mi (50%)  77484Mi (60%)
  ephemeral-storage  0 (0%)         0 (0%)
  hugepages-2Mi      0 (0%)         0 (0%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                3594m (23%)  5 (32%)
  memory             9816Mi (7%)  9440Mi (7%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
--
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                3065m (87%)   10800m (308%)
  memory             7861Mi (52%)  17252Mi (115%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
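
To compare the nodes at a glance, the output above can be condensed to one CPU line per node. This is only a sketch: cpu_allocation is a hypothetical helper name, and it assumes the row layout shown above (resource, requests, percentage, limits, percentage).

```shell
#!/usr/bin/env bash
# Sketch: condense `oc describe node` output to one CPU line per node.
# cpu_allocation is a hypothetical helper name, not part of oc.
cpu_allocation() {
  # Rows look like: "cpu  12582m (81%)  18050m (116%)"
  awk '$1 == "cpu" {
    n++
    gsub(/[()]/, "", $3)   # requests percentage
    gsub(/[()]/, "", $5)   # limits percentage
    printf "node %d: cpu requests %s, limits %s\n", n, $3, $5
  }'
}

# Demo on two rows copied from the output above.
cpu_allocation <<'EOF'
  cpu                12582m (81%)   18050m (116%)
  cpu                15193m (98%)   19700m (127%)
EOF
```

Against a live cluster the same helper can be fed directly: oc describe no -l node-role.kubernetes.io/worker= | cpu_allocation. Nodes are numbered in output order, not by name.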

We don't have much workload on it, and it's hard to update or change anything because the cluster is busy with itself (OCS, logging, ...).

It might be useful to join some clusters, so that more resources are available for workloads and fewer go to the control plane. For example: join ocp5 & ocp4, because AI/ML & VM workload is more fun with OCS. Just an idea; we have to discuss it in detail on the next stormshift call or via gchat.

github-actions[bot] commented 2 years ago

Heads up @cluster/ocp4-admin - the "cluster/ocp4" label was applied to this issue.

DanielFroehlich commented 2 years ago

Masters are not schedulable and an additional GPU worker node is up and running, so I am closing this issue for now.


Setup infranodes & add more resources to ocp4 #52