Tested this out on my HyperShift cluster and got this error when following the InstaScale guided demo:
```
I1108 14:05:43.091423 1 nodepools.go:46] The instanceRequired array: m5.xlarge
I1108 14:05:43.091445 1 nodepools.go:54] Built NodePool with instance type m5.xlarge and name instascaletest-m5-xlarge
E1108 14:05:43.096248 1 nodepools.go:57] Error creating NodePool: status is 400, identifier is '400', code is 'CLUSTERS-MGMT-400' and operation identifier is 'dac380af-7a33-4b1a-a6aa-dcb353237a44': NodePool name 'instascaletest-m5-xlarge' is 24 characters long - its length exceeds the maximum length allowed of 15 characters
I1108 14:05:43.096274 1 nodepools.go:59] Created NodePool: &{400 map[Content-Type:[application/json] Date:[Wed, 08 Nov 2023 14:05:43 GMT] Server:[envoy] Vary:[Accept-Encoding] X-Envoy-Upstream-Service-Time:[2] X-Operation-Id:[dac380af-7a33-4b1a-a6aa-dcb353237a44]] 0xc00068d8f0 <nil>}
I1108 14:05:43.170794 1 nodepools.go:46] The instanceRequired array: g4dn.xlarge
I1108 14:05:43.170815 1 nodepools.go:54] Built NodePool with instance type g4dn.xlarge and name instascaletest-g4dn-xlarge
E1108 14:05:43.175629 1 nodepools.go:57] Error creating NodePool: status is 400, identifier is '400', code is 'CLUSTERS-MGMT-400' and operation identifier is 'd573d9d0-598b-494a-bc9c-cc7a5ba2c1dd': NodePool name 'instascaletest-g4dn-xlarge' is 26 characters long - its length exceeds the maximum length allowed of 15 characters
I1108 14:05:43.175646 1 nodepools.go:59] Created NodePool: &{400 map[Content-Type:[application/json] Date:[Wed, 08 Nov 2023 14:05:43 GMT] Server:[envoy] Vary:[Accept-Encoding] X-Envoy-Upstream-Service-Time:[2] X-Operation-Id:[d573d9d0-598b-494a-bc9c-cc7a5ba2c1dd]] 0xc00068da40 <nil>}
```
Despite the odd error about the name being too long, the log claims it created the NodePool. I checked the Nodes in the Compute section and no new nodes were created.
Hey @Bobbins228, this is something I've raised an issue for. The NodePool name character limit is 15, so after appending the node type the name quickly exceeds the limit. This is also the case for MachinePools, which have a character limit of 30.
I'm hoping to address this soon. However, for now you're right: it should at least stop building the NodePool when the error is returned.
Specifically with NodePools, after appending the additional information to the name, even a 4-character name can exceed the limit, so this may require a new naming convention.
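To illustrate, here is a minimal Go sketch of one possible convention, assuming names are built as `<appwrapper>-<instance-type>` as in the logs above. `shortenName` and the truncate-plus-hash scheme are hypothetical, not InstaScale's actual implementation:

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
)

const maxNodePoolNameLen = 15 // the limit reported by the 400 error above

// shortenName builds "<appwrapper>-<instanceType>" and, when it exceeds the
// limit, truncates it and appends a short hash of the full name so that
// distinct AppWrapper/instance-type pairs still map to unique names.
func shortenName(appWrapper, instanceType string) string {
	full := appWrapper + "-" + instanceType
	if len(full) <= maxNodePoolNameLen {
		return full
	}
	sum := sha1.Sum([]byte(full))
	suffix := hex.EncodeToString(sum[:])[:5] // 5 hex chars keep collisions unlikely
	return full[:maxNodePoolNameLen-len(suffix)-1] + "-" + suffix
}

func main() {
	// "instascaletest-g4dn-xlarge" (26 chars) shrinks to a 15-char name.
	fmt.Println(shortenName("instascaletest", "g4dn-xlarge"))
}
```

The trade-off is that the instance type is no longer readable from the name, so it would likely need to be carried in a label or annotation instead.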
Ran through the nodepools e2e test using this PR on a Hypershift cluster. Functionality works as expected. This is great!! :) My only comment is that the logs are a bit crowded from the finalizeScalingDownMachines func being called in the reconcile function.
Hey @Fiona-Waters, thanks for pointing this out. I have replicated the issue here too. You are correct: this is due to the finalizeScalingDownMachines func being called in reconcile. Would it be appropriate to add `hasCompletedScaleDown`, or an equivalent, as a field in the AppWrapper?
Maybe we could check the status of the node pool and only print the log line if it is "deleting". I guess we would have to do this for machine pools and machine sets too. It seems like a bit of work to remove some extra log lines, so of course it's up to you how to proceed. I'm not sure about adding a field to the AppWrapper; would it only update after the node pool has finished deleting, and therefore not remove any log lines?
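For illustration, a minimal Go sketch of the status-gated logging idea: `getNodePoolStatus` and the `"deleting"` status string are hypothetical stand-ins for whatever OCM lookup InstaScale actually performs, not real SDK calls:

```go
package main

import "k8s.io/klog/v2"

// finalizeScalingDownMachines emits the scale-down log only when a deletion
// is actually in progress, rather than on every reconcile pass.
func finalizeScalingDownMachines(appWrapperName string) {
	status, err := getNodePoolStatus(appWrapperName) // hypothetical helper
	if err != nil {
		klog.Errorf("failed to fetch NodePool status for %s: %v", appWrapperName, err)
		return
	}
	if status == "deleting" { // assumed status value; gate the noisy line on it
		klog.Infof("Scaling down NodePool for AppWrapper %s", appWrapperName)
	}
}

// getNodePoolStatus is a placeholder for the OCM lookup InstaScale would
// perform (with MachinePool/MachineSet equivalents, as noted above).
func getNodePoolStatus(name string) (string, error) {
	return "deleting", nil
}

func main() {
	finalizeScalingDownMachines("instascaletest")
}
```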
Re-ran on a Hypershift cluster, can confirm that it works as expected. Great work Eoin! /lgtm
Moving this to WIP temporarily as nodes appear to be re-scaling after deletion in Hypershift.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: anishasthana
Issue link
Fixes https://github.com/project-codeflare/instascale/issues/80
Closes https://github.com/project-codeflare/instascale/issues/175
This is a PR to allow for InstaScale integration with Hypershift.
What changes have been made
The PR also contains changes to remove irrelevant logging and repeated loop logging.
Verification steps
Create a Hypershift cluster. Deploy the stack. Create an InstaScale-enabled AppWrapper, and ensure the NodePool successfully scales up and down.
Checks