neoave / mrack

Multicloud use-case based multihost async provisioner for CIs and testing during development
Apache License 2.0

feat: Do not use same sleep for every mrack run #229

Closed Tiboris closed 1 year ago

Tiboris commented 1 year ago

feat: Do not use same sleep for every mrack run

Add logic to randomize the sleeps: the waiting timeout now varies randomly
between 45 and 75 minutes, and the poll interval accordingly between 45 and
75 seconds. When multiple mrack runs execute in parallel, each waits a
different amount of time before polling, which increases the chance of
getting resources sooner than the other concurrent mrack requests.

Add a cooldown after an errored host is found: wait a random time between
one poll sleep and twice the poll sleep (one poll being 45-75 s) before the
next retry and resource check cycle.
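
A minimal sketch of the intended randomization, assuming hypothetical names (this is not the actual mrack code):

```python
import random

# Illustrative sketch only; function and variable names are hypothetical
# and do not reflect the actual mrack implementation.
def pick_sleep_times():
    # A single random factor keeps the two values proportional:
    # poll interval 45-75 seconds, overall wait timeout 45-75 minutes.
    factor = random.uniform(45, 75)
    poll_sleep = factor          # seconds between resource polls
    wait_timeout = factor * 60   # total seconds to wait for resources
    # Cooldown after an errored host: somewhere between one poll sleep and
    # twice the poll sleep before the next retry and resource check cycle.
    cooldown = random.uniform(poll_sleep, 2 * poll_sleep)
    return poll_sleep, wait_timeout, cooldown
```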

Add a utilization method template to decide whether to destroy all machines
or keep them while re-provisioning, in order to free some resources when the
provider is highly utilized.

feat(AWS): Add utilization check method

feat(OpenStack): Add utilization check method
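
A rough template of the utilization check (the method names and threshold below are assumptions, not the actual AWS/OpenStack implementations):

```python
# Hypothetical template only; the real provider methods may differ.
UTILIZATION_THRESHOLD = 80  # percent; assumed cut-off for "highly utilized"

class Provider:
    async def utilization(self) -> float:
        """Return the provider's current resource utilization in percent."""
        raise NotImplementedError()

    async def should_destroy_all(self) -> bool:
        # When the provider is highly utilized, keeping partially provisioned
        # hosts only blocks other runs, so free everything before retrying.
        return await self.utilization() >= UTILIZATION_THRESHOLD
```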

minor fix: typo _opentack -> _openstack

refactor(OpenStack): make private openstack methods truly private

Rename the OpenStack methods meant to be internal so that they are truly
private. OpenStack had too many public methods, and keeping only the needed
ones public seems to be the proper fix.

Signed-off-by: Tibor Dudlák tdudlak@redhat.com

Tiboris commented 1 year ago

Oh, I forgot to rewrite the calls in the test code. Will do.

Tiboris commented 1 year ago

Failure caused by:

Error: 
 Problem: package mrack-1.12.3-4.el8.noarch requires python3-mrack-beaker = 1.12.3-4.el8, but none of the providers can be installed
  - cannot install the best candidate for the job
  - nothing provides beaker-client needed by python3-mrack-beaker-1.12.3-4.el8.noarch

A possible resolution is to use Suggests or Recommends for the package instead of Requires, to work around such errors.
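
For illustration only, a weak dependency in the spec file would look roughly like this (a sketch, not the actual mrack.spec contents):

```spec
# Weak dependency: the package can still be installed when beaker-client
# is not available in the repositories.
Recommends:     beaker-client
# instead of the hard dependency:
# Requires:     beaker-client
```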

Tiboris commented 1 year ago

/packit test

Tiboris commented 1 year ago

Depends on https://github.com/neoave/mrack/pull/233

dav-pascual commented 1 year ago

Hey @Tiboris, these changes look good at first sight. Could you help me understand the different sleep times that you've put in place and what they solve?

Please correct me or complement the above :slightly_smiling_face:

Tiboris commented 1 year ago

@dav-pascual thanks for review and questions.

  • await asyncio.sleep(cooldown) in strategy_retry: In this case we haven't been able to provision all resources despite the previous waits, so we probably are in a race condition with concurrent runs, and thus we free all resources and try again in cooldown minutes.

I would like to correct this one: the actual cooldown is now in seconds; this can be changed if needed and if it makes sense.

As an answer to other questions:

With concurrent runs it is hard to say how many hosts, or what percentage of them, will be provisioned; I am afraid this is very inexact here. Removing all of them is destructive, I agree, but it helps the jobs in the case where every run has only one host provisioned (also a speculative statement, with a small chance of reproducing).
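
A rough sketch of that retry flow (hypothetical names, not the actual strategy_retry code; the cooldown is in seconds, as corrected above):

```python
import asyncio

# Hypothetical sketch only; provider.provision/delete_hosts are assumed
# names and do not reflect the actual mrack API.
async def provision_with_retry(provider, hosts, cooldown, attempts=3):
    for _ in range(attempts):
        active, errored = await provider.provision(hosts)
        if not errored:
            return active
        # Likely racing with concurrent mrack runs: free everything that was
        # provisioned so far, back off, then try the whole request again.
        await provider.delete_hosts(active)
        await asyncio.sleep(cooldown)
    raise RuntimeError("provisioning failed after all retry attempts")
```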

I will use your very nicely formulated assumptions, which are basically 99% correct, and add them as comments to the code.

Again thanks!

dav-pascual commented 1 year ago

@Tiboris Thanks for adding the comments :slightly_smiling_face:

Your answer makes sense and the rest looks good, so I will approve the PR.

Tiboris commented 1 year ago

Thanks for the review @dav-pascual