Proposal: Increase default wait time for dependencies

davidkopp commented 4 months ago

I'm currently adding the healthcheck feature to my microservices reference application. Compose file: https://github.com/t2-project/devops/blob/main/energy-tests/gmt/microservices-compose.yml

Unfortunately, the configured 20 second wait time for dependencies is not sufficient on my machine. The inventory service (Spring Boot) needs more than 20 seconds to become healthy.

Of course on my local machine I can change the config on my own. However, that would not be possible using the measurement cluster. An workaround is to use a sleep in setup-command, how I did it already in the past without the healthcheck feature.

I think it makes sense to increase the configured default wait time for dependencies, because 20 seconds seems to be too short for some Spring Boot applications in a microservices system to become healthy.

ArneTR commented 4 months ago

Although doing this can be done I think it is a patch which might not finally solve the issue.

The way to implement what you are shooting for is a proper healthcheck including a health-cmd

The implementation atm only supports interval which will make the container being polled constantly and just create measurement noise.

At the time when I implemented the feature the docker engine did not support start-interval. Now, with Docker Engine 25 it seems this is now possible to do! (Compare comment here: https://docs.docker.com/reference/dockerfile/#healthcheck on start interval)

You would set the start interval then on a very short period and the normal interval to an extremely large number to disable completely.

If that is what would solve your case I would prefer this implementation to setting the wait timeout just to a longer value.

What Docker Engine version do you have in your test setup for instance? I am currently unsure about which versions are rolled out to Linux package registries ...

Our Ubuntu 22.04 machine on the cluster has 26.1.3

davidkopp commented 4 months ago

That's good to know and using start-interval would be definitely a preferable way of using the healthcheck during startup. On my Windows machine Docker is running in version 26.1.1.

However, only the implementation of start-interval would not solve the issue targeted by this pull request. Also the config wait_time_dependencies has to be changed (or the implementation that uses it). With the current implementation, GMT waits at maximum the configured amount of time for a dependent service until it should be healthy, independently of the defined interval (or start-interval) in the service definition. For example, if wait_time_dependencies is set to 20 seconds (currently the default) and interval is set to 30 seconds, GMT will raise an exception that the service was not healthy after 20 seconds.

For a quick improvement I think it makes sense to increase the wait_time_dependencies to e.g. 60 seconds. As another improvement it could be set to the value defined in interval or start-interval, if it is defined and higher than the configured value in wait_time_dependencies.

For reference: I have introduced this config parameter wait_time_dependencies in https://github.com/green-coding-solutions/green-metrics-tool/pull/568.

ArneTR commented 4 months ago

I do agree this needs a bigger PR to implement the functionality. What I see as a possible solution is that GMT either waits for wait_time_dependencies for a fixed time or really uses the healthcheck mechanism with the given retries and health-cmd.

I think what you are really asking for now is that we increase this limit on our configuration for the cluster, right? This does not need a PR as it is a configuration option anyway. Although I do see the case of making 60 the default.

So two things:

Which machine are you currently using and where should we increase this limit?
Can you also open an Issue or maybe a draft PR if you already have some healthcheck code for start-interval so we can keep track of this functionality?

Will merge this PR after

davidkopp commented 4 months ago

I have created the PR that implements the start-interval parameter: https://github.com/green-coding-solutions/green-metrics-tool/pull/783

I'm not sure yet if it's necessary to change the value for wait_time_dependencies on one of the machines in the cluster. If I would need it, it would be the machine "Fujitsu Esprimo P956 (Blue Angel compatible)".

ArneTR commented 4 months ago

I will merge this for now as the reasoning is very sound.

I will not increase it on th cluster for now. But if you need it let me know.

For anybody else reading this PR, please make a comment, so we know it is relevant for you

green-coding-solutions / green-metrics-tool

Proposal: Increase default wait time for dependencies #781