lenra-io / server


[Bug]: Kubernetes build status fetch error on server restart #562

Status: Closed (shiipou closed 3 months ago)

shiipou commented 6 months ago

What happened?

When the server restarts, it tries to fetch the build status again in order to complete the build. But if the build no longer exists on Kubernetes because it has already finished (and the server never received the notification), the server throws an error at startup for each such build.

What browsers are you seeing the problem on?

No response

Version

1.5.3

Relevant log output

https://lenra-br.sentry.io/issues/4573629917/

Uncaught exit - {:noproc, {GenServer, :stop, [{:global, {Lenra.Kubernetes.Status, "591"}}, :normal, :infinity]}}
  File "lib/gen_server.ex", line 977, in GenServer.stop/3
  File "lib/lenra_web/controllers/runner_controller.ex", line 24, in LenraWeb.RunnerController.update_build/2
  File "lib/lenra_web/controllers/runner_controller.ex", line 1, in LenraWeb.RunnerController.action/2
  File "lib/lenra_web/controllers/runner_controller.ex", line 1, in LenraWeb.RunnerController.phoenix_controller_pipeline/2
  File "lib/phoenix/router.ex", line 354, in Phoenix.Router.__call__/2
  File "lib/lenra_web/endpoint.ex", line 1, in LenraWeb.Endpoint.plug_builder_call/2
  File "lib/lenra_web/endpoint.ex", line 1, in LenraWeb.Endpoint."call (overridable 3)"/2
  File "lib/lenra_web/endpoint.ex", line 1, in LenraWeb.Endpoint.call/2

taorepoara commented 6 months ago

This error is strange, since the Kubernetes job should not be removed directly.

We might define a duration after which the job is marked as failed at server start, using the same duration as the build timeout.
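The idea above could be sketched roughly as follows. This is only an illustration: the module name `StaleBuilds`, the `:pending` and `:failure` status atoms, the `inserted_at` field, and the timeout value passed in are all assumptions, not names taken from the Lenra code base.

```elixir
defmodule StaleBuilds do
  # Hypothetical helper: decides which status a build should have when the
  # server starts. The build is assumed to be a map with a :status atom and
  # an :inserted_at DateTime; the real schema may differ.
  def status_at_startup(%{status: :pending, inserted_at: started_at}, now, timeout_seconds) do
    # A build that has been pending for longer than the build timeout
    # is considered failed instead of being polled on Kubernetes.
    if DateTime.diff(now, started_at, :second) > timeout_seconds do
      :failure
    else
      :pending
    end
  end

  # Builds that are not pending keep their current status.
  def status_at_startup(%{status: status}, _now, _timeout_seconds), do: status
end
```

At startup, the server would then only try to fetch the Kubernetes status for builds that are still `:pending` after this check.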

taorepoara commented 6 months ago

Check if the deployment timeout covers this issue

jonas-martinez commented 6 months ago

I don't think this is a problem with Kubernetes. If you go to line 24 of lib/lenra_web/controllers/runner_controller.ex, as shown in the error above, you will see that this function is called by Kubernetes through the /runner/build/:id API endpoint and that the server properly updates the build status in its database.

This means that only the Kubernetes.Status GenServer crashed or stopped prematurely, and that we just need to ignore the error raised when the server tries to stop this GenServer while it is not running.
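As a minimal sketch of that idea (not necessarily what the PR does), `GenServer.stop/3` can be wrapped so that a `:noproc` exit is treated as "already stopped". The module name `SafeStop` and the `{:error, :not_running}` return value are assumptions made for illustration.

```elixir
defmodule SafeStop do
  # Hypothetical helper: stops a GenServer if it is running, and returns
  # {:error, :not_running} instead of exiting when no such process exists.
  def stop(server, reason \\ :normal, timeout \\ :infinity) do
    GenServer.stop(server, reason, timeout)
  catch
    # GenServer.stop/3 exits with {:noproc, {GenServer, :stop, args}} when
    # the target process is not alive, as seen in the log above.
    :exit, {:noproc, _} -> {:error, :not_running}
  end
end
```

With such a wrapper, the controller could call `SafeStop.stop({:global, {Lenra.Kubernetes.Status, build_id}})`, where `build_id` is a placeholder, and ignore the result. Catching the exit rather than checking `GenServer.whereis/1` first avoids a race where the process dies between the check and the stop.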

Please see my PR to fix this issue, and don't hesitate to read the PR description for more information.

https://github.com/lenra-io/server/pull/564