ocurrent / solver-service

An OCluster service for solving opam dependencies
Apache License 2.0

Fix bug: 'Bad frame from worker' and 'Channel_closed' #61

Closed moyodiallo closed 1 year ago

moyodiallo commented 1 year ago

This bug occurs when the pool reuses an internal-worker (worker process) that is about to terminate: a kill signal has already been sent to it because its job was cancelled, but the operating system takes some time to actually kill a process after the signal is sent.

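The 'Bad frame from worker' error occurs when the controller tries to read from such an internal-worker (released, but still being terminated) after part of its output was already read during the previous use, so what the controller reads next is leftover output rather than a valid frame.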

There is also a failure with Lwt_io.Channel_closed: the controller starts reading when the internal-worker has already been killed by the OS and all of its channels are closed.

The fix gives each internal-worker explicit states, so that the pool can tell a reusable worker from one that is being terminated, which prevents both bugs.
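To illustrate the idea of worker states, here is a minimal sketch. It is not the code from this PR: the type `state`, the record `worker`, and the functions `acquire`, `release`, `cancel` and `reaped` are hypothetical names chosen for the example, and the real pool is Lwt-based, whereas this sketch uses only the standard library.

```ocaml
(* Hypothetical sketch: none of these names are taken from solver-service. *)
type state =
  | Idle         (* ready to accept a job *)
  | Busy         (* currently running a job *)
  | Terminating  (* kill signal sent, process not yet gone *)
  | Dead         (* process has exited, its channels are closed *)

type worker = { id : int; mutable state : state }

(* Cancelling a job sends a kill signal; the worker must not be reused. *)
let cancel w =
  match w.state with
  | Busy -> w.state <- Terminating
  | Idle | Terminating | Dead -> ()

(* Called once the OS has actually terminated the process. *)
let reaped w = w.state <- Dead

(* Hand out only idle workers, never terminating or dead ones. *)
let acquire pool =
  match List.find_opt (fun w -> w.state = Idle) pool with
  | Some w -> w.state <- Busy; Some w
  | None -> None

(* Releasing a worker makes it reusable only if it was not cancelled. *)
let release w =
  match w.state with
  | Busy -> w.state <- Idle
  | Idle | Terminating | Dead -> ()
```

With states like these, releasing a cancelled worker leaves it in `Terminating`, so a later `acquire` skips it instead of reading from a process that is about to die or whose channels are already closed.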

Some examples from OCaml-CI:

2023-06-09 11:57.17: Job failed: Error from solver: Failed: Bad frame from worker: time="    Rejected candidates:" len="      deployer.dev: Requires ocaml >= 4.13.0"

https://ocaml.ci.dev:8100/job/2023-06-09/113534-ci-analyse-74e6c8

Lwt_io.Channel_closed("output")
Lwt_io.Channel_closed("output")
2023-06-07 15:10.56: Job failed: Error from solver: Failed: Build failed

https://ocaml.ci.dev:8100/job/2023-06-07/144726-ci-analyse-02c8c8