This bug occurs when the pool reuses an internal-worker (worker process) that is about to terminate: a kill signal has already been sent to it because its job was cancelled, but the operating system takes some time to actually kill a process after the signal is sent.
The 'Bad frame from worker' error occurs when the controller tries to read from such an internal-worker (released but still being terminated). Part of that worker's output was already read during its previous use, so the next read picks up leftover output rather than a well-formed frame.
There is also a failure currently seen with Lwt_io.Channel_closed: the controller can also start reading after the OS has killed the internal-worker and all its channels are closed.
The fix gives an internal-worker distinct states, so that the pool does not hand out a worker that is being terminated or is already dead.
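As a rough illustration of the idea (not the code in this PR), the sketch below tracks a state per worker and only lets the pool hand out idle ones. All names here (state, worker, acquire, release, cancel, reaped) are hypothetical, and the real implementation works with Lwt and actual processes rather than plain records.

```ocaml
(* Minimal sketch with hypothetical names; no processes are spawned here. *)
type state =
  | Idle         (* in the pool and safe to hand out *)
  | Busy         (* currently serving a request *)
  | Terminating  (* kill signal sent, the OS may not have killed it yet *)
  | Dead         (* process gone, channels closed *)

type worker = { id : int; mutable state : state }

let create id = { id; state = Idle }

(* Hand out the first idle worker, if any. Workers that are being
   terminated or are dead are skipped, so they are never read from. *)
let acquire pool =
  match List.find_opt (fun w -> w.state = Idle) pool with
  | Some w -> w.state <- Busy; Some w
  | None -> None

(* A job finished: only a busy worker goes back to Idle. A worker that
   was cancelled in the meantime keeps its Terminating/Dead state. *)
let release w =
  match w.state with
  | Busy -> w.state <- Idle
  | Idle | Terminating | Dead -> ()

(* A job was cancelled: record the state when the kill signal is sent,
   without waiting for the OS to finish killing the process. *)
let cancel w =
  match w.state with
  | Idle | Busy -> w.state <- Terminating
  | Terminating | Dead -> ()

(* The process has actually exited. *)
let reaped w = w.state <- Dead

let () =
  let a = create 1 and b = create 2 in
  let pool = [ a; b ] in
  ignore (acquire pool);   (* a becomes Busy *)
  cancel a;                (* kill signal sent to a *)
  release a;               (* a stays Terminating, not Idle *)
  (match acquire pool with
   | Some w -> Printf.printf "reused worker %d\n" w.id
   | None -> print_endline "no idle worker");
  reaped a
```

With this shape, the release that follows a cancellation cannot put the worker back into circulation, which is the situation described above for both 'Bad frame from worker' and Lwt_io.Channel_closed.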
Some examples from OCaml-CI:
2023-06-09 11:57.17: Job failed: Error from solver: Failed: Bad frame from worker: time=" Rejected candidates:" len=" deployer.dev: Requires ocaml >= 4.13.0"
https://ocaml.ci.dev:8100/job/2023-06-09/113534-ci-analyse-74e6c8
https://ocaml.ci.dev:8100/job/2023-06-07/144726-ci-analyse-02c8c8