habitat-sh / builder

Habitat Builder
Apache License 2.0

Workers disconnect after an unknown period of time #1530

Open smacfarlane opened 4 years ago

smacfarlane commented 4 years ago

After some amount of time, workers stop responding to new jobs. This has only been observed on Windows and Kernel2 workers.
The observed behavior is that a job remains in the Dispatching state until the cfg.job_timeout period elapses and is then cancelled.
The worker is still connected and we see its heartbeats continue. It also remains present in the metrics dashboard. Our heartbeat channel is separate from our job dispatch channel.

Currently, the remediation is to restart the builder-worker service on affected build nodes.

It appears that the zmq::ROUTER socket is no longer transmitting messages to the client. It is a known zmq pattern that if a client connects to a ROUTER socket but does not send heartbeats, the connection may time out and the server won't be able to reconnect. We suspect we need to send keepalives on the dispatch channel, as described in https://zguide.zeromq.org/docs/chapter4/#Heartbeating, to keep it alive.
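To make the heartbeating idea concrete, here is a minimal worker-side sketch using only the Rust standard library. It is not builder's code: `Liveness`, `Action`, the interval, and the `max_missed` threshold are all hypothetical names and values, and the actual socket send and reconnect calls are left out. It only shows the bookkeeping the guide describes: any inbound frame counts as proof of life, a keepalive goes out once per interval, and a run of silent intervals triggers a reconnect.

```rust
use std::time::{Duration, Instant};

/// What the worker should do after checking the dispatch channel.
#[derive(Debug, PartialEq)]
enum Action {
    /// Nothing is due yet.
    Idle,
    /// Our heartbeat interval elapsed; send a keepalive to jobsrv.
    SendKeepalive,
    /// Too many silent intervals; tear down and reconnect the socket.
    Reconnect,
}

/// Hypothetical liveness tracker for the job dispatch channel.
struct Liveness {
    interval: Duration,     // assumed heartbeat interval
    max_missed: u32,        // silent intervals tolerated before reconnecting
    last_received: Instant, // last time any frame arrived from jobsrv
    last_sent: Instant,     // last time we sent a keepalive
}

impl Liveness {
    fn new(interval: Duration, max_missed: u32, now: Instant) -> Self {
        Liveness { interval, max_missed, last_received: now, last_sent: now }
    }

    /// Any inbound frame (a job or a keepalive) proves the peer is alive.
    fn on_receive(&mut self, now: Instant) {
        self.last_received = now;
    }

    /// Called from the worker's poll loop to decide what to do next.
    fn poll(&mut self, now: Instant) -> Action {
        if now.duration_since(self.last_received) >= self.interval * self.max_missed {
            // Start counting afresh once the caller has reconnected.
            self.last_received = now;
            self.last_sent = now;
            return Action::Reconnect;
        }
        if now.duration_since(self.last_sent) >= self.interval {
            self.last_sent = now;
            return Action::SendKeepalive;
        }
        Action::Idle
    }
}
```

A poll loop would call `on_receive` whenever the dispatch socket yields a frame and act on the result of `poll` otherwise. Passing `now` in explicitly keeps the logic testable without sleeping.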

An alternate implementation would be to send jobs to workers to keep them alive. The downside is that we don't know how frequently we would need to dispatch to keep them alive.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

mwrock commented 2 years ago

Jobsrv reports:

Dec 02 21:19:29 ip-10-0-0-100 hab[596]: builder-jobsrv.acceptance(O): [2021-12-02T21:19:29Z WARN  habitat_builder_jobsrv::server::worker_manager] Failed to dispatch job to worker 3967@ip-10-0-0-192, err=Zmq(Host unreachable)

pozsgaic commented 2 years ago

This problem can be reproduced on a Linux target as well.

pozsgaic commented 2 years ago

Note also that the builder database has a table 'busy_workers' that shows the active builder workers. When a worker instance goes down while in the busy state, its failure to send a heartbeat will result in this worker being removed from jobsrv and from the busy_workers table. The job will transition to a pending state where it will remain until a new worker for the target (e.g. x86_64-linux) becomes available or the job timeout (60 minutes default) is reached.
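The transitions described above can be sketched as a small state machine. This is an illustration only: the `JobState` and `Event` names below are hypothetical and do not match the actual jobsrv job states, and the sketch assumes the timeout is measured as total elapsed time for the job.

```rust
use std::time::Duration;

/// Simplified, hypothetical job states (not the real jobsrv enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobState {
    Running,
    Pending,
    Dispatched,
    TimedOut,
}

/// Hypothetical events that jobsrv could observe for a job.
enum Event<'a> {
    /// The busy worker missed its heartbeat and was removed.
    WorkerLost,
    /// A worker for the given target became available.
    WorkerAvailable(&'a str),
    /// Time elapsed since the job was queued.
    Elapsed(Duration),
}

struct Job {
    target: String,    // e.g. "x86_64-linux"
    state: JobState,
    timeout: Duration, // 60 minutes by default, per the comment above
}

impl Job {
    fn apply(&mut self, event: Event<'_>) {
        self.state = match (self.state, event) {
            // Losing the worker sends the job back to pending.
            (JobState::Running, Event::WorkerLost) => JobState::Pending,
            // Only a worker for the same target can pick the job up.
            (JobState::Pending, Event::WorkerAvailable(t)) if t == self.target => {
                JobState::Dispatched
            }
            // Otherwise the job waits until the timeout is reached.
            (JobState::Pending, Event::Elapsed(d)) if d >= self.timeout => JobState::TimedOut,
            // Every other combination leaves the state unchanged.
            (state, _) => state,
        };
    }
}
```

For example, a `Running` job for `x86_64-linux` that receives `WorkerLost` becomes `Pending`, ignores a `WorkerAvailable("x86_64-windows")` event, and moves to `Dispatched` only when a Linux worker shows up before the timeout.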

pozsgaic commented 2 years ago

Rust does support setting up keepalive, and this would result in the ROUTER socket in builder-jobsrv continuing to test for connectivity. What I am not clear on is how we will know when the client disconnects. We want to know immediately if a client is disconnected so we can ensure it is no longer in the builder-jobsrv worker list.

Our heartbeats work at the application layer: the builder-worker instance sends them to the builder-jobsrv instance. The absence of heartbeats results in a disconnected state and ultimately in the worker instance being removed from the worker list. This is desirable because we will know within one heartbeat if we've lost a client connection.
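As a rough illustration of that application-layer approach, the sketch below keeps a last-seen timestamp per worker and drops any worker whose heartbeat is overdue. It is not the jobsrv implementation: `WorkerList`, its methods, and the expiry value are hypothetical, and the worker ident format is simply borrowed from the log line earlier in this thread.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical server-side worker list keyed by worker ident.
struct WorkerList {
    expiry: Duration, // assumed: how long a worker may stay silent
    last_seen: HashMap<String, Instant>,
}

impl WorkerList {
    fn new(expiry: Duration) -> Self {
        WorkerList { expiry, last_seen: HashMap::new() }
    }

    /// Record a heartbeat, adding the worker if it is not yet known.
    fn heartbeat(&mut self, ident: &str, now: Instant) {
        self.last_seen.insert(ident.to_string(), now);
    }

    fn is_known(&self, ident: &str) -> bool {
        self.last_seen.contains_key(ident)
    }

    /// Remove every worker whose last heartbeat is older than `expiry`
    /// and return the removed idents, sorted for stable output.
    fn expire(&mut self, now: Instant) -> Vec<String> {
        let expiry = self.expiry;
        let mut gone: Vec<String> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.duration_since(**seen) > expiry)
            .map(|(ident, _)| ident.clone())
            .collect();
        for ident in &gone {
            self.last_seen.remove(ident);
        }
        gone.sort();
        gone
    }
}
```

Calling `expire` once per heartbeat interval, with `expiry` set slightly above that interval, gives detection within roughly one heartbeat. Any jobs assigned to the returned idents would then need to be requeued.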

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.