Workers disconnect after an unknown period of time

smacfarlane commented 4 years ago

After some amount of time, workers stop responding to new jobs. This has only been observed on Windows and Kernel2 workers.
The observed behavior is that a job remains in the Dispatching state until the cfg.job_timeout period elapses and is then cancelled.
The worker is still connected and we see heartbeats continue. It also remains present in the metrics dashboard. Our heartbeat channel is separate from our job dispatch channel.

Currently, the remediation is to restart the builder-worker service on affected build nodes.

It appears that the zmq::ROUTER socket is no longer transmitting messages to the client. It is a known zmq pattern that if a client connects to a ROUTER socket but does not send heartbeats, the connection may time out and the server won't be able to reconnect. We suspect we need to send KEEPALIVES as described in https://zguide.zeromq.org/docs/chapter4/#Heartbeating to keep the channel alive.
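
For illustration, here is a minimal sketch of the liveness counting that the zguide's heartbeating section describes, written from the worker's point of view. It is not the builder-worker code: `Liveness`, `Action`, and the `HEARTBEAT_LIVENESS` value are hypothetical names and numbers, and the actual socket I/O is left out so that only the counting logic is shown.

```rust
/// Hypothetical setting: how many silent intervals are tolerated before the
/// peer is presumed gone. Not taken from the builder code base.
const HEARTBEAT_LIVENESS: u32 = 3;

/// What the caller should do after one heartbeat interval has passed.
#[derive(Debug, PartialEq)]
enum Action {
    /// Channel still looks alive: send our own heartbeat and keep going.
    SendHeartbeat,
    /// Peer has been silent too long: close the socket and reconnect.
    Reconnect,
}

struct Liveness {
    remaining: u32,
}

impl Liveness {
    fn new() -> Self {
        Liveness { remaining: HEARTBEAT_LIVENESS }
    }

    /// Call once per heartbeat interval. `heard_from_peer` is true if any
    /// message (job or heartbeat) arrived on the channel during the interval.
    fn tick(&mut self, heard_from_peer: bool) -> Action {
        if heard_from_peer {
            // Any traffic proves the channel works, so reset the counter.
            self.remaining = HEARTBEAT_LIVENESS;
            return Action::SendHeartbeat;
        }
        // `remaining` is always at least 1 here because it is reset
        // whenever it reaches zero.
        self.remaining -= 1;
        if self.remaining == 0 {
            self.remaining = HEARTBEAT_LIVENESS;
            Action::Reconnect
        } else {
            Action::SendHeartbeat
        }
    }
}
```

In this sketch the caller would poll the dispatch socket with a timeout equal to the heartbeat interval and pass the result to `tick`. The point of the pattern is that the heartbeats travel over the same channel as the jobs, so silence on that channel is detected even if a separate heartbeat channel is still healthy.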

An alternate implementation would be to send jobs to workers to keep them alive. The downside is that we don't know how frequently we would need to dispatch to keep them alive.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

mwrock commented 2 years ago

Jobsrv reports:

Dec 02 21:19:29 ip-10-0-0-100 hab[596]: builder-jobsrv.acceptance(O): [2021-12-02T21:19:29Z WARN  habitat_builder_jobsrv::server::worker_manager] Failed to dispatch job to worker 3967@ip-10-0-0-192, err=Zmq(Host unreachable)

pozsgaic commented 2 years ago

This problem can be reproduced on a Linux target as well.

If you stop the builder-worker immediately after it sends its heartbeat (30s period by default) and then launch a job from builder, the problem reproduces consistently. The worker was shut down cleanly with hab svc stop.
The job status transitions to CancelComplete after the job timeout (60 minutes) is reached.
The project status is set to Canceled.
The group status is set to Canceled. It remained at 'Dispatching' until the timeout was reached.
One minor difference is that in this case the heartbeats stopped when we shut down the service. Despite this, the handling of the job was the same.

pozsgaic commented 2 years ago

Forcing down the builder-worker with "sudo kill -9" exhibits different behavior because the hab supervisor will restart the service when it does not get shut down cleanly.
The job does get submitted successfully and is in a Pending state, assigned to the original worker that we forced down.
The job gets dispatched correctly to the new worker that was restarted by hab supervisor and builds successfully. The worker field gets changed to the new worker that has picked up the job.

Note also that the builder database has a table 'busy_workers' that shows the active builder workers. When a worker instance goes down while in the busy state, its failure to send a heartbeat will result in this worker being removed from jobsrv and from the busy_workers table. The job will transition to a pending state where it will remain until a new worker for the target (e.g. x86_64-linux) becomes available or the job timeout (60 minutes default) is reached.
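
To make the transitions above concrete, here is a rough model of them. This is a sketch of the observed behavior only, not the jobsrv implementation: `Scheduler`, `Job`, `JobState::Dispatched`, and the method names are hypothetical, and the in-memory `busy_workers` map merely stands in for the database table of the same name.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Default job timeout mentioned in this thread (60 minutes).
const JOB_TIMEOUT: Duration = Duration::from_secs(60 * 60);

#[derive(Debug, Clone, Copy, PartialEq)]
enum JobState {
    Pending,
    Dispatched, // hypothetical name for "running on a worker"
    CancelComplete,
}

struct Job {
    target: String,         // e.g. "x86_64-linux"
    state: JobState,
    worker: Option<String>, // last worker the job was assigned to
}

#[derive(Default)]
struct Scheduler {
    jobs: HashMap<u64, Job>,
    busy_workers: HashMap<String, u64>, // worker name -> job id
}

impl Scheduler {
    /// Hand a pending job to a worker that builds for the same target.
    fn dispatch(&mut self, job_id: u64, worker: &str, worker_target: &str) -> bool {
        let job = match self.jobs.get_mut(&job_id) {
            Some(job) => job,
            None => return false,
        };
        if job.state != JobState::Pending || job.target != worker_target {
            return false;
        }
        job.state = JobState::Dispatched;
        job.worker = Some(worker.to_string());
        self.busy_workers.insert(worker.to_string(), job_id);
        true
    }

    /// A busy worker missed its heartbeat: drop it from busy_workers and
    /// put its job back to Pending. The worker field keeps the old name
    /// until a new worker picks the job up.
    fn on_worker_expired(&mut self, worker: &str) -> Option<u64> {
        let job_id = self.busy_workers.remove(worker)?;
        if let Some(job) = self.jobs.get_mut(&job_id) {
            job.state = JobState::Pending;
        }
        Some(job_id)
    }

    /// Cancel a job that has waited in Pending for the full job timeout.
    fn cancel_if_timed_out(&mut self, job_id: u64, waited: Duration) -> bool {
        if let Some(job) = self.jobs.get_mut(&job_id) {
            if job.state == JobState::Pending && waited >= JOB_TIMEOUT {
                job.state = JobState::CancelComplete;
                return true;
            }
        }
        false
    }
}
```

The model keeps the old worker name on the job after `on_worker_expired`, which matches the observation that a pending job still shows the original worker until the restarted one picks it up.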

pozsgaic commented 2 years ago

It is not clear how long it takes before a worker stops responding to job requests. It is seemingly a long time, because this issue is fairly rare and was not reproducible with sample code of a ROUTER socket server with two DEALER sockets connected. These apps did not lose connection for the many hours they ran, and this is without keep alive configured on the ROUTER end.
According to the zmq online docs, we want to either maintain a heartbeat over the same channel we send data or set the socket to keep alive when we establish our listener ROUTER socket in builder-jobsrv. We currently maintain connections to the jobsrv instances in both the heartbeat manager and the main server in builder-worker.
If we want to go the heartbeat route, it would be best to remove the heartbeat socket and have the heartbeats go over the job dispatch channel. Then if we have a heartbeat timeout we receive it in the job dispatch channel and remove the worker. It will not appear in the worker list until jobsrv receives a new heartbeat. Also, we would want to ensure the heartbeats are of lower priority if we move to the job dispatch channel so as not to interfere with the jobs. Also we would count a job status message as a successful heartbeat and advance the timeout on reception.
If we want to go the keep alive route, then we would enable it with the set_tcp_keepalive call when we create the ROUTER socket.
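
As a rough sketch of the heartbeat route on the jobsrv side, the bookkeeping could look like the following. `WorkerRegistry` and its methods are hypothetical names, not existing builder types, and the 90 second timeout in the usage note is only an assumption (three of the default 30s heartbeat periods). Every message received on the dispatch channel, heartbeat or job status alike, advances the worker's timeout.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks when each worker was last heard from on the job dispatch channel.
struct WorkerRegistry {
    timeout: Duration,
    last_seen: HashMap<String, Instant>,
}

impl WorkerRegistry {
    fn new(timeout: Duration) -> Self {
        WorkerRegistry { timeout, last_seen: HashMap::new() }
    }

    /// Call for every message from a worker: a heartbeat and a job status
    /// message both count as proof of life. A worker that was removed
    /// earlier reappears here as soon as it is heard from again.
    fn record_message(&mut self, worker: &str, now: Instant) {
        self.last_seen.insert(worker.to_string(), now);
    }

    /// Remove and return (sorted by name) every worker that has been
    /// silent for longer than the timeout.
    fn expire(&mut self, now: Instant) -> Vec<String> {
        let timeout = self.timeout;
        let mut expired: Vec<String> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.saturating_duration_since(**seen) > timeout)
            .map(|(name, _)| name.clone())
            .collect();
        for name in &expired {
            self.last_seen.remove(name);
        }
        expired.sort();
        expired
    }

    /// Only workers still in the registry should be offered new jobs.
    fn is_available(&self, worker: &str) -> bool {
        self.last_seen.contains_key(worker)
    }
}
```

With `WorkerRegistry::new(Duration::from_secs(90))`, jobsrv would call `record_message` from its receive loop and `expire` on a timer, then skip any worker for which `is_available` is false. The sketch does not address giving heartbeats a lower priority than job messages.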

pozsgaic commented 2 years ago

Rust does support setting up keep alive, and this would result in the ROUTER socket in builder-jobsrv continuing to test for connectivity. What I am not clear on is how we will know when the client disconnects. We want to know immediately if a client is disconnected so we can ensure it is no longer in the builder-jobsrv worker list.

With heartbeats at the application layer, we are sending from the builder-worker instance to the builder-jobsrv instance. The absence of heartbeats results in a disconnected state and ultimately in the worker instance being removed from the worker list. This is desirable because we will know within one heartbeat period if we've lost a client connection.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

habitat-sh / builder

Workers disconnect after an unknown period of time #1530