Open smacfarlane opened 4 years ago
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.
Jobsrv reports:
Dec 02 21:19:29 ip-10-0-0-100 hab[596]: builder-jobsrv.acceptance(O): [2021-12-02T21:19:29Z WARN habitat_builder_jobsrv::server::worker_manager] Failed to dispatch job to worker 3967@ip-10-0-0-192, err=Zmq(Host unreachable)
This problem can be reproduced on a Linux target as well.
If you stop the builder-worker immediately after it sends its heartbeat (30s period by default) and then launch a job from builder, the problem reproduces consistently. The worker was shut down cleanly with `hab svc stop`.
The job status transitions to CancelComplete after the job timeout (60 minutes) is reached.
The project status is set to Canceled.
The group status is set to Canceled. It remained at 'Dispatching' until the timeout was reached.
One minor difference is that in this case the heartbeats stopped when we shut down the service. Despite this, the handling of the job was the same.
Note also that the builder database has a table 'busy_workers' that shows the active builder workers. When a worker instance goes down while in the busy state, its failure to send a heartbeat will result in this worker being removed from jobsrv and from the busy_workers table. The job will transition to a pending state where it will remain until a new worker for the target (e.g. x86_64-linux) becomes available or the job timeout (60 minutes default) is reached.
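The transitions described above can be sketched as a small state function. This is only an illustration of the observed behavior, not the jobsrv implementation: `JobState`, `Event`, `next_state`, and the `Dispatched` variant are hypothetical names, and the 60 minute timeout is the default quoted in this thread.

```rust
use std::time::Duration;

/// Hypothetical job states; not the actual jobsrv enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobState {
    Pending,
    Dispatched,
    CancelComplete,
}

/// Hypothetical events relevant to the scenario in this thread.
#[derive(Debug, Clone, Copy)]
enum Event {
    /// The busy worker missed its heartbeat and was removed from jobsrv.
    WorkerExpired,
    /// A worker for the job's target (e.g. x86_64-linux) became available.
    WorkerAvailable,
    /// Periodic check with no other change.
    TimerTick,
}

/// Default job timeout mentioned above (60 minutes).
const JOB_TIMEOUT: Duration = Duration::from_secs(60 * 60);

/// Returns the next state for a job of the given age.
fn next_state(state: JobState, job_age: Duration, event: Event) -> JobState {
    if state == JobState::CancelComplete {
        return state; // terminal
    }
    if job_age >= JOB_TIMEOUT {
        return JobState::CancelComplete; // timeout wins over any event
    }
    match (state, event) {
        // Losing the busy worker puts the job back in the queue.
        (JobState::Dispatched, Event::WorkerExpired) => JobState::Pending,
        // A pending job is handed to the next available worker.
        (JobState::Pending, Event::WorkerAvailable) => JobState::Dispatched,
        // Everything else leaves the state unchanged.
        (s, _) => s,
    }
}
```

In this model a job whose worker disappears sits in `Pending` until either a `WorkerAvailable` event arrives or its age reaches `JOB_TIMEOUT`.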
Regarding the `set_tcp_keepalive` call: Rust does support setting up keepalive, and this would result in the ROUTER socket in builder-jobsrv continuing to test for connectivity. What I am not clear on is how we will know when the client disconnects. We want to know immediately if a client is disconnected so we can ensure it is no longer in the builder-jobsrv worker list.
We already send heartbeats at the application layer from the builder-worker instance to the builder-jobsrv instance. The absence of heartbeats results in a disconnected state and ultimately in the worker instance being removed from the worker list. This is desirable because we will know within 1 heartbeat if we've lost a client connection.
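As a rough sketch of that heartbeat-based expiry, assuming a registry keyed by worker identity: `WorkerRegistry` and its methods are hypothetical and only use the standard library. The real worker manager in builder-jobsrv may be structured quite differently.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical registry of workers keyed by identity (e.g. "3967@ip-10-0-0-192").
struct WorkerRegistry {
    heartbeat_period: Duration,
    last_seen: HashMap<String, Instant>,
}

impl WorkerRegistry {
    fn new(heartbeat_period: Duration) -> Self {
        WorkerRegistry {
            heartbeat_period,
            last_seen: HashMap::new(),
        }
    }

    /// Record a heartbeat received from `worker` at time `now`.
    fn record_heartbeat(&mut self, worker: &str, now: Instant) {
        self.last_seen.insert(worker.to_string(), now);
    }

    /// Remove and return every worker silent for more than one heartbeat period.
    fn expire(&mut self, now: Instant) -> Vec<String> {
        let period = self.heartbeat_period;
        let mut expired: Vec<String> = self
            .last_seen
            .iter()
            .filter(|(_, seen)| now.saturating_duration_since(**seen) > period)
            .map(|(id, _)| id.clone())
            .collect();
        expired.sort(); // deterministic order for callers and logs
        for id in &expired {
            self.last_seen.remove(id);
        }
        expired
    }

    /// True while the worker is still in the worker list.
    fn is_known(&self, worker: &str) -> bool {
        self.last_seen.contains_key(worker)
    }
}
```

With a 30 second period, a worker that last reported at `t0` is still listed at `t0 + 30s` and is dropped by the first `expire` call after that. A production version would probably allow some grace beyond one period to tolerate jitter.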
After some amount of time, workers stop responding to new jobs. This has only been observed on Windows and Kernel2 workers.
The observed behavior is that a job remains in the `Dispatching` state until the `cfg.job_timeout` period elapses and is then cancelled.
The worker is connected and we see heartbeats continue. It also remains present in the metrics dashboard. Our heartbeat channel is separate from our job dispatch channel.
Currently, the remediation is to restart the `builder-worker` service on affected build nodes.
It appears that the zmq::ROUTER socket is no longer transmitting messages to the client. It is a known zmq pattern that if a client connects to a ROUTER socket but does not send heartbeats, the connection may time out and the server won't be able to reconnect. We suspect we need to send KEEPALIVES as described in https://zguide.zeromq.org/docs/chapter4/#Heartbeating to keep the channel alive.
An alternate implementation would be to send jobs to workers to keep them alive. The downside is that we don't know how frequently we would need to dispatch to keep them alive.
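For the keepalive approach, one option is for the worker to track how many keepalive intervals have passed without any traffic on the job dispatch channel, and to recreate its socket once a limit is reached. The zguide heartbeating chapter describes a liveness counter of this kind. The sketch below is a standard-library-only illustration: `Liveness` and its methods are hypothetical names, and the limit of 3 used in the example is an assumption, not a value from builder.

```rust
/// Hypothetical peer-liveness tracker for the job dispatch channel.
struct Liveness {
    max: u32,
    remaining: u32,
}

impl Liveness {
    /// `max` is the number of silent intervals tolerated before reconnecting.
    fn new(max: u32) -> Self {
        Liveness { max, remaining: max }
    }

    /// Any message from the peer (job, keepalive, reply) proves it is alive.
    fn on_message(&mut self) {
        self.remaining = self.max;
    }

    /// Call once per keepalive interval that passed with no traffic.
    /// Returns true when the caller should close and recreate its socket;
    /// the counter is then reset for the new connection.
    fn on_silent_interval(&mut self) -> bool {
        self.remaining = self.remaining.saturating_sub(1);
        if self.remaining == 0 {
            self.remaining = self.max;
            true
        } else {
            false
        }
    }
}
```

With `Liveness::new(3)`, the third consecutive silent interval returns true, and any received message in between resets the count. This only covers the bookkeeping; the actual keepalive frames would still need to be sent over the dispatch socket.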