Closed Hareet closed 8 months ago
Thanks for submitting @Hareet . Do you have a rough idea of what "significant period" is? Can it be the node doesn't correctly rejoin the swarm? Does healthcheck "cache" dns like we've seen nginx / haproxy do?
I've just seen this in two live single-node instances:
Sending response: Up (Couchdb cluster okay) ...
Closing connection ...
Client disconnected
Connected to:172.30.0.3:52282
Everything is fine
[b'up\n']
Sending response: Up (Couchdb cluster okay) ...
Closing connection ...
Client disconnected
Exception ignored in thread started by: <function threaded at 0x7f9aa453df30>
Traceback (most recent call last):
File "/app/check.py", line 61, in threaded
send_down_response(conn)
File "/app/check.py", line 23, in send_down_response
conn.send(b'down\n')
ConnectionResetError: [Errno 104] Connection reset by peer
CouchDB was up and the healthcheck process was still running, but haproxy was no longer querying it. Restarting haproxy and the healthcheck fixed the issue in both instances.
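The traceback above shows the handler thread dying on `conn.send(b'down\n')` when the peer resets the connection. A minimal sketch of a more defensive handler (the names `threaded` and `send_down_response` come from the traceback; the rest is assumed, not the actual check.py code) would swallow per-connection errors so one reset probe cannot take the healthcheck down:

```python
import socket
import threading


def send_down_response(conn: socket.socket) -> None:
    # Name taken from the traceback above; body is a guess at the responder.
    conn.send(b'down\n')


def threaded(conn: socket.socket) -> None:
    # Hypothetical handler sketch: a reset from a single client (e.g. haproxy
    # closing its probe early) is expected and must not kill the thread or
    # leave the listener in a broken state.
    try:
        send_down_response(conn)
    except (ConnectionResetError, BrokenPipeError):
        pass  # peer went away mid-response; not fatal for the healthcheck
    finally:
        conn.close()
```

With this shape, the next probe from haproxy gets a fresh connection and a fresh answer regardless of what happened to the previous one.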
@nydr With #8813 merged, can this be closed now?
@garethbowen I have not been able to reproduce this or similar issues with the new version. @Hareet could you see if you're able to reproduce the issue with the main branch of cht-core?
Just checking in on this issue as it is one of the 4 open issues remaining in the 4.6.0 milestone. Can we close this out? Should we just drop it from the milestone?
The code is merged. Let's close this and reopen if we hit it again.
Describe the bug cht-healthcheck stays down after running into a ConnectionResetError when couchdb.1 goes down for an extended period of time. This trickles down to cht-haproxy thinking the couchdb cluster is down, and telling API there is no connection to a server.
To Reproduce I have seen this twice during recent production downtime for medic-hosted projects. You can ping me for the specific projects. I'll try to recreate it locally, time permitting, but I suspect the steps to reproduce would be:
Expected behavior cht-healthcheck should continue trying to connect, or be more explicit that the container is crashing.
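For the "continue trying to connect" part, a minimal sketch of a poller that treats any connection error as "down" rather than crashing (the URL, port, and function name are assumptions for illustration, not the actual check.py behavior; CouchDB does expose a GET /_up endpoint that returns 200 when the node is up):

```python
import urllib.error
import urllib.request


def couch_is_up(url: str = "http://couchdb.1:5984/_up", timeout: float = 5.0) -> bool:
    # Hypothetical poller: report down on any transient failure and let the
    # caller simply try again on the next probe, instead of letting the
    # exception escape and kill the process.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, ConnectionResetError, OSError):
        return False
```

The key property is that an extended CouchDB outage produces a long run of `False` answers, and the first successful probe after recovery flips back to `True` with no restart needed.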
Logs
Environment