cht-healthcheck stays down after ConnectionResetError

Hareet commented 12 months ago

Describe the bug: cht-healthcheck stays down after running into a ConnectionResetError when couchdb.1 goes down for an extended period of time. As a result, cht-haproxy thinks the CouchDB cluster is down and tells API there is no connection to a server.

To Reproduce: I have seen this twice in recent production downtime for medic-hosted projects. You can ping me for which specific projects. I'll try to recreate it locally, time permitting, but I suspect the steps to reproduce would be:

Take down couchdb.1 for a while
cht-healthcheck should run into the error in the logs below

Expected behavior: cht-healthcheck should continue trying to connect, or be more explicit that the container is crashing.
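For illustration only, the "continue trying" behavior could look roughly like the sketch below. It assumes a polling-style check, which may not match how /app/check.py is actually structured. `poll_forever`, `fake_check` and the `max_rounds` / `sleep` parameters are hypothetical names that exist only so the example can run without a real CouchDB.

```python
import time


def poll_forever(check, report, interval=1.0, max_rounds=None, sleep=time.sleep):
    """Call check() repeatedly and report 'up' or 'down'; an OSError never stops the loop."""
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        try:
            status = 'up' if check() else 'down'
        except OSError:
            # ConnectionResetError is an OSError subclass: report 'down' and keep polling.
            status = 'down'
        report(status)
        rounds += 1
        sleep(interval)


# Hypothetical check that fails once with a connection reset and then recovers.
outcomes = iter([True, ConnectionResetError(104, 'Connection reset by peer'), True])


def fake_check():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


seen = []
poll_forever(fake_check, seen.append, max_rounds=3, sleep=lambda _: None)
print(seen)
```

The point of the sketch is only that a failed round is reported as down and the next round still runs, instead of the failure ending the loop.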

Logs

Sending response: Up (Couchdb cluster okay) ...
Closing connection ...
Client disconnected
Exception ignored in thread started by: <function threaded at 0x7fa6d54fdea0>
Traceback (most recent call last):
  File "/app/check.py", line 63, in threaded
    send_up_response(conn)
  File "/app/check.py", line 28, in send_up_response
    conn.send(b'up\n')
ConnectionResetError: [Errno 104] Connection reset by peer
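The traceback shows `conn.send` raising inside the handler thread because the client had already gone away. This thread does not establish whether that exception is what keeps the service down. Purely as a sketch, a send guarded against a reset could look like the following, where `send_status` and `ResettingConn` are hypothetical names and not taken from check.py:

```python
import socket


def send_status(conn, is_up):
    """Send the up/down line; return False if the peer has already gone away."""
    payload = b'up\n' if is_up else b'down\n'
    try:
        conn.sendall(payload)
        return True
    except OSError as err:
        # ConnectionResetError and BrokenPipeError are both OSError subclasses.
        print(f'Client went away before the response was sent: {err!r}')
        return False
    finally:
        # Always release the socket, whether or not the send succeeded.
        conn.close()


class ResettingConn:
    """Hypothetical stand-in for a socket whose peer has reset the connection."""
    closed = False

    def sendall(self, data):
        raise ConnectionResetError(104, 'Connection reset by peer')

    def close(self):
        self.closed = True


# Failure path: the reset is caught and the connection is still closed.
fake = ResettingConn()
result = send_status(fake, is_up=True)

# Normal path over a real local socket pair.
left, right = socket.socketpair()
ok = send_status(left, is_up=False)
reply = right.recv(16)
right.close()
```

With this shape the handler thread ends cleanly when the client disconnects early, rather than with an "Exception ignored in thread" message.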

Environment

Instance: not sure if I should put it here, so please ping me if it matters
App: cht-healthcheck
Version: 4.1.0 multi-couchdb on distributed nodes running docker compose in a swarm network

dianabarsan commented 12 months ago

Thanks for submitting, @Hareet. Do you have a rough idea of how long an "extended period" is? Could it be that the node doesn't correctly rejoin the swarm? Does healthcheck "cache" DNS like we've seen nginx / haproxy do?

dianabarsan commented 10 months ago

I've just seen this in two live single-node instances:

Sending response: Up (Couchdb cluster okay) ...
Closing connection ...
Client disconnected
Connected to:172.30.0.3:52282
Everything is fine
[b'up\n']
Sending response: Up (Couchdb cluster okay) ...
Closing connection ...
Client disconnected
Exception ignored in thread started by: <function threaded at 0x7f9aa453df30>
Traceback (most recent call last):
  File "/app/check.py", line 61, in threaded
    send_down_response(conn)
  File "/app/check.py", line 23, in send_down_response
    conn.send(b'down\n')
ConnectionResetError: [Errno 104] Connection reset by peer

CouchDB was up and healthcheck was running, but haproxy was no longer querying it. Restarting haproxy and healthcheck fixed the issue in both instances.

garethbowen commented 8 months ago

@nydr With #8813 merged, can this be closed now?

nydr commented 8 months ago

@garethbowen I have not been able to reproduce this or similar issues with the new version. @Hareet, could you see if you're able to reproduce the issue with the main branch of cht-core?

jkuester commented 8 months ago

Just checking in on this issue as it is one of the 4 open issues remaining for the 4.6.0 Milestone. Can we close this out? Should we just drop it from the milestone?

garethbowen commented 8 months ago

The code is merged. Let's close this and reopen if we hit it again.

medic / cht-core

cht-healthcheck stays down after ConnectionResetError #8644