medic / cht-sync

Data synchronization between CouchDB and PostgreSQL for the purpose of analytics.
GNU General Public License v3.0

Investigate and fix crashes in CHT Sync #132

Closed mrjones-plip closed 2 months ago

mrjones-plip commented 3 months ago

There have been two separate reports that, when CHT Sync was turned on, the CHT Core instance that couch2pg runs against experienced 500 errors. We should investigate this and:

Repurposing this ticket, but keeping the former content for continuity. This turned out to be crashes in CHT Sync; no crashes were found in CHT Core. So, with that in mind, we should:

dianabarsan commented 3 months ago

It looks like the instances worked as normal and there were no crashes.

I suspect cht-sync is crashing because requests to the CHT instances fail - it is set up to crash and recover when this happens - so this is expected behavior.
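
To illustrate the crash-and-recover pattern described above, here is a rough Node.js sketch. It is illustrative only: the variable names and exact logic are assumptions, not cht-sync's actual code.

// Illustrative sketch: an unhandled failure while polling the CHT instance
// makes the Node process exit, and the container restart policy brings it back.
const COUCH_URL = process.env.COUCH_URL; // hypothetical variable, e.g. https://<cht-host>/medic

async function pollChanges(since) {
  const res = await fetch(`${COUCH_URL}/_changes?limit=1000&since=${since}`);
  if (!res.ok) {
    throw new Error(`_changes request failed with status ${res.status}`);
  }
  // Throws "invalid json" if the proxy returns an HTML error page instead of JSON.
  return res.json();
}

pollChanges(0).catch((err) => {
  console.error(err);
  process.exit(1); // crash on purpose and let docker/k8s restart the container
});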

As of now, I don't think anyone has logs from the restarted cht-sync containers, but I did inspect some of the instances in question and found no errors or downtime.

EarthmanAman commented 3 months ago

Sample error logs

/node_modules/node-fetch/lib/index.js:273
                return Body.Promise.reject(new FetchError(`invalid json response body at ${_this2.url} reason: ${err.message}`, 'invalid-json'));
                                           ^
FetchError: invalid json response body at https://chis-training-3.health.go.ke/medic/_changes?seq_interval=1000&since=2254464-g1AAAAO1eJyV069PxDAUB_ByI8FgIAGLJ4GsW9duCjyKu-sf0NdtWZbjl8BgcPwH_AEY2DX8BwgMGncKhyIXJJJdebTIC0lT8U3avE-a99oJIWS9iUqyoc8udVPCIU3EfoyLTvBooAhsSTlumwhI9PV2gntriokMNF9W84_060jYNWb6h306LGcV5QkLxdAxcGDtwmPvvcMyoVQOwRg6FvB2I491dw7jteKMZaEYOmNojek8djH0PYOCp7oOxdCZwrW1vcc2jxxW0xRYUoVi6CxOV8lQylvHDV4-HBdXcaF1GcihNEIJ4wFn4cGbcwcyyquEhY4BpQ4ljG9rnz24t-NAnWsog-eKUo8SxrGUMweuvD45kIKmQqtwEJs3w7g3Zu7Bx20HFqpglQj9E755cwx8gdaDV9z3EERWpGJZafsDf2wFDg&limit=25 reason: Unexpected token '<', "<html>
<h"... is not valid JSON
    at /node_modules/node-fetch/lib/index.js:273:32
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async fetchJSON (/node_modules/pouchdb-adapter-http/lib/index.js:246:18)
    at async fetchData (/node_modules/pouchdb-adapter-http/lib/index.js:1029:24) {
  type: 'invalid-json'
}
Node.js v20.16.0

mrjones-plip commented 3 months ago

Thanks @EarthmanAman for the update!

I've updated this ticket to be about crashes in CHT Sync as that's where it sounds like the crashes are, both from @EarthmanAman and from internal reports.

dianabarsan commented 3 months ago

I'm logged into the echistraining3 instance and I'm seeing lots of requests for 0 changes starting from sequence 0, and some less frequent requests for 0 changes starting from a different sequence.

2024-08-13T13:22:04.378 REQ: 97343f0e-c710-4fff-94ba-1482b0c581c0 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0
2024-08-13T13:22:04.391 RES: 97343f0e-c710-4fff-94ba-1482b0c581c0 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0 200 - 12.697 ms
2024-08-13T13:22:09.854 REQ: 9a267de5-7d16-422c-b3c7-bc76229a3890 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0
2024-08-13T13:22:09.870 RES: 9a267de5-7d16-422c-b3c7-bc76229a3890 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0 200 - 14.911 ms
2024-08-13T13:22:15.334 REQ: 3e2502ec-3c2a-476e-8768-2f8066b4cf2e 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0
2024-08-13T13:22:15.351 RES: 3e2502ec-3c2a-476e-8768-2f8066b4cf2e 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0 200 - 16.170 ms
2024-08-13T13:22:20.821 REQ: a125b77b-f096-4ab5-9740-1b72d20e2a3d 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0
2024-08-13T13:22:20.836 RES: a125b77b-f096-4ab5-9740-1b72d20e2a3d 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0 200 - 14.416 ms
2024-08-13T13:22:26.299 REQ: 8a05f202-6241-4257-806e-cfd63f128ece 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0
2024-08-13T13:22:26.315 RES: 8a05f202-6241-4257-806e-cfd63f128ece 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0 200 - 14.392 ms
2024-08-13T13:22:31.779 REQ: 1dcdd947-103e-490c-8d37-4872843279de 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0
2024-08-13T13:22:31.795 RES: 1dcdd947-103e-490c-8d37-4872843279de 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0 200 - 14.026 ms
2024-08-13T13:22:37.262 REQ: f33644b0-a217-4964-88ce-5a6a612b12b0 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0
2024-08-13T13:22:37.279 RES: f33644b0-a217-4964-88ce-5a6a612b12b0 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0 200 - 16.202 ms
2024-08-13T13:26:03.604 REQ: b9a33e12-b4c9-4b7f-8c5b-bf2e42277d82 10.127.105.170 medic GET /medic/_changes?limit=0&since=3143994-g1AAAAO1eJyV0zFOwzAUBmCrqQQDS4dunIABxYljJxPcAEFridXPSZRGBRaYWbhCjwC0vgQSd-glKjEjJBL3YWeskKwMvxTLn57fs5eEkJMmKslEPzzppoRLmojzGD-6xKWRIjCVct42EZDIft3hvyPFRAaaH9rzj_TnSDgzZjNgVw7LWUV5wkIxdAxcWNs7bDw6dVgmlMohGEPHAlY3GyrzGK8VZywLxdCZQ2vM2ldGbn3PoOCprkMxdDbwbG3nsejTYTVNgSVVKIZOfz8mN1KuhoNOHBdXcaF1GcihNEMJ4x1n4cHu0YGM8iphoWNAaY0Sxq-1H_7AxwsH6lxDGTxXlDqUMK6l3PoK-1cHUtBUaBUOYvO2GG_G7Dz4s3JgoQpWidA34Zu3w8AbaD34_eJ7CCIrUnFoa7sHsIIEaQ HTTP/1.0

These seem to be coming from 3.11.14.202 and from 10.127.105.170. Can someone check whether these are the IPs that host cht-sync? Are there two cht-syncs running? cht-sync has a BATCH_SIZE property which is set to 1000 by default, but can be overridden through an environment variable.
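
For reference, a rough sketch of how a BATCH_SIZE-driven poll of the changes feed could produce requests like the ones above (Node.js; the environment variable wiring and function names are assumptions, not cht-sync's actual implementation):

// Sketch only: BATCH_SIZE read from the environment, defaulting to 1000.
const BATCH_SIZE = parseInt(process.env.BATCH_SIZE || '1000', 10);
const COUCH_URL = process.env.COUCH_URL; // hypothetical; points at https://<cht-host>/medic

async function fetchBatch(since) {
  const params = new URLSearchParams({
    limit: String(BATCH_SIZE),
    seq_interval: String(BATCH_SIZE),
    since: String(since),
  });
  const res = await fetch(`${COUCH_URL}/_changes?${params}`);
  if (!res.ok) {
    throw new Error(`_changes returned ${res.status}`);
  }
  const body = await res.json();
  // last_seq is the opaque sequence token to use as "since" for the next batch.
  return { changes: body.results, nextSince: body.last_seq };
}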

Is there a way I can get access to the server that runs cht-sync or is there a way to see the setup of cht-sync?

witash commented 3 months ago

Those are from the cht-sync that's in the dev EKS, which is using a different couch2pg image for debugging and is doing some weird things. I didn't get any useful information from it, so I'll stop it.

dianabarsan commented 3 months ago

A different image? Which branch?

dianabarsan commented 3 months ago

@witash Can you confirm that these services were stopped?

witash commented 3 months ago

yes

witash commented 3 months ago

Here is the nginx log entry for the error message above:

chis-training-3-api-access.log.5:10.127.105.170 - - [09/Aug/2024:09:11:41 +0000] "GET /medic/_changes?seq_interval=1000&since=2254464-g1AAAAO1eJyV069PxDAUB_ByI8FgIAGLJ4GsW9duCjyKu-sf0NdtWZbjl8BgcPwH_AEY2DX8BwgMGncKhyIXJJJdebTIC0lT8U3avE-a99oJIWS9iUqyoc8udVPCIU3EfoyLTvBooAhsSTlumwhI9PV2gntriokMNF9W84_060jYNWb6h306LGcV5QkLxdAxcGDtwmPvvcMyoVQOwRg6FvB2I491dw7jteKMZaEYOmNojek8djH0PYOCp7oOxdCZwrW1vcc2jxxW0xRYUoVi6CxOV8lQylvHDV4-HBdXcaF1GcihNEIJ4wFn4cGbcwcyyquEhY4BpQ4ljG9rnz24t-NAnWsog-eKUo8SxrGUMweuvD45kIKmQqtwEJs3w7g3Zu7Bx20HFqpglQj9E755cwx8gdaDV9z3EERWpGJZafsDf2wFDg&limit=25 HTTP/1.1" 504 160 "-" "node-fetch/1.0 (+https://github.com/bitinn/node-fetch)"

So nginx returns a 504 Gateway Timeout.

My first guess would be that the CHT is taking far too long to serve these requests and timing out, but I can see that proxy_read_timeout is set to 3600s, which is extremely long - basically no timeout. So the next guess is that it's hitting proxy_connect_timeout, which is 60s by default, meaning nginx is not even able to establish a connection within 60s (i.e. the CHT instance is completely down).
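
For reference, the relevant nginx directives look roughly like this (an illustrative location block, not the deployed config; the upstream address and values are only examples taken from the logs below):

# Illustrative reverse-proxy config; the actual site config may differ.
location / {
    proxy_pass https://10.127.105.119:443;  # CHT API upstream (address from the error log below)
    proxy_read_timeout    3600s;  # time allowed for the upstream to produce a response
    proxy_connect_timeout 60s;    # nginx default; time allowed to establish the TCP connection
}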

But I don't see anything out of the ordinary in the CHT API or haproxy logs during the same period that would suggest the API was down and connections were timing out.

witash commented 3 months ago

There was a similar pattern in the Nairobi instance's nginx logs while cht-sync was running: a small percentage of cht-sync requests fail with 504. Interestingly, a small percentage of the other requests to the changes feed (from normal clients) also return 504. That may not be the same thing, because those are long polling, and it could also just be that the changes feed is so common that a general issue shows up there disproportionately in the logs. But it does suggest that this is something to do with the changes feed, not with cht-sync requests specifically.

witash commented 3 months ago

And from the error logs:

chis-training-3-api.error.log.4:2024/08/09 09:11:41 [error] 315099#315099: *51989785 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.127.105.170, server: chis-training-3.health.go.ke, request: "GET /medic/_changes?seq_interval=1000&since=2254464-g1AAAAO1eJyV069PxDAUB_ByI8FgIAGLJ4GsW9duCjyKu-sf0NdtWZbjl8BgcPwH_AEY2DX8BwgMGncKhyIXJJJdebTIC0lT8U3avE-a99oJIWS9iUqyoc8udVPCIU3EfoyLTvBooAhsSTlumwhI9PV2gntriokMNF9W84_060jYNWb6h306LGcV5QkLxdAxcGDtwmPvvcMyoVQOwRg6FvB2I491dw7jteKMZaEYOmNojek8djH0PYOCp7oOxdCZwrW1vcc2jxxW0xRYUoVi6CxOV8lQylvHDV4-HBdXcaF1GcihNEIJ4wFn4cGbcwcyyquEhY4BpQ4ljG9rnz24t-NAnWsog-eKUo8SxrGUMweuvD45kIKmQqtwEJs3w7g3Zu7Bx20HFqpglQj9E755cwx8gdaDV9z3EERWpGJZafsDf2wFDg&limit=25 HTTP/1.1", upstream: "https://10.127.105.119:443/medic/_changes?seq_interval=1000&since=2254464-g1AAAAO1eJyV069PxDAUB_ByI8FgIAGLJ4GsW9duCjyKu-sf0NdtWZbjl8BgcPwH_AEY2DX8BwgMGncKhyIXJJJdebTIC0lT8U3avE-a99oJIWS9iUqyoc8udVPCIU3EfoyLTvBooAhsSTlumwhI9PV2gntriokMNF9W84_060jYNWb6h306LGcV5QkLxdAxcGDtwmPvvcMyoVQOwRg6FvB2I491dw7jteKMZaEYOmNojek8djH0PYOCp7oOxdCZwrW1vcc2jxxW0xRYUoVi6CxOV8lQylvHDV4-HBdXcaF1GcihNEIJ4wFn4cGbcwcyyquEhY4BpQ4ljG9rnz24t-NAnWsog-eKUo8SxrGUMweuvD45kIKmQqtwEJs3w7g3Zu7Bx20HFqpglQj9E755cwx8gdaDV9z3EERWpGJZafsDf2wFDg&limit=25", host: "chis-training-3.health.go.ke"

From the Nairobi error logs (and the other training logs), there are some similar upstream timed out (110: Connection timed out) while connecting to upstream messages. This message definitely means the timeout is in establishing the connection, not that the upstream is taking too long to process the request.

But there are also some messages with (24: Too many open files), like:

2024/08/07 18:35:28 [alert] 191941#191941: *35631533 socket() failed (24: Too many open files) while connecting to upstream, client: 10.127.105.170, server: nairobi-echis.health.go.ke, request: "GET /medic/_changes?seq_interval=1000&since=2083664-g1AAAAO1eJyV0zFLAzEUB_DYKwjioIIfwNVBLne5S2_SL-CgbRzc8nJ3HEetk5sg-B1cXNy0zeYncFa63mcQ6uxmG595axFChj8k5EfyXjJmjG03Ucl2zfWNaUo44Yk8inHwMS71NIN9pUZtEwGLnt6vcG5TC5mBydft-Uf6cxQcWjsj7H7PYwNR8TwRoRg6Fo6dWxEWX3gsk1oPIBhDxwGebkhY_9Njea1zIbJQDJ0RtNZOPdb7pmtqKPLU1KEYOjO4c25J2OOrx2qegkiqUAyd1aTPzpV6IG5y6bm4igtjykAOpSFKGC_YCwJPPzwoeF4lIrQNKE1Rwvhx7o3Ag86DZmCgDO4rSkuUMM6U6gjcmXuQg-HS6HAQi9dhPFu7IHBLeLDQhahk6J-g4i0w8AU6D2583VINQWZFKtdtbX8BLcQCQw&limit=25 HTTP/1.1", upstream: "https://10.127.105.123:443/medic/_changes?seq_interval=1000&since=2083664-g1AAAAO1eJyV0zFLAzEUB_DYKwjioIIfwNVBLne5S2_SL-CgbRzc8nJ3HEetk5sg-B1cXNy0zeYncFa63mcQ6uxmG595axFChj8k5EfyXjJmjG03Ucl2zfWNaUo44Yk8inHwMS71NIN9pUZtEwGLnt6vcG5TC5mBydft-Uf6cxQcWjsj7H7PYwNR8TwRoRg6Fo6dWxEWX3gsk1oPIBhDxwGebkhY_9Njea1zIbJQDJ0RtNZOPdb7pmtqKPLU1KEYOjO4c25J2OOrx2qegkiqUAyd1aTPzpV6IG5y6bm4igtjykAOpSFKGC_YCwJPPzwoeF4lIrQNKE1Rwvhx7o3Ag86DZmCgDO4rSkuUMM6U6gjcmXuQg-HS6HAQi9dhPFu7IHBLeLDQhahk6J-g4i0w8AU6D2583VINQWZFKtdtbX8BLcQCQw&limit=25", host: "nairobi.echis.go.ke"

dianabarsan commented 3 months ago

Looking at the state of the pods now:

cht-sync-couch2pg--chis-training-2-health-go-ke-547d647449gkhwb   1/1     Running   10 (2d20h ago)   6d5h
cht-sync-couch2pg--chis-training-3-health-go-ke-74cfb748f78whkn   1/1     Running   10 (2d20h ago)   6d5h
cht-sync-couch2pg--chis-training-health-go-ke-857f8b8ff-5lh94     1/1     Running   32 (2d20h ago)   6d5h
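
(For anyone reproducing this, the pod state above and the crash logs below can be pulled with standard kubectl commands along these lines; the namespace and pod names depend on the deployment:)

# Illustrative commands; substitute the real namespace and pod names.
kubectl get pods -n <cht-sync-namespace>
kubectl logs <couch2pg-pod-name> -n <cht-sync-namespace> --previous  # logs from the most recently crashed container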

Looking at last crashes:

2024-08-12T17:07:48.609807241Z                         ^
2024-08-12T17:07:48.610167246Z FetchError: request to https://chis-training-2.health.go.ke/_session failed, reason: connect ECONNREFUSED 197.248.202.213:443
2024-08-12T17:07:48.610183093Z     at ClientRequest.<anonymous> (/node_modules/node-fetch/lib/index.js:1491:11)
2024-08-12T17:07:48.610187448Z     at ClientRequest.emit (node:events:519:28)
2024-08-12T17:07:48.610190860Z     at TLSSocket.socketErrorListener (node:_http_client:500:9)
2024-08-12T17:07:48.610194018Z     at TLSSocket.emit (node:events:519:28)
2024-08-12T17:07:48.610197835Z     at emitErrorNT (node:internal/streams/destroy:169:8)
2024-08-12T17:07:48.610204255Z     at emitErrorCloseNT (node:internal/streams/destroy:128:3)
2024-08-12T17:07:48.610210655Z     at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
2024-08-12T17:07:48.610216407Z   type: 'system',
2024-08-12T17:07:48.610220589Z   errno: 'ECONNREFUSED',
2024-08-12T17:07:48.610224525Z   code: 'ECONNREFUSED'
2024-08-12T17:07:48.610228708Z }
2024-08-12T17:07:48.610232385Z 
2024-08-12T17:07:48.610236046Z Node.js v20.16.0
2024-08-12T17:07:53.705129905Z FetchError: request to https://chis-training-3.health.go.ke/_session failed, reason: connect ECONNREFUSED 197.248.202.213:443
2024-08-12T17:07:53.705147556Z     at ClientRequest.<anonymous> (/node_modules/node-fetch/lib/index.js:1491:11)
2024-08-12T17:07:53.705152027Z     at ClientRequest.emit (node:events:519:28)
2024-08-12T17:07:53.705155295Z     at TLSSocket.socketErrorListener (node:_http_client:500:9)
2024-08-12T17:07:53.705160004Z     at TLSSocket.emit (node:events:519:28)
2024-08-12T17:07:53.705164463Z     at emitErrorNT (node:internal/streams/destroy:169:8)
2024-08-12T17:07:53.705168002Z     at emitErrorCloseNT (node:internal/streams/destroy:128:3)
2024-08-12T17:07:53.705171086Z     at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
2024-08-12T17:07:53.705174122Z   type: 'system',
2024-08-12T17:07:53.705177043Z   errno: 'ECONNREFUSED',
2024-08-12T17:07:53.705180090Z   code: 'ECONNREFUSED'
2024-08-12T17:07:53.705183077Z }
2024-08-12T17:07:53.705187603Z 
2024-08-12T17:07:53.705192001Z Node.js v20.16.0
2024-08-12T17:07:25.601083819Z /node_modules/node-fetch/lib/index.js:1491
2024-08-12T17:07:25.601091706Z                  reject(new FetchError(`request to ${request.url} failed, reason: ${err.message}`, 'system', err));
2024-08-12T17:07:25.601098024Z                         ^
2024-08-12T17:07:25.601417868Z FetchError: request to https://chis-training.health.go.ke/_session failed, reason: connect ECONNREFUSED 197.248.202.213:443
2024-08-12T17:07:25.601446968Z     at ClientRequest.<anonymous> (/node_modules/node-fetch/lib/index.js:1491:11)
2024-08-12T17:07:25.601452173Z     at ClientRequest.emit (node:events:519:28)
2024-08-12T17:07:25.601455493Z     at TLSSocket.socketErrorListener (node:_http_client:500:9)
2024-08-12T17:07:25.601458974Z     at TLSSocket.emit (node:events:519:28)
2024-08-12T17:07:25.601462005Z     at emitErrorNT (node:internal/streams/destroy:169:8)
2024-08-12T17:07:25.601465135Z     at emitErrorCloseNT (node:internal/streams/destroy:128:3)
2024-08-12T17:07:25.601468083Z     at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
2024-08-12T17:07:25.601471116Z   type: 'system',
2024-08-12T17:07:25.601475576Z   errno: 'ECONNREFUSED',
2024-08-12T17:07:25.601479960Z   code: 'ECONNREFUSED'
2024-08-12T17:07:25.601483288Z }
2024-08-12T17:07:25.601486301Z 
2024-08-12T17:07:25.601489156Z Node.js v20.16.0

Unfortunately I can't get logs for the earlier pods (thanks, k8s), but it seems like these last crashes, which happened 3 days ago, had a real underlying connectivity problem, since they all happened at roughly the same time.

At the moment, the cht-sync instances are stable and up to date. If we want to investigate this behavior further, we could temporarily create a new Postgres database, sync one of those training instances again, and watch. I still suspect some higher-level networking timeout issue rather than a cht-sync or CHT issue.

witash commented 3 months ago

connection refused could correspond to this message in the nginx logs:

2024/08/07 18:35:28 [alert] 191941#191941: *35631533 socket() failed (24: Too many open files) while connecting to upstream, client: 10.127.105.170, server: nairobi-echis.health.go.ke, request: "GET /medic/_changes?seq_interval=1000&since=2083664-g1AAAAO1eJyV0zFLAzEUB_DYKwjioIIfwNVBLne5S2_SL-CgbRzc8nJ3HEetk5sg-B1cXNy0zeYncFa63mcQ6uxmG595axFChj8k5EfyXjJmjG03Ucl2zfWNaUo44Yk8inHwMS71NIN9pUZtEwGLnt6vcG5TC5mBydft-Uf6cxQcWjsj7H7PYwNR8TwRoRg6Fo6dWxEWX3gsk1oPIBhDxwGebkhY_9Njea1zIbJQDJ0RtNZOPdb7pmtqKPLU1KEYOjO4c25J2OOrx2qegkiqUAyd1aTPzpV6IG5y6bm4igtjykAOpSFKGC_YCwJPPzwoeF4lIrQNKE1Rwvhx7o3Ag86DZmCgDO4rSkuUMM6U6gjcmXuQg-HS6HAQi9dhPFu7IHBLeLDQhahk6J-g4i0w8AU6D2583VINQWZFKtdtbX8BLcQCQw&limit=25 HTTP/1.1", upstream: "https://10.127.105.123:443/medic/_changes?seq_interval=1000&since=2083664-g1AAAAO1eJyV0zFLAzEUB_DYKwjioIIfwNVBLne5S2_SL-CgbRzc8nJ3HEetk5sg-B1cXNy0zeYncFa63mcQ6uxmG595axFChj8k5EfyXjJmjG03Ucl2zfWNaUo44Yk8inHwMS71NIN9pUZtEwGLnt6vcG5TC5mBydft-Uf6cxQcWjsj7H7PYwNR8TwRoRg6Fo6dWxEWX3gsk1oPIBhDxwGebkhY_9Njea1zIbJQDJ0RtNZOPdb7pmtqKPLU1KEYOjO4c25J2OOrx2qegkiqUAyd1aTPzpV6IG5y6bm4igtjykAOpSFKGC_YCwJPPzwoeF4lIrQNKE1Rwvhx7o3Ag86DZmCgDO4rSkuUMM6U6gjcmXuQg-HS6HAQi9dhPFu7IHBLeLDQhahk6J-g4i0w8AU6D2583VINQWZFKtdtbX8BLcQCQw&limit=25", host: "nairobi.echis.go.ke"

So there are at least two issues:

  1. 504 Gateway Timeout caused by a connection timeout between nginx and the CHT instances (very likely a network issue).
  2. Connection refused by nginx because of too many open files (no idea why there are too many open files; a quick check is sketched below).

Plus maybe a third, if the connection refused is not related to 2.
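
A quick way to check whether the nginx workers are actually hitting their file-descriptor limit (illustrative shell commands, to be run on the host; the process selection may need adjusting):

# Show the open-file limit in effect for an nginx process
cat /proc/$(pgrep -o nginx)/limits | grep 'Max open files'
# Count how many descriptors that process currently has open
ls /proc/$(pgrep -o nginx)/fd | wc -l
# If the limit is the bottleneck, nginx's worker_rlimit_nofile directive can raise it.
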
andrablaj commented 2 months ago

Per yesterday's weekly sync, closing this issue as the error doesn't seem to be related to CHT Sync.