mrjones-plip closed this issue 2 months ago.
It looks like the instances worked as normal and there were no crashes.
I'm suspecting cht-sync is crashing because requests to the CHT instances fail - and it's set up to crash and recover when this happens - so this is expected behavior.
As of now, I don't think anyone has logs from the restarted cht-sync containers, but I did inspect some of the instances in question and found no errors or downtime.
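If the restarted couch2pg pods are still around, the logs from the most recent crash may still be retrievable with kubectl. A minimal sketch, assuming the cht-sync workers run in Kubernetes (as the pod listing later in this thread suggests); the pod name is a placeholder:

# List the couch2pg pods and their restart counts
kubectl get pods | grep couch2pg
# Fetch logs from the previously terminated container of one pod;
# this only covers the most recent restart, not older crashes
kubectl logs <pod-name> --previous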
Sample error logs
/node_modules/node-fetch/lib/index.js:273
return Body.Promise.reject(new FetchError(`invalid json response body at ${_this2.url} reason: ${err.message}`, 'invalid-json'));
^
FetchError: invalid json response body at https://chis-training-3.health.go.ke/medic/_changes?seq_interval=1000&since=2254464-g1AAAAO1eJyV069PxDAUB_ByI8FgIAGLJ4GsW9duCjyKu-sf0NdtWZbjl8BgcPwH_AEY2DX8BwgMGncKhyIXJJJdebTIC0lT8U3avE-a99oJIWS9iUqyoc8udVPCIU3EfoyLTvBooAhsSTlumwhI9PV2gntriokMNF9W84_060jYNWb6h306LGcV5QkLxdAxcGDtwmPvvcMyoVQOwRg6FvB2I491dw7jteKMZaEYOmNojek8djH0PYOCp7oOxdCZwrW1vcc2jxxW0xRYUoVi6CxOV8lQylvHDV4-HBdXcaF1GcihNEIJ4wFn4cGbcwcyyquEhY4BpQ4ljG9rnz24t-NAnWsog-eKUo8SxrGUMweuvD45kIKmQqtwEJs3w7g3Zu7Bx20HFqpglQj9E755cwx8gdaDV9z3EERWpGJZafsDf2wFDg&limit=25 reason: Unexpected token '<', "<html>
<h"... is not valid JSON
at /node_modules/node-fetch/lib/index.js:273:32
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async fetchJSON (/node_modules/pouchdb-adapter-http/lib/index.js:246:18)
at async fetchData (/node_modules/pouchdb-adapter-http/lib/index.js:1029:24) {
type: 'invalid-json'
}
Node.js v20.16.0
Thanks @EarthmanAman for the update!
I've updated this ticket to be about crashes in CHT Sync as that's where it sounds like the crashes are, both from @EarthmanAman and from internal reports.
I'm logged into the echistraining3 instance and I'm seeing lots of requests to get 0 changes starting from 0, and some less frequent requests to get 0 changes starting from a different sequence.
2024-08-13T13:22:04.378 REQ: 97343f0e-c710-4fff-94ba-1482b0c581c0 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0
2024-08-13T13:22:04.391 RES: 97343f0e-c710-4fff-94ba-1482b0c581c0 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0 200 - 12.697 ms
2024-08-13T13:22:09.854 REQ: 9a267de5-7d16-422c-b3c7-bc76229a3890 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0
2024-08-13T13:22:09.870 RES: 9a267de5-7d16-422c-b3c7-bc76229a3890 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0 200 - 14.911 ms
2024-08-13T13:22:15.334 REQ: 3e2502ec-3c2a-476e-8768-2f8066b4cf2e 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0
2024-08-13T13:22:15.351 RES: 3e2502ec-3c2a-476e-8768-2f8066b4cf2e 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0 200 - 16.170 ms
2024-08-13T13:22:20.821 REQ: a125b77b-f096-4ab5-9740-1b72d20e2a3d 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0
2024-08-13T13:22:20.836 RES: a125b77b-f096-4ab5-9740-1b72d20e2a3d 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0 200 - 14.416 ms
2024-08-13T13:22:26.299 REQ: 8a05f202-6241-4257-806e-cfd63f128ece 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0
2024-08-13T13:22:26.315 RES: 8a05f202-6241-4257-806e-cfd63f128ece 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0 200 - 14.392 ms
2024-08-13T13:22:31.779 REQ: 1dcdd947-103e-490c-8d37-4872843279de 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0
2024-08-13T13:22:31.795 RES: 1dcdd947-103e-490c-8d37-4872843279de 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0 200 - 14.026 ms
2024-08-13T13:22:37.262 REQ: f33644b0-a217-4964-88ce-5a6a612b12b0 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0
2024-08-13T13:22:37.279 RES: f33644b0-a217-4964-88ce-5a6a612b12b0 3.11.14.202 medic GET /medic/_changes?limit=0&since=0 HTTP/1.0 200 - 16.202 ms
2024-08-13T13:26:03.604 REQ: b9a33e12-b4c9-4b7f-8c5b-bf2e42277d82 10.127.105.170 medic GET /medic/_changes?limit=0&since=3143994-g1AAAAO1eJyV0zFOwzAUBmCrqQQDS4dunIABxYljJxPcAEFridXPSZRGBRaYWbhCjwC0vgQSd-glKjEjJBL3YWeskKwMvxTLn57fs5eEkJMmKslEPzzppoRLmojzGD-6xKWRIjCVct42EZDIft3hvyPFRAaaH9rzj_TnSDgzZjNgVw7LWUV5wkIxdAxcWNs7bDw6dVgmlMohGEPHAlY3GyrzGK8VZywLxdCZQ2vM2ldGbn3PoOCprkMxdDbwbG3nsejTYTVNgSVVKIZOfz8mN1KuhoNOHBdXcaF1GcihNEMJ4x1n4cHu0YGM8iphoWNAaY0Sxq-1H_7AxwsH6lxDGTxXlDqUMK6l3PoK-1cHUtBUaBUOYvO2GG_G7Dz4s3JgoQpWidA34Zu3w8AbaD34_eJ7CCIrUnFoa7sHsIIEaQ HTTP/1.0
These seem to be coming from 3.11.14.202 and from 10.127.105.170. Can someone check whether either of these is the IP that hosts cht-sync? Are there two cht-syncs running? cht-sync has a BATCH_SIZE property which is set to 1000 by default, but it can be overridden through an environment variable.
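One way to check where those requests originate, assuming the cht-sync deployments run in Kubernetes and we have kubectl access to that cluster (the couch2pg grep pattern is based on the pod names shown later in this thread):

# Pod IPs and the nodes they are scheduled on - compare against 10.127.105.170
kubectl get pods -o wide | grep couch2pg
# Node internal and external IPs - compare against 3.11.14.202
kubectl get nodes -o wide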
Is there a way I can get access to the server that runs cht-sync, or a way to see how cht-sync is set up?
Those are from the cht-sync that's in the dev EKS, which is using a different image for couch2pg for debugging and is doing some weird things. But I didn't get any useful information from it, so I'll stop it.
A different image? Which branch?
@witash Can you confirm that these services were stopped?
yes
Here is the nginx log entry for the error message above:
chis-training-3-api-access.log.5:10.127.105.170 - - [09/Aug/2024:09:11:41 +0000] "GET /medic/_changes?seq_interval=1000&since=2254464-g1AAAAO1eJyV069PxDAUB_ByI8FgIAGLJ4GsW9duCjyKu-sf0NdtWZbjl8BgcPwH_AEY2DX8BwgMGncKhyIXJJJdebTIC0lT8U3avE-a99oJIWS9iUqyoc8udVPCIU3EfoyLTvBooAhsSTlumwhI9PV2gntriokMNF9W84_060jYNWb6h306LGcV5QkLxdAxcGDtwmPvvcMyoVQOwRg6FvB2I491dw7jteKMZaEYOmNojek8djH0PYOCp7oOxdCZwrW1vcc2jxxW0xRYUoVi6CxOV8lQylvHDV4-HBdXcaF1GcihNEIJ4wFn4cGbcwcyyquEhY4BpQ4ljG9rnz24t-NAnWsog-eKUo8SxrGUMweuvD45kIKmQqtwEJs3w7g3Zu7Bx20HFqpglQj9E755cwx8gdaDV9z3EERWpGJZafsDf2wFDg&limit=25 HTTP/1.1" 504 160 "-" "node-fetch/1.0 (+https://github.com/bitinn/node-fetch)"
So nginx returns a 504 Gateway Timeout.
My first guess would be that the CHT is taking far too long on these requests and timing out, but I can see that proxy_read_timeout is set to 3600s, which is extremely long - basically no timeout. So my next guess is that it's hitting proxy_connect_timeout, which defaults to 60s, and nginx is not able to even establish a connection within 60s (i.e. the CHT instance is totally down).
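A quick way to confirm which timeouts are actually in effect, assuming shell access to the host running this nginx (a sketch, not specific to this deployment):

# Dump the full effective nginx config and pull out the proxy timeouts;
# proxy_connect_timeout defaults to 60s if it is not set anywhere
nginx -T 2>/dev/null | grep -E 'proxy_(connect|read|send)_timeout'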
But I don't see anything out of the ordinary in the CHT API or haproxy logs during the same period that would suggest the API was down and connections would be timing out.
There was a similar pattern in the Nairobi instance's nginx logs while cht-sync was running: a small percentage of cht-sync requests fail with 504. Interestingly, a small percentage of the other requests to the changes feed (from normal clients) also return 504. That may not be the same thing, because those are long polling, and it could also just be that the changes feed is so common that a general issue is overrepresented there in the logs. But it does suggest this is something to do with the changes feed in general, and not with cht-sync requests specifically.
And from the error logs:
chis-training-3-api.error.log.4:2024/08/09 09:11:41 [error] 315099#315099: *51989785 upstream timed out (110: Connection timed out) while connecting to upstream, client: 10.127.105.170, server: chis-training-3.health.go.ke, request: "GET /medic/_changes?seq_interval=1000&since=2254464-g1AAAAO1eJyV069PxDAUB_ByI8FgIAGLJ4GsW9duCjyKu-sf0NdtWZbjl8BgcPwH_AEY2DX8BwgMGncKhyIXJJJdebTIC0lT8U3avE-a99oJIWS9iUqyoc8udVPCIU3EfoyLTvBooAhsSTlumwhI9PV2gntriokMNF9W84_060jYNWb6h306LGcV5QkLxdAxcGDtwmPvvcMyoVQOwRg6FvB2I491dw7jteKMZaEYOmNojek8djH0PYOCp7oOxdCZwrW1vcc2jxxW0xRYUoVi6CxOV8lQylvHDV4-HBdXcaF1GcihNEIJ4wFn4cGbcwcyyquEhY4BpQ4ljG9rnz24t-NAnWsog-eKUo8SxrGUMweuvD45kIKmQqtwEJs3w7g3Zu7Bx20HFqpglQj9E755cwx8gdaDV9z3EERWpGJZafsDf2wFDg&limit=25 HTTP/1.1", upstream: "https://10.127.105.119:443/medic/_changes?seq_interval=1000&since=2254464-g1AAAAO1eJyV069PxDAUB_ByI8FgIAGLJ4GsW9duCjyKu-sf0NdtWZbjl8BgcPwH_AEY2DX8BwgMGncKhyIXJJJdebTIC0lT8U3avE-a99oJIWS9iUqyoc8udVPCIU3EfoyLTvBooAhsSTlumwhI9PV2gntriokMNF9W84_060jYNWb6h306LGcV5QkLxdAxcGDtwmPvvcMyoVQOwRg6FvB2I491dw7jteKMZaEYOmNojek8djH0PYOCp7oOxdCZwrW1vcc2jxxW0xRYUoVi6CxOV8lQylvHDV4-HBdXcaF1GcihNEIJ4wFn4cGbcwcyyquEhY4BpQ4ljG9rnz24t-NAnWsog-eKUo8SxrGUMweuvD45kIKmQqtwEJs3w7g3Zu7Bx20HFqpglQj9E755cwx8gdaDV9z3EERWpGJZafsDf2wFDg&limit=25", host: "chis-training-3.health.go.ke"
From the Nairobi error logs (and other training logs), there are some similar "upstream timed out (110: Connection timed out) while connecting to upstream" messages.
This message definitely means the timeout happens while establishing the connection, not that the upstream is taking too long to process the request.
But there are also some messages with "(24: Too many open files)", like:
2024/08/07 18:35:28 [alert] 191941#191941: *35631533 socket() failed (24: Too many open files) while connecting to upstream, client: 10.127.105.170, server: nairobi-echis.health.go.ke, request: "GET /medic/_changes?seq_interval=1000&since=2083664-g1AAAAO1eJyV0zFLAzEUB_DYKwjioIIfwNVBLne5S2_SL-CgbRzc8nJ3HEetk5sg-B1cXNy0zeYncFa63mcQ6uxmG595axFChj8k5EfyXjJmjG03Ucl2zfWNaUo44Yk8inHwMS71NIN9pUZtEwGLnt6vcG5TC5mBydft-Uf6cxQcWjsj7H7PYwNR8TwRoRg6Fo6dWxEWX3gsk1oPIBhDxwGebkhY_9Njea1zIbJQDJ0RtNZOPdb7pmtqKPLU1KEYOjO4c25J2OOrx2qegkiqUAyd1aTPzpV6IG5y6bm4igtjykAOpSFKGC_YCwJPPzwoeF4lIrQNKE1Rwvhx7o3Ag86DZmCgDO4rSkuUMM6U6gjcmXuQg-HS6HAQi9dhPFu7IHBLeLDQhahk6J-g4i0w8AU6D2583VINQWZFKtdtbX8BLcQCQw&limit=25 HTTP/1.1", upstream: "https://10.127.105.123:443/medic/_changes?seq_interval=1000&since=2083664-g1AAAAO1eJyV0zFLAzEUB_DYKwjioIIfwNVBLne5S2_SL-CgbRzc8nJ3HEetk5sg-B1cXNy0zeYncFa63mcQ6uxmG595axFChj8k5EfyXjJmjG03Ucl2zfWNaUo44Yk8inHwMS71NIN9pUZtEwGLnt6vcG5TC5mBydft-Uf6cxQcWjsj7H7PYwNR8TwRoRg6Fo6dWxEWX3gsk1oPIBhDxwGebkhY_9Njea1zIbJQDJ0RtNZOPdb7pmtqKPLU1KEYOjO4c25J2OOrx2qegkiqUAyd1aTPzpV6IG5y6bm4igtjykAOpSFKGC_YCwJPPzwoeF4lIrQNKE1Rwvhx7o3Ag86DZmCgDO4rSkuUMM6U6gjcmXuQg-HS6HAQi9dhPFu7IHBLeLDQhahk6J-g4i0w8AU6D2583VINQWZFKtdtbX8BLcQCQw&limit=25", host: "nairobi.echis.go.ke"
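To check whether the nginx workers really are hitting their file descriptor limit, something like this could be run on the affected host (a sketch; it assumes a standard nginx process title and may need root):

# Open-files limit of one nginx worker
cat /proc/$(pgrep -f 'nginx: worker' | head -n1)/limits | grep 'open files'
# Number of file descriptors that worker currently holds
ls /proc/$(pgrep -f 'nginx: worker' | head -n1)/fd | wc -l
# If the limit is too low, worker_rlimit_nofile in nginx.conf (and the systemd LimitNOFILE setting) can raise it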
Looking at the state of the pods now:
cht-sync-couch2pg--chis-training-2-health-go-ke-547d647449gkhwb 1/1 Running 10 (2d20h ago) 6d5h
cht-sync-couch2pg--chis-training-3-health-go-ke-74cfb748f78whkn 1/1 Running 10 (2d20h ago) 6d5h
cht-sync-couch2pg--chis-training-health-go-ke-857f8b8ff-5lh94 1/1 Running 32 (2d20h ago) 6d5h
Looking at the last crashes:
2024-08-12T17:07:48.609807241Z ^
2024-08-12T17:07:48.610167246Z FetchError: request to https://chis-training-2.health.go.ke/_session failed, reason: connect ECONNREFUSED 197.248.202.213:443
2024-08-12T17:07:48.610183093Z at ClientRequest.<anonymous> (/node_modules/node-fetch/lib/index.js:1491:11)
2024-08-12T17:07:48.610187448Z at ClientRequest.emit (node:events:519:28)
2024-08-12T17:07:48.610190860Z at TLSSocket.socketErrorListener (node:_http_client:500:9)
2024-08-12T17:07:48.610194018Z at TLSSocket.emit (node:events:519:28)
2024-08-12T17:07:48.610197835Z at emitErrorNT (node:internal/streams/destroy:169:8)
2024-08-12T17:07:48.610204255Z at emitErrorCloseNT (node:internal/streams/destroy:128:3)
2024-08-12T17:07:48.610210655Z at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
2024-08-12T17:07:48.610216407Z type: 'system',
2024-08-12T17:07:48.610220589Z errno: 'ECONNREFUSED',
2024-08-12T17:07:48.610224525Z code: 'ECONNREFUSED'
2024-08-12T17:07:48.610228708Z }
2024-08-12T17:07:48.610232385Z
2024-08-12T17:07:48.610236046Z Node.js v20.16.0
2024-08-12T17:07:53.705129905Z FetchError: request to https://chis-training-3.health.go.ke/_session failed, reason: connect ECONNREFUSED 197.248.202.213:443
2024-08-12T17:07:53.705147556Z at ClientRequest.<anonymous> (/node_modules/node-fetch/lib/index.js:1491:11)
2024-08-12T17:07:53.705152027Z at ClientRequest.emit (node:events:519:28)
2024-08-12T17:07:53.705155295Z at TLSSocket.socketErrorListener (node:_http_client:500:9)
2024-08-12T17:07:53.705160004Z at TLSSocket.emit (node:events:519:28)
2024-08-12T17:07:53.705164463Z at emitErrorNT (node:internal/streams/destroy:169:8)
2024-08-12T17:07:53.705168002Z at emitErrorCloseNT (node:internal/streams/destroy:128:3)
2024-08-12T17:07:53.705171086Z at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
2024-08-12T17:07:53.705174122Z type: 'system',
2024-08-12T17:07:53.705177043Z errno: 'ECONNREFUSED',
2024-08-12T17:07:53.705180090Z code: 'ECONNREFUSED'
2024-08-12T17:07:53.705183077Z }
2024-08-12T17:07:53.705187603Z
2024-08-12T17:07:53.705192001Z Node.js v20.16.0
2024-08-12T17:07:25.601083819Z /node_modules/node-fetch/lib/index.js:1491
2024-08-12T17:07:25.601091706Z reject(new FetchError(`request to ${request.url} failed, reason: ${err.message}`, 'system', err));
2024-08-12T17:07:25.601098024Z ^
2024-08-12T17:07:25.601417868Z FetchError: request to https://chis-training.health.go.ke/_session failed, reason: connect ECONNREFUSED 197.248.202.213:443
2024-08-12T17:07:25.601446968Z at ClientRequest.<anonymous> (/node_modules/node-fetch/lib/index.js:1491:11)
2024-08-12T17:07:25.601452173Z at ClientRequest.emit (node:events:519:28)
2024-08-12T17:07:25.601455493Z at TLSSocket.socketErrorListener (node:_http_client:500:9)
2024-08-12T17:07:25.601458974Z at TLSSocket.emit (node:events:519:28)
2024-08-12T17:07:25.601462005Z at emitErrorNT (node:internal/streams/destroy:169:8)
2024-08-12T17:07:25.601465135Z at emitErrorCloseNT (node:internal/streams/destroy:128:3)
2024-08-12T17:07:25.601468083Z at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
2024-08-12T17:07:25.601471116Z type: 'system',
2024-08-12T17:07:25.601475576Z errno: 'ECONNREFUSED',
2024-08-12T17:07:25.601479960Z code: 'ECONNREFUSED'
2024-08-12T17:07:25.601483288Z }
2024-08-12T17:07:25.601486301Z
2024-08-12T17:07:25.601489156Z Node.js v20.16.0
Unfortunately I can't get logs for previous pods (thanks k8s), but it seems like these last crashes, which happened 3 days ago, had a real underlying connectivity problem, since they all happened at roughly the same time.
At the moment, the cht-sync instances are stable and up to date. If we want to investigate this behavior further, we could (temporarily) create a new postgres database, sync one of those training instances again, and watch what happens. I still suspect some higher-level networking timeout issue, rather than either a cht-sync or CHT issue.
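For reference, a minimal sketch of that approach, assuming shell access to the Postgres host that cht-sync writes to; the database name and connection flags are illustrative, and cht-sync would then need to be pointed at the new database through its Postgres environment settings:

# Create a throwaway database to re-sync one training instance into
createdb -h <postgres-host> -U <postgres-user> cht_sync_debug
# Drop it again once the investigation is done
dropdb -h <postgres-host> -U <postgres-user> cht_sync_debug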
The connection refused errors could correspond to this message in the nginx logs:
2024/08/07 18:35:28 [alert] 191941#191941: *35631533 socket() failed (24: Too many open files) while connecting to upstream, client: 10.127.105.170, server: nairobi-echis.health.go.ke, request: "GET /medic/_changes?seq_interval=1000&since=2083664-g1AAAAO1eJyV0zFLAzEUB_DYKwjioIIfwNVBLne5S2_SL-CgbRzc8nJ3HEetk5sg-B1cXNy0zeYncFa63mcQ6uxmG595axFChj8k5EfyXjJmjG03Ucl2zfWNaUo44Yk8inHwMS71NIN9pUZtEwGLnt6vcG5TC5mBydft-Uf6cxQcWjsj7H7PYwNR8TwRoRg6Fo6dWxEWX3gsk1oPIBhDxwGebkhY_9Njea1zIbJQDJ0RtNZOPdb7pmtqKPLU1KEYOjO4c25J2OOrx2qegkiqUAyd1aTPzpV6IG5y6bm4igtjykAOpSFKGC_YCwJPPzwoeF4lIrQNKE1Rwvhx7o3Ag86DZmCgDO4rSkuUMM6U6gjcmXuQg-HS6HAQi9dhPFu7IHBLeLDQhahk6J-g4i0w8AU6D2583VINQWZFKtdtbX8BLcQCQw&limit=25 HTTP/1.1", upstream: "https://10.127.105.123:443/medic/_changes?seq_interval=1000&since=2083664-g1AAAAO1eJyV0zFLAzEUB_DYKwjioIIfwNVBLne5S2_SL-CgbRzc8nJ3HEetk5sg-B1cXNy0zeYncFa63mcQ6uxmG595axFChj8k5EfyXjJmjG03Ucl2zfWNaUo44Yk8inHwMS71NIN9pUZtEwGLnt6vcG5TC5mBydft-Uf6cxQcWjsj7H7PYwNR8TwRoRg6Fo6dWxEWX3gsk1oPIBhDxwGebkhY_9Njea1zIbJQDJ0RtNZOPdb7pmtqKPLU1KEYOjO4c25J2OOrx2qegkiqUAyd1aTPzpV6IG5y6bm4igtjykAOpSFKGC_YCwJPPzwoeF4lIrQNKE1Rwvhx7o3Ag86DZmCgDO4rSkuUMM6U6gjcmXuQg-HS6HAQi9dhPFu7IHBLeLDQhahk6J-g4i0w8AU6D2583VINQWZFKtdtbX8BLcQCQw&limit=25", host: "nairobi.echis.go.ke"
So there are at least two issues.
Per yesterday's weekly sync, closing this issue as the error doesn't seem to be related to CHT Sync.
There have been two separate instances reported where, when CHT Sync was turned on, the CHT Core instance that couch2pg runs against experienced 500 errors. We should investigate this.
Re-appropriating this ticket, but keeping the former content so there's continuity of the ticket. This turned out to be crashes in CHT Sync, and no crashes were found in CHT Core. So, with that in mind, we should:
see if they're indeed caused by couch2pg