docs: services - Githubissues

nmathew98 commented 4 weeks ago

telemetry: https://telemetry.skulpture.xyz
rollout: https://rollout.skulpture.xyz
authnz: https://authnz.skulpture.xyz

Services startup time: ~20 minutes

nmathew98 commented 4 weeks ago

authnz fell over:

338004596-ed804dc1-8d2b-405f-9078-4a62f374c4f7

338004614-820b171a-5092-4670-9099-71f18a3c32fb

connection to db failed:

338004630-09155b8b-ade6-4135-b433-6e7b99de34ff

338004638-63443dbd-192d-4d3e-8be3-2a18b65ac77e

down time: ~1 hr

fix:

trying pooled connections by limiting via env. pg bouncer will not work because each keycloak instance tries to make a connection on startup and if that fails it is fatal, causing an endless restart

there was a point when each container was only holding 1 connection each
each container is holding max number of allowed connections when KC_DB_POOL_MAX_SIZE = 2, not sure if will be released:

is holding one each now:

nmathew98 commented 3 weeks ago

rollout is down, nothing in logs:

unable to connect via ssh either

https://github.com/nmathew98/shared-resources/actions/runs/9543913640/job/26301571969#step:4:30

last logs from rollout:

nothing from nginx or flipt indicates the server died, looks like the vm lost its connectivity??

stopping the server completely and restarting it fixed it

all downs:

fix:

single node

nmathew98 commented 2 weeks ago

`elasticsearch.3` was unable to connect to `elasticsearch.1` causing a crash (~`2024-06-19T06:29:41,346Z`):

Has been stable for 2 hours might be running into memory issues causing the nodes to die:

recovering fine but nodes are dying:

updated config: elastic heap: 1gb (from 1gb) Logstash heap: 0.5gb (from 0.5gb) tentative fix (did not work): - disable elastic memory lock to allow swap - search might be slower but we have more disk space than memory - up 2 days fine, 1 week mark should be enough, after which v0.0.2

updated config: elastic heap: 2gb (from 1gb) Logstash heap: 0.5gb (from 0.5gb) - no issues with logstash failing

day 3 of updated config:

nmathew98 commented 1 week ago

error with plugin loading after redeploy. can't find anything on cause, clearing session was of no use. letting things run for a while might fix it

nmathew98 / shared-resources

docs: services #21