nmathew98 / shared-resources

MIT License

docs: services #21

Open nmathew98 opened 4 weeks ago

nmathew98 commented 4 weeks ago

Services startup time: ~20 minutes

nmathew98 commented 4 weeks ago

authnz fell over:

338004596-ed804dc1-8d2b-405f-9078-4a62f374c4f7

338004614-820b171a-5092-4670-9099-71f18a3c32fb

connection to db failed:

338004630-09155b8b-ade6-4135-b433-6e7b99de34ff

338004638-63443dbd-192d-4d3e-8be3-2a18b65ac77e

down time: ~1 hr

fix:

related:

additional notes:

image image

is holding one each now:

image
nmathew98 commented 3 weeks ago

rollout is down, nothing in logs:

image image image image

https://github.com/nmathew98/shared-resources/actions/runs/9543913640/job/26301571969#step:4:30

last logs from rollout:

image image image image

stopping the server completely and restarting it fixed it

all downs:

image

fix:

nmathew98 commented 2 weeks ago
Elastic died and restarted. unsure exactly what happened, but logs indicate it failed to write an index because it was unable to connect to a node (`telemetry_elasticsearch.3`, ~`2024-06-19T06:29:58,346Z`):

image image image image

`elasticsearch.3` was unable to connect to `elasticsearch.1`, causing a crash (~`2024-06-19T06:29:41,346Z`):

image image

Has been stable for 2 hours

might be running into memory issues causing the nodes to die:

image

recovering fine but nodes are dying:

image

updated config:

elastic heap: 1gb (from 1gb)
Logstash heap: 0.5gb (from 0.5gb)

tentative fix (did not work):

- disable elastic memory lock to allow swap
- search might be slower but we have more disk space than memory
- up 2 days fine, 1 week mark should be enough, after which v0.0.2

updated config:

elastic heap: 2gb (from 1gb)
Logstash heap: 0.5gb (from 0.5gb)

- no issues with logstash failing

day 3 of updated config:

image
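
possible way to confirm the memory theory instead of waiting for nodes to die: a minimal sketch, assuming the parsed JSON from Elasticsearch's `GET /_nodes/stats/jvm` is available, which reports `jvm.mem.heap_used_percent` per node. The function name, the 85% threshold, and the sample values are hypothetical and not taken from this deployment.

```python
from typing import Any


def nodes_over_heap_threshold(
    stats: dict[str, Any], threshold_pct: int = 85
) -> list[tuple[str, int]]:
    """Return (node name, heap_used_percent) for nodes at or above threshold_pct.

    `stats` is the parsed JSON body of GET /_nodes/stats/jvm.
    The default threshold is an arbitrary starting point, not a recommendation.
    """
    flagged = []
    for node in stats.get("nodes", {}).values():
        used = node["jvm"]["mem"]["heap_used_percent"]
        if used >= threshold_pct:
            flagged.append((node["name"], used))
    # Highest heap pressure first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Sample payload trimmed to the fields used above; node ids and values are made up.
sample = {
    "nodes": {
        "a1": {"name": "elasticsearch.1", "jvm": {"mem": {"heap_used_percent": 62}}},
        "b2": {"name": "elasticsearch.3", "jvm": {"mem": {"heap_used_percent": 93}}},
    }
}

print(nodes_over_heap_threshold(sample))
```

running this on a schedule and logging the result would show whether heap usage climbs before a node drops out, which is hard to tell from the crash logs alone.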
nmathew98 commented 1 week ago

error with plugin loading after redeploy. can't find anything on cause, clearing session was of no use. letting things run for a while might fix it

image