rahulbot opened this issue 5 months ago
- ES didn't restart on Ramos
Ramos came back up without the array filesystem (/srv/data) mounted, and I'm not aware of any manual action taken to restart ES once the array FS reappeared.
I can't help wondering if all the delays were due to filesystem checks; Ramos has the largest array and took the longest to come back (despite being the system with the most CPU/memory).
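For illustration, here's a rough sketch of waiting for the array mount before starting ES. The `elasticsearch` systemd unit name, the mount path as the only dependency, and the polling interval are all assumptions rather than our actual setup; systemd's `RequiresMountsFor=` on the ES unit would express the same dependency declaratively.

```python
#!/usr/bin/env python3
"""Sketch: don't start ES until the array filesystem is actually mounted.

Assumptions (not from this issue): ES runs as a systemd unit named
"elasticsearch", and /srv/data is the mount point ES depends on.
"""
import os
import subprocess
import time

MOUNT_POINT = "/srv/data"     # array filesystem discussed above
ES_SERVICE = "elasticsearch"  # assumed systemd unit name
POLL_SECONDS = 30

def main() -> None:
    # Wait until the array FS is mounted, so ES never starts against
    # a bare mount point (which appears to be what happened on Ramos).
    while not os.path.ismount(MOUNT_POINT):
        print(f"{MOUNT_POINT} not mounted yet; checking again in {POLL_SECONDS}s")
        time.sleep(POLL_SECONDS)

    # Once the data directory is available, (re)start ES.
    subprocess.run(["systemctl", "start", ES_SERVICE], check=True)
    print(f"started {ES_SERVICE} with {MOUNT_POINT} mounted")

if __name__ == "__main__":
    main()
```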
I think the actions to consider are:
1. Understand the restart delays, and any actions that might be taken to reduce or eliminate them.
2. Consider running a news-search-api instance on every ES server (with each NSA instance speaking only to its local ES server): this would have allowed the web search site to become available as soon as any two of the three ES servers were up (see the sketch after this list).
3. Longer term: make it possible to run the pipeline when only two ES servers are available (NOTE! permanent loss of an ES server is a precursor to total failure, and is NOT to be taken lightly). Two approaches:
   a. Run the indexer stack on a docker swarm/cluster consisting of all ES servers.
   b. Run the indexer stack on a "compute" docker server (or cluster) SEPARATE from the ES servers. Consider running the importer on each ES server, speaking only to the local ES.
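To make item 2 concrete, here's a minimal sketch of the readiness check each per-node NSA instance (or whatever sits in front of them) could use. It assumes the official `elasticsearch` Python client and ES listening on localhost:9200, which may not match how news-search-api is actually configured.

```python
"""Sketch for item 2: each NSA instance talks only to its local ES node.

Assumptions (not confirmed in this issue): the official `elasticsearch`
Python client, ES on localhost:9200, and a 3-node cluster.
"""
from elasticsearch import Elasticsearch

LOCAL_ES = "http://localhost:9200"  # each NSA instance only uses its own node
MIN_DATA_NODES = 2                  # search can serve with 2 of 3 nodes up


def local_node_ready() -> bool:
    """True when the local node is up and the cluster has enough data nodes."""
    es = Elasticsearch(LOCAL_ES)
    if not es.ping():               # local ES process not reachable yet
        return False
    health = es.cluster.health()    # cluster-wide view as seen from this node
    return (
        health["number_of_data_nodes"] >= MIN_DATA_NODES
        and health["status"] in ("yellow", "green")
    )


if __name__ == "__main__":
    # A front-end health check could gate traffic to this NSA instance on
    # this condition, so search comes back as soon as any two of the three
    # ES servers are available.
    print("ready" if local_node_ready() else "not ready")
```

The point is that each NSA instance only depends on its own node being up plus cluster quorum, rather than on all three servers finishing their reboots.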
More notes on why Ramos and Bradley took another two hours to become SSH-able:
If I remember correctly, the delay was connected to us needing to go through and change the UEFI/BIOS settings on all of the machines after replacing the CMOS batteries. And a few of your machines still required legacy BIOS boot for some reason, unlike the others.
And a related note on UPS:
Also: if you ever plan on replacing your battery backup (UPS) devices, please check in with me first about your plans. Most of our racks are set up for 220V and your two UPS units are 110V. We’d like to have everything standardized to 220V, if possible.
@philbudne had suggested getting UMass IT to run a reboot test so we could watch the reboot process. Is this still needed after the above response?
... having gone through the message, my vote would be YES. Every previous outage has uncovered issues with the current setup (misconfigured software, hardware, etc.). I think it's important to test a reboot just to be sure that we're now properly PROD (as well as to determine how long the system takes to come back up on a good day).
A fan replacement on one of the machines is still pending from them; we can bundle these tasks for them when that is scheduled.
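If the reboot test happens, something like the following rough timing harness could answer the "how long on a good day" question by recording when SSH and ES come back on each box. The host list only includes the machines named in this issue and the ports are assumed defaults.

```python
"""Sketch: record how long each server takes to come back during a reboot test.

Host list is a placeholder (only the machines named in this issue); the
third ES server and any other boxes would be added.
"""
import socket
import time

HOSTS = ["ramos", "bradley"]     # placeholder: add the third ES server
PORTS = {"ssh": 22, "es": 9200}  # assumed standard ports


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def main() -> None:
    start = time.time()
    pending = {(host, name, port) for host in HOSTS for name, port in PORTS.items()}
    while pending:
        for host, name, port in sorted(pending):
            if port_open(host, port):
                print(f"{host}: {name} reachable after {time.time() - start:.0f}s")
                pending.discard((host, name, port))
        time.sleep(15)


if __name__ == "__main__":
    main()
```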
We had a scheduled outage Wed June 5th to replace batteries (#285) and try out rebooting. What went wrong or merits investigation? Please add items to this list and/or add a comment with the relevant explanation (@thepsalmist @philbudne), or break items off into new issues if they're more complex. Things I noted: