sul-dlss / dlme

Digital Library of the Middle East web application, based on Spotlight
https://dlmenetwork.org/
Other
20 stars 2 forks source link

Make the DLME solr infrastructure more robust. #1322

Closed aaron-collier closed 2 years ago

aaron-collier commented 3 years ago

https://app.honeybadger.io/projects/53082/faults/81160819

was the result of SOLR not running on the server. We should have a more robust process for discovering and recovering from this situation..

The current DLME solr instance is running on a single node EC2 / ECS instance for each environment. The webapp node point at the solr node by IP. The solr instance is (seemingly) not directly monitored or alerted on.

Things to consider doing:

anarchivist commented 2 years ago

A couple points from the most recent Solr outage we saw, raised by @cbeer (feel free to edit if I didn't get the nuance right):

thatbudakguy commented 2 years ago

incident report post-2.4.1 deploy:

  1. full reindex was kicked off via ECS exec via the process mentioned in this comment
  2. @rsmith11 reported that nagios was alerting solr was inaccessible
  3. i tried to run a search at dlmenetwork.org and received a 500 error
  4. i checked the aws cloudwatch logs and found that the error was due to solr not responding to the spotlight instance
  5. @cbeer and i ssh'ed into the bastion host, and then into the solr vm itself. logs in /var/solr/log/ indicated that the out-of-memory killer had terminated solr
  6. ps aux | grep solr revealed that solr had been started with the -Xms512m option (512meg java heap size), which is the default, but the instance had 4G of RAM available
  7. we altered the solr config file /etc/defaults/solr.in.sh to uncomment and set SOLR_HEAP="2048m" (2G heap size) and restarted solr with service solr restart
  8. the nagios alerts cleared
corylown commented 2 years ago

Similar story this morning 3/4/22:

  1. @jacobthill reported via slack that dlme prod was returning a 500 for requests involving a solr request.
  2. I confirmed the issue and that solr was not responding.
  3. I connected to the solr vm through the bastion host and found that the out of memory killer had terminated solr.
  4. I restarted solr with service solr restart and confirmed that dlme prod was responding normally after the restart.
thatbudakguy commented 2 years ago
corylown commented 2 years ago

On March 8, 2022 we migrated the instance type for DLME Solr prod from t2.medium to t2.large and increased the SOLR_HEAP to 4G.

corylown commented 2 years ago

Closing this issue because the plan is to move the web application & solr on premise, which will take advantage of onsite infrastructure and monitoring.