sul-dlss / dlme

Digital Library of the Middle East web application, based on Spotlight

https://dlmenetwork.org/

Make the DLME solr infrastructure more robust. #1322

Closed aaron-collier closed 2 years ago

aaron-collier commented 3 years ago

https://app.honeybadger.io/projects/53082/faults/81160819

was the result of Solr not running on the server. We should have a more robust process for discovering and recovering from this situation.

The current DLME solr instance is running on a single-node EC2 / ECS instance for each environment. The webapp nodes point at the solr node by IP. The solr instance is (seemingly) not directly monitored or alerted on.

Things to consider doing:

[ ] use DNS names for the solr instance, not IP. This should allow us to build, rebuild and scale the stack without needing to rebuild the webapp environment too
[ ] expand the production solr instance into a multi-node SolrCloud cluster
[ ] add monitoring + alerting to the solr endpoints
[ ] (from notes on #1473) the Solr EC2 instance does not (but should) have the CloudWatch Agent installed so we get disk + RAM alerts from EC2.
[ ] make sure the solr data is persisted outside the specific node
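
For the monitoring item above, a minimal sketch of an endpoint check in Python. It assumes Solr's standard ping handler (`/solr/<core>/admin/ping`) is reachable from wherever the check runs; the base URL and core name are placeholders to be supplied by the caller, not the real DLME values, and hooking the result into Nagios or CloudWatch is left out.

```python
import json
import urllib.error
import urllib.request


def ping_status(body: str) -> bool:
    """Return True if a Solr ping response body reports status OK."""
    try:
        return json.loads(body).get("status") == "OK"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object: treat as unhealthy.
        return False


def solr_is_healthy(base_url: str, core: str, timeout: float = 5.0) -> bool:
    """Ping one Solr core; any network or HTTP error counts as unhealthy."""
    # base_url and core are supplied by the caller (hypothetical values here).
    url = f"{base_url}/solr/{core}/admin/ping?wt=json"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return ping_status(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        return False
```

A check script could call something like `solr_is_healthy("http://solr.example.internal:8983", "dlme")` (both arguments hypothetical) and exit non-zero when it returns False, so that an outage is reported as a Solr problem rather than only as a failing application status check.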

anarchivist commented 2 years ago

A couple points from the most recent Solr outage we saw, raised by @cbeer (feel free to edit if I didn't get the nuance right):

Solr only has 1 VM associated with it, which means it's not super resilient. Should we add another/set up SolrCloud/etc.?
Monitoring is a need; Ops didn’t get an alert that the Solr VM’s disk filled up, only that the application was failing its status check
When we redeploy Solr, we end up blowing away the data in the index - this is less than ideal, so how can we address/improve this?
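
On the disk-full point, a rough sketch of a threshold check that could run on the Solr VM until proper agent-based alerting is in place. The 80% / 90% thresholds are arbitrary assumptions, and the path to check (for example the Solr data directory) is up to the caller.

```python
import shutil


def disk_alert(used: int, total: int,
               warn_pct: float = 80.0, crit_pct: float = 90.0) -> str:
    """Classify disk usage as OK, WARNING, or CRITICAL from byte counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    pct = 100.0 * used / total
    if pct >= crit_pct:
        return "CRITICAL"
    if pct >= warn_pct:
        return "WARNING"
    return "OK"


def check_path(path: str) -> str:
    """Check the filesystem that holds `path` using the default thresholds."""
    usage = shutil.disk_usage(path)
    return disk_alert(usage.used, usage.total)
```

Keeping the classification separate from the `shutil.disk_usage` call makes the thresholds easy to test without touching a real filesystem.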

thatbudakguy commented 2 years ago

incident report post-2.4.1 deploy:

full reindex was kicked off with ECS exec, following the process mentioned in this comment
@rsmith11 reported that nagios was alerting solr was inaccessible
i tried to run a search at dlmenetwork.org and received a 500 error
i checked the aws cloudwatch logs and found that the error was due to solr not responding to the spotlight instance
@cbeer and i ssh'ed into the bastion host, and then into the solr vm itself. logs in /var/solr/log/ indicated that the out-of-memory killer had terminated solr
ps aux | grep solr revealed that solr had been started with the -Xms512m option (512meg java heap size), which is the default, but the instance had 4G of RAM available
we altered the solr config file /etc/default/solr.in.sh to uncomment and set SOLR_HEAP="2048m" (2G heap size) and restarted solr with service solr restart
the nagios alerts cleared
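
The heap check in the steps above (reading the -Xms value out of the ps output and comparing it with the RAM on the instance) can be sketched in Python as follows. The command line in the usage note is illustrative, not copied from the server, and the function names are made up for this example.

```python
import re

# Multipliers for the JVM size suffixes (no suffix means bytes).
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def heap_flags(cmdline: str) -> dict:
    """Extract -Xms / -Xmx values, in bytes, from a java command line."""
    found = {}
    for flag, number, unit in re.findall(r"-(Xms|Xmx)(\d+)([kKmMgG]?)\b", cmdline):
        found[flag] = int(number) * _UNITS[unit.lower()]
    return found


def heap_fraction(cmdline: str, total_ram_bytes: int) -> float:
    """Return the maximum heap (-Xmx, else -Xms) as a fraction of total RAM."""
    flags = heap_flags(cmdline)
    heap = flags.get("Xmx", flags.get("Xms"))
    if heap is None:
        raise ValueError("no -Xms/-Xmx flag found")
    return heap / total_ram_bytes
```

With an illustrative command line such as `java -server -Xms512m -Xmx512m -jar start.jar` and 4G of RAM, `heap_fraction` gives 0.125; with the 2048m setting it gives 0.5.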

corylown commented 2 years ago

Similar story this morning (March 4, 2022):

@jacobthill reported via slack that dlme prod was returning a 500 for requests involving a solr request.
I confirmed the issue and that solr was not responding.
I connected to the solr vm through the bastion host and found that the out of memory killer had terminated solr.
I restarted solr with service solr restart and confirmed that dlme prod was responding normally after the restart.

thatbudakguy commented 2 years ago

[x] turn on docValues to improve performance

corylown commented 2 years ago

On March 8, 2022 we migrated the instance type for DLME Solr prod from t2.medium to t2.large and increased the SOLR_HEAP to 4G.

corylown commented 2 years ago

Closing this issue because the plan is to move the web application & solr on premises, which will take advantage of onsite infrastructure and monitoring.