Closed by aaron-collier 2 years ago
A couple points from the most recent Solr outage we saw, raised by @cbeer (feel free to edit if I didn't get the nuance right):
incident report post-2.4.1 deploy:
- Logs in `/var/solr/log/` indicated that the out-of-memory killer had terminated solr.
- `ps aux | grep solr` revealed that solr had been started with the `-Xms512m` option (512 MB Java heap size), which is the default, but the instance had 4G of RAM available.
- Edited `/etc/defaults/solr.in.sh` to uncomment and set `SOLR_HEAP="2048m"` (2G heap size) and restarted solr with `service solr restart`.
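For reference, the heap change described above amounts to a one-line edit in `solr.in.sh`. A minimal sketch of the relevant fragment, assuming the stock Solr start script, which uses `SOLR_HEAP` for both the initial (`-Xms`) and maximum (`-Xmx`) heap size:

```shell
# Fragment of solr.in.sh (path as given above; the rest of the file is unchanged).
# The stock file typically ships this setting commented out, so Solr
# falls back to its default 512m heap:
#SOLR_HEAP="512m"

# Uncommented and raised to 2G, leaving headroom on a 4G instance:
SOLR_HEAP="2048m"
```

After `service solr restart`, `ps aux | grep solr` should show `-Xms2048m -Xmx2048m` in place of `-Xms512m`.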
Similar story this morning 3/4/22:
- Ran `service solr restart` and confirmed that dlme prod was responding normally after the restart.
- `docValues` to improve performance

On March 8, 2022 we migrated the instance type for DLME Solr prod from t2.medium to t2.large and increased the `SOLR_HEAP` to 4G.
Closing this issue because the plan is to move the web application & solr on premise, which will take advantage of onsite infrastructure and monitoring.
https://app.honeybadger.io/projects/53082/faults/81160819 was the result of SOLR not running on the server. We should have a more robust process for discovering and recovering from this situation.
The current DLME solr instance is running on a single node EC2 / ECS instance for each environment. The webapp node points at the solr node by IP. The solr instance is (seemingly) not directly monitored or alerted on.
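As a starting point for the missing monitoring, here is a rough sketch of a health check that could run from cron or an external monitor. It assumes Solr listens on the default port 8983 and that the system info endpoint is reachable from wherever the script runs. `SOLR_URL` and the function names are hypothetical, not anything DLME currently has.

```shell
#!/bin/sh
# Hypothetical default; point this at the real Solr node.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Return 0 only when the HTTP status code is 200.
is_healthy_status() {
  [ "$1" = "200" ]
}

# Print the HTTP status code returned by Solr's system info endpoint.
# curl prints 000 when it cannot connect (e.g. Solr is not running).
solr_http_status() {
  curl --silent --output /dev/null --max-time 10 \
       --write-out '%{http_code}' "$1/admin/info/system?wt=json"
}

# Report Solr's health; returns non-zero so a caller can alert on failure.
check_solr() {
  status=$(solr_http_status "$SOLR_URL")
  if is_healthy_status "$status"; then
    echo "OK: Solr responded with HTTP $status"
  else
    echo "CRITICAL: Solr check failed (HTTP $status)" >&2
    return 1
  fi
}
```

Calling `check_solr` at the end of the script and scheduling it every few minutes would cover discovery. Whether a failure should only alert or also trigger `service solr restart` is a separate decision, since an automatic restart could mask a recurring out-of-memory problem.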
Things to consider doing: