mitodl / ol-infrastructure

Infrastructure automation code for use by MIT Open Learning
BSD 3-Clause "New" or "Revised" License

Reduce Impact of reindex events from OCW / Open Discussions #1199

Open Ardiea opened 2 years ago

Ardiea commented 2 years ago

A reindex event from OCW / Open Discussions can have a serious negative impact on the availability of OCW search. How can we reduce that impact?

blarghmatey commented 2 years ago

One option to consider is deploying an "ingest node" to handle the heavy lifting of mass re-index events so that search performance is not negatively impacted.

pdpinch commented 2 years ago

Is this related to https://sentry.io/organizations/mit-office-of-digital-learning/issues/3532770150/ ?

Ardiea commented 1 year ago

There was another batch of error messages and alerts on Feb 13th.

Also, we can set the number of replicas on our indices to 3, which is the minimum number of nodes we ever have in our clusters. That would let the node that receives a request serve it locally, rather than routing it to another node 33% of the time, which adds another transport hop and does not make efficient use of the resources we pay for. I think we can do that here: https://github.com/mitodl/open-discussions/blob/master/search/indexing_api.py#L312 The replica count is a dynamic index setting: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html#dynamic-index-settings
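A rough sketch of the settings body for that change, not the actual code in `indexing_api.py`. The helper name `replica_settings_body` is hypothetical. One caveat: Elasticsearch counts replicas in addition to the primary shard, so a full copy on each of 3 nodes corresponds to `number_of_replicas: 2`. With 3 replicas on a 3-node cluster, one replica per shard would stay unassigned. The helper therefore takes the node count and subtracts one.

```python
def replica_settings_body(min_node_count: int) -> dict:
    """Build an index settings body that puts one shard copy on every node.

    Hypothetical helper for illustration. Elasticsearch counts replicas on top
    of the primary, so N nodes need N - 1 replicas for one copy per node.
    """
    if min_node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return {"index": {"number_of_replicas": min_node_count - 1}}


body = replica_settings_body(3)

# Assumption: with an elasticsearch-py 7.x client this body could be applied to
# a live index, since number_of_replicas is a dynamic setting, e.g.
#   client.indices.put_settings(index="some-index-name", body=body)
# The index name here is a placeholder, not one of our real indices.
```

Because the setting is dynamic, it can be applied to an existing index without reindexing or closing it.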

https://discuss.elastic.co/t/connection-pooling-in-python/189725/6 I think we can try setting the connection pool size discussed there in this dict: https://github.com/mitodl/open-discussions/blob/6e7a1081eee6f8515afc29ed230148e2741dd5f7/search/connection.py#L23-L30 Based on the statements in that thread, the pool size ends up being either 10 or 30, depending on whether we are ‘sniffing’ the number of nodes.
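A minimal sketch of what setting the pool size in that dict might look like, assuming the dict is passed as keyword arguments to an elasticsearch-py 7.x client. There, `maxsize` is the per-node connection pool size (default 10) and `sniff_on_start`, `sniff_on_connection_fail`, and `sniffer_timeout` control sniffing. The helper names and the host value are hypothetical, and reading "10 or 30" as "10 per node times 1 configured host or 3 sniffed nodes" is my interpretation, not something confirmed in `connection.py`.

```python
def build_connection_kwargs(hosts, pool_size_per_node=10, sniff=False):
    """Hypothetical builder for the client keyword arguments.

    Assumes elasticsearch-py 7.x parameter names; other versions may differ.
    """
    kwargs = {
        "hosts": list(hosts),
        # Per-node connection pool size (the library default is 10).
        "maxsize": pool_size_per_node,
    }
    if sniff:
        # Ask the cluster for its node list so the client talks to all nodes.
        kwargs.update(
            {
                "sniff_on_start": True,
                "sniff_on_connection_fail": True,
                "sniffer_timeout": 60,
            }
        )
    return kwargs


def total_pool_capacity(known_nodes, pool_size_per_node=10):
    """Total connections the client can hold across all nodes it knows about."""
    return known_nodes * pool_size_per_node


# Placeholder host; without sniffing the client only knows this one endpoint.
kwargs = build_connection_kwargs(["es.example.internal:9200"], pool_size_per_node=25)
```

Under that interpretation, raising `maxsize` increases the capacity whether or not sniffing is enabled, while sniffing multiplies it by the number of discovered nodes.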