mediacloud / news-search-api

Internal API server that offers search access to the Media Cloud Online News Archive (in Elasticsearch).
https://mediacloud.org
GNU Affero General Public License v3.0

throwing index name errors post ILM migration #63

Closed rahulbot closed 6 months ago

rahulbot commented 6 months ago

After the ILM migration we have an issue where we can't search via the new mc_search alias; we've had to hard-code the single index name in the providers library. We need to fix this, because it might be the cause of a downstream problem with paging (and it won't work as soon as we have a 2nd index via the new ILM features).

Note: Index-specific queries are not a user requirement. From the data consumer's point of view, throughout the read-only part of our stack, the corpus of stories is considered to be a single data store.

rahulbot commented 6 months ago

@kilemensi chimed in on Slack with "my vote would be on an environmental variable for the alias name + updating code to work with alias where necessary".
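For illustration, the environment-variable idea could look roughly like the sketch below. The variable name `ES_INDEX_ALIAS` and the helper `get_index_alias` are hypothetical (nothing in this thread names them); only the `mc_search` default comes from the discussion.

```python
import os

# Default taken from this thread; the env var name below is an assumption.
DEFAULT_ALIAS = "mc_search"


def get_index_alias(env=None):
    """Return the alias to search, preferring a (hypothetical) ES_INDEX_ALIAS setting."""
    env = os.environ if env is None else env
    value = env.get("ES_INDEX_ALIAS", "").strip()
    # Fall back to the default when the variable is unset or blank.
    return value or DEFAULT_ALIAS
```

Passing a plain dict as `env` keeps the helper easy to test without touching the real process environment.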

pgulley commented 6 months ago

I have PR #62, which just adds the index aliases to the index discovery process. I can update that PR to also look for that environment variable so we can hard-code it in the future, but I think leaving the discovery in for the moment and just merging would be the quickest way to move on that. We'd also want to merge #61, although I have a comment in there that's unaddressed. @kilemensi besides calling by id, are there any other ES calls that you know would be affected by using aliases instead? I've found the documentation kind of sparse on this point.
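On the call-by-id point: as far as I understand, a single-document GET against an alias is rejected once that alias points at more than one index, while a search using an `ids` query can span every index behind the alias. A minimal sketch of building such a request follows; the helper name `build_id_lookup` is hypothetical and this is not the code from #61 or #62.

```python
def build_id_lookup(target, doc_id):
    """Build search arguments that fetch one document by _id via an alias or wildcard.

    `target` may be an alias (e.g. "mc_search") or an index pattern.
    """
    return {
        "index": target,
        "body": {
            # The `ids` query matches documents by _id across all targeted indices.
            "query": {"ids": {"values": [doc_id]}},
            "size": 1,
        },
    }
```

The returned dict is meant to be handed to whatever search call the API server already uses; that wiring is left out because it depends on the client version in use.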

philbudne commented 6 months ago

Speaking as someone with REALLY minimal exposure to the ES API (I only did a brief exploration of lookup by _id):

Regarding an environment variable to signify the index/alias: "mc_search" is already wired into many places (two .json template files and importer.py), so it may be fair game to wire it into both "providers" and "news-search-ui" as well. Still, (for some reason) I'm hesitant to remove the idea of the API being able to access multiple "collections", even if the code both above and below it has the collection name hard-wired in.

Given that we made a last minute change to make the alias a prefix of the generated index names, I think it's TOTALLY fair to lean into this fact and use a wildcard built on the alias name, or the alias, whichever makes more sense at a given spot in the code.
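A small sketch of the wildcard idea, assuming only that generated index names start with the alias. The example index names in the usage note are made up, and the helpers `alias_wildcard` and `matching_indices` are hypothetical.

```python
import fnmatch


def alias_wildcard(alias):
    """Return a pattern matching the alias itself and any index prefixed by it."""
    return alias + "*"


def matching_indices(alias, index_names):
    """Filter discovered index names down to those covered by the alias wildcard."""
    pattern = alias_wildcard(alias)
    return [name for name in index_names if fnmatch.fnmatchcase(name, pattern)]
```

For example, `matching_indices("mc_search", ["mc_search-000001", "mc_search-000002", "unrelated"])` keeps the first two names and drops the third.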

rahulbot commented 6 months ago

Looks like #62 resolved this.