mediacloud / story-indexer

The core pipeline used to ingest online news stories in the Media Cloud archive.
https://mediacloud.org
Apache License 2.0

Elasticsearch secrets & repository configuration #297

Closed thepsalmist closed 4 months ago

thepsalmist commented 5 months ago

This builds on some of the limitations noted in the current elastic-conf service, partly mentioned here.

We need to automate the setup of Elasticsearch's S3 secrets (access-key and secret-access-key) so that we can take snapshots and upload them to S3. The current setup relies on the secrets and repository that were set up during the first manual ES snapshot. Elasticsearch recommends adding the secrets to elasticsearch-keystore as per the following commands.

Questions

  1. We currently use elastic-conf as our entry point to ES configuration. Running the above commands via elastic-conf would mean using subprocess with sudo access. This does not sound like a good idea, so elastic-conf would not be the ideal place for this?
  2. We need to cater for ES running both on bare metal and in Docker containers (dev/staging).

Proposal

  1. Add S3 repository creation & validation as part of elastic-conf.
  2. Create a separate script, similar to deploy.sh, to set & reload secure settings in Elasticsearch (both bare-metal & container setups). Should this be triggered on/from deploy.sh?

philbudne commented 5 months ago

There may be a knot of issues here.

I think the overall goal should be to do all configuration (or as much as possible) from scripts.

I wrote story-indexer/deploy.sh with the goal of being able to run it with ONLY docker group membership, without presuming that the user has sudo access. So if there are commands or config files that require root, they belong elsewhere: possibly in the elastic setup/install script (docker/elastic-deploy.sh?), using parameters from the config repo.
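
To make that split concrete, here is a minimal Python sketch of how a root-capable setup script (outside deploy.sh) could build the keystore commands for both bare-metal and Docker nodes. The setting names s3.client.default.access_key and s3.client.default.secret_key are the standard secure settings for the default S3 client; the function name, the install path /usr/share/elasticsearch, and the container name es-staging are assumptions for illustration, not anything taken from the repo.

```python
import shlex
from typing import List, Optional

# Standard secure setting names for the default S3 client.
S3_SECURE_SETTINGS = (
    "s3.client.default.access_key",
    "s3.client.default.secret_key",
)

# Assumed install path (hypothetical default; adjust per host).
ES_HOME = "/usr/share/elasticsearch"


def keystore_add_command(setting: str, container: Optional[str] = None) -> List[str]:
    """Build the argv that adds one secure setting, reading its value from stdin."""
    cmd = [f"{ES_HOME}/bin/elasticsearch-keystore", "add", "--stdin", "--force", setting]
    if container is not None:
        # Docker case: run the same tool inside the container, keeping stdin open.
        cmd = ["docker", "exec", "-i", container] + cmd
    return cmd


# Print the commands rather than running them; "es-staging" is a made-up container name.
for name in S3_SECURE_SETTINGS:
    print(shlex.join(keystore_add_command(name)))
    print(shlex.join(keystore_add_command(name, container="es-staging")))
```

The secret value would be piped in on stdin (hence --stdin and docker exec -i), the command would need to run on every node, and the settings could then be picked up without a restart via POST /_nodes/reload_secure_settings.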

Further thoughts/notes:

philbudne commented 5 months ago

Continuing: since staging runs ES under Docker, if the keystore and/or repo config cannot be done from elastic-conf.py, a staging stack cannot be created from scratch without manual intervention.

kilemensi commented 5 months ago

I think the best way forward is to break down this issue into at least 3 separate tasks:

  1. Managing Elasticsearch secrets: Ideally deploy.sh will be all we need, but given that the secure settings can only be managed via the elasticsearch-keystore command, we may have to implement this via our Elasticsearch deployment scripts: elastic-deploy.sh for the bare-metal/PROD cluster, and either a bind mount or a custom image (or custom CMD/Entrypoint script) for the docker/staging cluster.
  2. Snapshot repository: Again, two subtasks here: i. creating an S3 bucket, and ii. registering that S3 bucket as a snapshot repository in Elasticsearch. Should the elastic-conf script do both of these tasks, or should creation of the S3 bucket remain outside this script?
  3. SLM Policy: If the script can now create/register repositories at will, how does it affect the current implementation of SLM policy management?

Since we have a manual way of doing 1 at the moment, I think we should start by implementing tasks 2 and 3 while we brainstorm on the best way to automate 1.
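
As a rough illustration of tasks 2(ii) and 3, the sketch below builds the REST requests that register an S3 bucket as a snapshot repository, verify it, and attach an SLM policy to it. The endpoints (PUT /_snapshot/<repo>, POST /_snapshot/<repo>/_verify, PUT /_slm/policy/<id>) are the standard Elasticsearch APIs; the function names, the schedule, the snapshot name pattern, and the retention value are placeholder assumptions, not the project's actual settings.

```python
import json
from typing import Any, Dict, Tuple

Request = Tuple[str, str, Dict[str, Any]]  # (HTTP method, path, JSON body)


def s3_repository_request(repo: str, bucket: str, base_path: str = "") -> Request:
    """Request that registers an existing S3 bucket as a snapshot repository."""
    settings: Dict[str, Any] = {"bucket": bucket}
    if base_path:
        settings["base_path"] = base_path
    return "PUT", f"/_snapshot/{repo}", {"type": "s3", "settings": settings}


def verify_repository_request(repo: str) -> Request:
    """Request that asks the nodes to verify they can reach the repository."""
    return "POST", f"/_snapshot/{repo}/_verify", {}


def slm_policy_request(policy_id: str, repo: str) -> Request:
    """Request that creates or updates an SLM policy targeting the repository."""
    body = {
        "schedule": "0 30 1 * * ?",            # placeholder cron schedule
        "name": "<snapshot-{now/d}>",           # placeholder snapshot name pattern
        "repository": repo,
        "config": {"indices": ["*"]},
        "retention": {"expire_after": "30d"},   # placeholder retention
    }
    return "PUT", f"/_slm/policy/{policy_id}", body


# Hypothetical repository, bucket, and policy names, for illustration only.
for method, path, body in (
    s3_repository_request("example-repo", "example-bucket"),
    verify_repository_request("example-repo"),
    slm_policy_request("example-policy", "example-repo"),
):
    print(method, path, json.dumps(body))
```

Because the policy body names its repository, registering the repository and then writing the policy in the same run keeps the two in step; bucket creation is left out of the sketch, since that is the open question in task 2.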

philbudne commented 5 months ago

Today I found out that elastic-config.py errors out if ELASTICSEARCH_SNAPSHOT_REPO is not provided (as is the case for a developer stack), and the ES index is not created.

My position has been that developer stacks should not require any external storage keys (the archiver leaves local archive files).

I think it would be fine for elastic-config.py to log missing parameters related to snapshots at ERROR priority, but I'm open to discussion...
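
One way to read that suggestion, as a minimal Python sketch: a missing ELASTICSEARCH_SNAPSHOT_REPO is logged at ERROR priority and only the snapshot-related steps are skipped, so the index still gets created. The function names and step labels are hypothetical and do not reflect the real structure of the script.

```python
import logging
import os
from typing import List, Mapping, Optional

logger = logging.getLogger("elastic-conf-sketch")


def snapshot_repo(env: Mapping[str, str]) -> Optional[str]:
    """Return the configured snapshot repo name, or None (after logging) if unset."""
    repo = env.get("ELASTICSEARCH_SNAPSHOT_REPO", "").strip()
    if not repo:
        logger.error("ELASTICSEARCH_SNAPSHOT_REPO not set; skipping snapshot setup")
        return None
    return repo


def planned_steps(env: Mapping[str, str]) -> List[str]:
    """List the (hypothetical) configuration steps that would run for this environment."""
    steps = ["create_index"]  # always runs, even on a developer stack
    if snapshot_repo(env) is not None:
        steps += ["register_repository", "create_slm_policy"]
    return steps


print(planned_steps(os.environ))
```

With this shape a developer stack still logs a visible error, but the failure no longer blocks index creation.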

kilemensi commented 5 months ago

Yeah @philbudne I agree on DEV working without secrets... I had suggested using a filesystem repository for DEV a while back, not sure if you and @thepsalmist have had a chance to look at it.
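
For reference, a small Python sketch of what the filesystem fallback could look like: when no bucket is configured, the repository body switches from type "s3" to type "fs", which needs no storage keys. The function name and the default location are assumptions; a shared filesystem repository also requires its location to be listed under path.repo in elasticsearch.yml on every node.

```python
from typing import Any, Dict, Optional


def repository_body(
    bucket: Optional[str] = None,
    fs_location: str = "/usr/share/elasticsearch/backup",  # hypothetical path
) -> Dict[str, Any]:
    """Body for PUT /_snapshot/<repo>: S3 when a bucket is given, filesystem otherwise."""
    if bucket:
        return {"type": "s3", "settings": {"bucket": bucket}}
    # DEV fallback: no external storage keys needed.
    return {"type": "fs", "settings": {"location": fs_location}}


print(repository_body())                  # developer stack
print(repository_body("example-bucket"))  # staging/production style
```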