opensearch-project / opensearch-migrations

All things migrations and upgrades for OpenSearch
Apache License 2.0

Managing Index Setups & Configuration in Microservices Environment #148

Open rursprung opened 1 year ago

rursprung commented 1 year ago

Is your feature request related to a problem? Please describe. when deploying OpenSearch as part of a larger application fleet in an environment (in our case: kubernetes) where any installation/update must be 100% hands-off (i.e. fully automated), and especially when the connected applications are microservices (i.e. lots of them, with various versions of the same service running in parallel due to canary upgrades or just regular rolling upgrades), it's very hard to actually set up the proper index structures & general settings on OpenSearch:

Describe the solution you'd like there should be a way for consumer applications to manage OpenSearch indices in a similar way as can be done with Liquibase for RDBMS (SQL-based relational DBs). there it's possible to define upgrade scripts, and Liquibase then keeps track of what has already been applied and what hasn't (by storing that information in dedicated table(s) on the DB). this can be used both for DDL (data definition language; e.g. changing tables) and DML (data manipulation language; e.g. migrating data), and any mixture of the two (e.g. changing an existing table schema and migrating the data in the process).
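for reference, this is roughly what such a Liquibase changelog looks like on the RDBMS side (changeset id, author and table/column names are purely illustrative):

```xml
<databaseChangeLog xmlns="http://www.liquibase.org/xml/ns/dbchangelog">
  <!-- Liquibase records each applied changeSet (by id/author/file) in a
       tracking table, so re-running the changelog is a no-op. -->
  <changeSet id="1" author="example">
    <createTable tableName="product">
      <column name="id" type="int"/>
      <column name="name" type="varchar(255)"/>
    </createTable>
  </changeSet>
</databaseChangeLog>
```

the feature request is essentially for an equivalent mechanism where the changesets are OpenSearch operations and the tracking information lives in a dedicated index.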

Describe alternatives you've considered

Additional context

note: while this ticket has now been opened in the main OpenSearch repository i'm not sure whether the actual solution for this will be part of this repository. i could well imagine that the solution would be a dedicated application or an OpenSearch plugin.

dblock commented 1 year ago

@rursprung Do you think https://github.com/opensearch-project/opensearch-devops is a better place for this issue?

rursprung commented 1 year ago

@rursprung Do you think https://github.com/opensearch-project/opensearch-devops is a better place for this issue?

thanks for pointing to that repo! yes, it could well be that it's a better fit there - the question is just whether it'll garner as much interaction there as it would here? but it probably makes sense to move it.

i think this plays nicely into what i said in the meeting: there are so many repos, it's hard to keep track of where to open the ticket (and in the central ones there's so much going on that few people will check all new issues/PRs - i definitely don't have the time for that).

dblock commented 1 year ago

I've moved it. I don't know what the solution is for better tracking of progress. My own GitHub notifications are wild, this is what I do, 🤷

rursprung commented 1 year ago

My own GitHub notifications are wild, this is what I do, 🤷

i went the opposite way: i've disabled all email notifications in GitHub and exclusively use the notifications page (which is a pinned tab for me in Firefox). it also offers some filter functionality, though i usually don't use it. i just have three states:

i don't know why GitHub hides this behind the small notification icon in the top right corner and doesn't advertise it more widely. it took me a while to even find out that it exists.

(also, we've drifted a bit off-topic, but i thought it was interesting 😆)

prudhvigodithi commented 1 year ago

Hey @rursprung, from reading the description and forum post, may I know which specific part you want to keep track of in OpenSearch? For example: a specific document, specific data, or - since you mentioned the system scales the pods (the data-ingestion system) - tracking the POD_NAME (which is randomly generated by k8s) or the deployment/statefulset name, so that data already ingested by that set is not re-ingested again? If so, should the data-ingestion system keep a checkpoint of where to resume ingestion, rather than starting from the beginning, when a new pod scales up? @bbarani

rursprung commented 1 year ago

Hey @rursprung, from reading the description and forum post, may I know which specific part you want to keep track of in OpenSearch? For example: a specific document, specific data, or - since you mentioned the system scales the pods (the data-ingestion system) - tracking the POD_NAME (which is randomly generated by k8s) or the deployment/statefulset name, so that data already ingested by that set is not re-ingested again? If so, should the data-ingestion system keep a checkpoint of where to resume ingestion, rather than starting from the beginning, when a new pod scales up? @bbarani

hm, maybe i should've given an example in the initial post of what the intention is. this isn't about managing the surrounding applications (e.g. we're consuming Kafka messages and Kafka keeps track of what you've already consumed, so restarting an importer service has no impact; it'll automatically pick up wherever it left off) or about checkpoints for data ingestion. this is about managing indices, settings & co.

here's an example use-case:

  1. you deploy a fresh OpenSearch cluster
  2. you deploy your application which will continuously import some data from elsewhere (let's call this "importer")
    • due to this being the first start it needs to create the indices, set up roles, maybe create search templates, etc.
    • you'll probably run several replicas of your importer in parallel for speed & availability and all of them might start at the same time - but only one of them should do the initial setup (otherwise you're either doing too much (best case) or actively destroying something (worst case))
  3. after a while you deploy a new version of your importer which just changed something in the logic
    • this version has no changes to the indices & co., so it's just a normal application version rollover and nothing special is needed here. great!
  4. now you deploy yet another new version of your importer, but this time you had to change some stuff, e.g. add some new fields to existing indices, create a new index and set up the roles accordingly, maybe also do a reindex of some data
    • all these changes need to be applied when this version starts up for the first time. but on subsequent startups you mustn't run these operations again as they might not all be idempotent (and even if they were, it'd be a time-expensive startup!)
    • depending on how you're doing the update it might be that several replicas of the new version start at the same time - so they might all try to do the upgrade

i hope this makes things a bit clearer?

rursprung commented 1 year ago

i've spent some time to draft a solution proposal for this and would like to get your feedback on it! (if there's a better way to post this - e.g. as a PR with a markdown document in some RFC repo - please let me know!)

the main questions will be:

Solution proposal

High-Level Overview

Usage of the Tool

The tool should run before the main application and the latter should only be started if the former finished successfully. This can e.g. be achieved by something as simple as ./updater && ./app on a Unix system.
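In Kubernetes, the same sequencing could alternatively be modelled as an init container (a sketch only; the image and resource names are purely illustrative):

```yaml
# Hypothetical pod spec: the schema updater runs to completion
# before the main application container is started.
apiVersion: v1
kind: Pod
metadata:
  name: importer
spec:
  initContainers:
    - name: schema-updater
      image: example/updater:1.2.0    # illustrative image name
      args: ["--url", "https://opensearch:9200"]
  containers:
    - name: app
      image: example/importer:1.2.0   # illustrative image name
```

If the init container fails, Kubernetes does not start the main container and re-schedules the pod, which matches the failure behaviour described below for locking.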

Operation Modes of the Tool

The tool will support multiple operation modes:

Open Points for Usage

Update Process

This diagram shows how the update process could work: image

Details on Locking

The lock is acquired before the version check to ensure that nobody else is in the process of doing the same thing.

If the lock cannot be acquired within a reasonable time the process will fail, to avoid hanging endlessly. Depending on the environment the failure will either lead to a re-schedule (e.g. in Kubernetes) and/or will notify an operator (due to the main process not running / the failure being reported).
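As an illustration (not part of the proposal itself), such a lock could be built on OpenSearch's create-if-absent semantics: indexing a well-known document with `op_type=create` fails with HTTP 409 while the document already exists. A minimal sketch, with the actual HTTP call injected as a callable:

```python
import time

def acquire_lock(create_doc, timeout_s=60.0, poll_s=1.0):
    """Try to acquire a cluster-wide lock by creating a well-known
    document; creation fails with 409 while someone else holds it.

    `create_doc` is any callable performing e.g.
    PUT /schema-updater-locks/_create/lock and returning the HTTP
    status code (201 on success, 409 on conflict).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if create_doc() == 201:
            return True          # lock acquired
        time.sleep(poll_s)       # somebody else holds the lock; retry
    return False                 # give up instead of hanging endlessly
```

Releasing the lock would then be a delete of the same document; a production version would also need to handle stale locks left behind by crashed updaters (e.g. via a timestamp in the lock document).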

Open Points for Locking

Details on General Config File

Optionally, a configuration file can be provided which can contain following information:

Open Points for the General Config File

Details on Version Folders

Each version (e.g. 1.0.0, 1.0.1, 1.1.0, etc.) has its own folder. The version is the version of the schema, not of the application using it (though the two can be the same, and probably will be in most cases for the sake of simplicity when a single application manages the whole schema). The version number must follow semantic versioning.
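For illustration, ordering such folders could be as simple as the following sketch (assuming plain MAJOR.MINOR.PATCH versions without pre-release tags):

```python
def version_key(folder_name):
    """Turn a folder name like '1.10.2' into a sortable tuple (1, 10, 2)."""
    return tuple(int(part) for part in folder_name.split("."))

def ordered_versions(folders):
    """Sort version folders numerically, not lexicographically
    (so '1.10.0' correctly sorts after '1.9.0')."""
    return sorted(folders, key=version_key)
```

The numeric key matters: a plain string sort would place `1.10.0` before `1.9.0`.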

Details on Patch Files

Content of the Patch Files

The patch files will contain the following content:

The tool will then piece together a full HTTP request out of this and send it to OpenSearch. The unique identifier is used to check whether the patch has already been applied in a previous version (it doesn't have to be exactly the same patch, as the change could have happened in several steps in earlier versions and been combined into one here).
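A minimal sketch of this processing step (the field names `id`, `method`, `path` and `body` are illustrative, not a fixed format; the HTTP call is injected as a callable):

```python
def apply_patches(patches, applied_ids, send_request):
    """Apply each patch exactly once.

    `patches`      -- iterable of dicts with 'id', 'method', 'path', 'body'
    `applied_ids`  -- set of identifiers already recorded in OpenSearch
    `send_request` -- callable(method, path, body) performing the HTTP call
    """
    newly_applied = []
    for patch in patches:
        if patch["id"] in applied_ids:
            continue  # already applied in a previous version; skip
        send_request(patch["method"], patch["path"], patch["body"])
        newly_applied.append(patch["id"])  # to be recorded in the tracking index
    return newly_applied
```

Recording the newly applied identifiers back into the tracking index (and doing so transactionally with respect to failures) is the part that needs the locking described above.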

Discarded Ideas for Content of the Patch Files

Open Points for Patch Files

Details on Patch File Processing

Open Points for Patch File Processing

Details on Error Handling

Details on Security

Authentication

The tool needs to provide authentication when calling OpenSearch (unless OpenSearch allows anonymous access with the necessary rights or has no security plugin installed / activated).

Open Points for Authentication

Authorization

The user used by the tool needs very wide-ranging rights, most likely full admin rights, to execute all updates (which range anywhere from managing indices and data in the indices to changing system and security configurations).

Discarded Solution Ideas

Details on Central Config Indices

Some information needs to be persisted, and it's best to do this directly in OpenSearch. For this, the following indices will be needed:

General Open Points

In no particular order:

Discarded Alternative Solution Ideas

prudhvigodithi commented 1 year ago

Hey @rursprung, thanks again for putting this together. I'm open to a deep dive in a meeting (call) to go over the solution (please let me know if that works for you). Following are the questions I have.

1) How do we handle the tool for normal installations other than k8s - is the plan to run ./updater && ./app manually (or with user scripts) and then start the application?

2) Regarding the content of the patch file: with the current problem statement the patch files cover indices, reindexing and role management, but there might be a lot of other requests, e.g. adding new settings or updating the password for a new application version. Is the tool expected to take care of all of this? If so, there is no end to what the patch files can contain.

3) If the tool is used in k8s, should it be part of every application pod's init setup? I feel it shouldn't be, as a single process that applies all the required settings from the patch file once should be good?

4) Also, apart from the problem statement, who are the other users who could benefit? Just checking whether this could be offered as something like a product, or just like the existing client. From the description it looks to me like a CLI client that can be used (or not) for the problem statement.

Thank you

rursprung commented 1 year ago

thanks for your feedback!

I'm open for a deep dive to have a meeting (call) to go over the solution (Please let me know if that works for you).

that'd be great! the question is if there are others in the community who'd be interested in joining this? (if so: please speak up!)

  1. How do we handle the tool for normal installations other than k8s - is the plan to run ./updater && ./app manually (or with user scripts) and then start the application?

i'd presume that it'd just be used as ./updater && ./app

2. Regarding the content of the patch file: with the current problem statement the patch files cover indices, reindexing and role management, but there might be a lot of other requests, e.g. adding new settings or updating the password for a new application version. Is the tool expected to take care of all of this? If so, there is no end to what the patch files can contain.

in my design proposal the patch file does not know about any of the commands specifically. it'd be something like this:

```yaml
method: "POST"
path: "/_reindex"
content: "{ ... }"
```

so the updater doesn't need to know what all of that means. it'll just put together the request and execute it.

3. If the tool is used in k8s, should it be part of every application pod's init setup? I feel it shouldn't be, as a single process that applies all the required settings from the patch file once should be good?

it has to be part of every startup/init (k8s or not) as we cannot know what the status of the environment is. if you're setting things up in an automated way (k8s, ansible, hand-woven shell scripts, etc.) then the code doesn't know whether it'll run in a clean dev environment, on an outdated test environment or on production, so it always has to run to ensure that the setup is correct. another reason for it to always run is to make sure that there hasn't been another version of the app which has already updated the schema with a breaking change (in which case the updater would fail, which in turn would prevent the app from starting, and an operator could then check what's going on and remove the offending old version).

4. Also, apart from the problem statement, who are the other users who could benefit? Just checking whether this could be offered as something like a product, or just like the existing client. From the description it looks to me like a CLI client that can be used (or not) for the problem statement.

i'd say anyone running any 3rd-party applications against OpenSearch (or Elasticsearch) in a distributed environment (i.e. where more than one replica of the application may be running), or in diverse environments where they want to ensure that the cluster is automatically bootstrapped or brought to the correct version without having to encode this in their own application.

peterzhuamazon commented 1 year ago

Hi @rursprung I am not sure if this is something you are looking for? https://github.com/opensearch-project/opensearch-cli/

rursprung commented 1 year ago

no, i don't think that this covers it:

prudhvigodithi commented 1 year ago

Adding @dblock @CEHENKLE @elfisher @bbarani

peterzhuamazon commented 1 year ago

Adding @wbeckler and the @opensearch-project/clients team into the conversation. Please let us know your thoughts on this.

Thanks.

bbarani commented 1 year ago

@wbeckler @opensearch-project/opensearch-migrations can you provide your inputs?

wbeckler commented 1 year ago

@rursprung There's been some movement on building automations for https://github.com/opensearch-project/opensearch-migrations/issues/29

I know it's not exactly where you're going, but I'd be curious if you felt this had an overlap with what you're thinking.

rursprung commented 1 year ago

@rursprung There's been some movement on building automations for opensearch-project/opensearch-migrations#29

I know it's not exactly where you're going, but I'd be curious if you felt this had an overlap with what you're thinking.

thanks for this update @wbeckler! i think there's some overlap, though there are also vast differences:

thus i don't think that this can (or should) be the same tool. it has a different focus and a different target audience (yours: opensearch contributors; this: cluster admins)

i see in your issue that there was also some discussion around supporting certain cluster-admin features (e.g. validating whether a cluster is ready to be upgraded). i think these things would correlate with the tool proposed here. i had not explicitly thought about this yet: in our setup everything is created in automated ways, so e.g. taking care of index upgrades (through re-indexing) is something you just have to do as an update script and then it'll work. but for setups where there could be indices created by users, it'd be useful to explicitly check the indices for re-indexing.

davidjlynn commented 1 year ago

Hey @rursprung , I like the idea a lot in providing a method of schema migration for OpenSearch.

My thoughts would be that this would give application developers a tool to reduce the amount of backwards-compatibility logic as their application evolves. So, while they could still handle very old versions of their schema, they can proactively choose to upgrade their schema and hence reduce a lot of historic debt. Nice.

My input would be: you have mentioned that this is along similar lines to Liquibase, but for OpenSearch. In the simplest sense, Liquibase deals with lists of changesets and applies them to databases, and some of these changesets can be applied across different database types. I suggest we consider supporting OpenSearch via a Liquibase extension.

As an example, here is a Liquibase extension for MongoDB: https://github.com/liquibase/liquibase-mongodb This started life as a custom extension (see the repository this one is a fork from) and appears to have been adopted by Liquibase now that it has reached maturity.

The motivation here would be to avoid reinventing the wheel when it comes to the management of changesets and relevant formats. I would hope the extension would allow the solution to concentrate on:

This assumes that such an extension is possible, but using this established framework may save effort in the long term.

wbeckler commented 1 year ago

This sounds like a really great approach. If you identify any gaps in the API of OpenSearch for this purpose, please raise an issue for that.

dblock commented 1 year ago

Looks like we narrowed this topic down to "live-migrating schema and data". I'll move the issue into opensearch-migrations (that didn't exist when this issue was created).

I really like the idea of reusing an existing framework like liquibase to describe the changes desired in a form or shape of a destination (e.g. current mapping -> new mapping). Applying the change could be implemented both in an external tool, and as an OpenSearch plugin/extension. If we go the latter route, the API could maybe be something like POST /_extensions/transform with source, target and options (e.g. all or nothing, background, live or marking an index read-only, etc.). Tasks could be queued with job scheduler. Would that be valuable?

rursprung commented 1 year ago

this is a great suggestion @davidjlynn!

i'm currently looking a bit into this, and from what i've found so far liquibase isn't really built to support NoSQL databases. both liquibase-mongodb and liquibase-cosmosdb have to implement fake shims for JDBC functionality (see e.g. MongoClientDriver). they also have some somewhat-generic liquibase.nosql wrappers which are, however, not upstreamed to liquibase itself (or otherwise factored out of the repositories). i've raised https://github.com/liquibase/liquibase/issues/4236 for this. liquibase-neo4j even seems to implement a full JDBC driver just for liquibase and then use that. other NoSQL databases are integrated using their SQL APIs.

we could theoretically use the OpenSearch JDBC driver, however then we wouldn't be able to use native OpenSearch actions, and i guess it would also not allow managing settings (esp. also of plugins)? (note: i've never used the OpenSearch JDBC driver and haven't looked into it yet.) using opensearch-java would IMHO make more sense than the JDBC driver.

i very much like the idea of going with Liquibase and extending it. but i feel that if we do it the way mongodb & cosmosdb were done, we'll end up with a hacky codebase. maybe it'd make sense to check with some Liquibase developers whether there'd be a way to add better NoSQL support to liquibase-core as part of this effort?

@dblock: how would you envision the transformations with source and target to look? if you just pass two index definitions then the system can't know which fields map to which new ones if e.g. a name changes - or would you want to annotate that somehow?

i kind-of like the idea of offloading the logic to OpenSearch as that'd make it even more general (other, non-liquibase, use-cases might spring up and could then use this). though in a first version we might not need anything additional, depending on how we define the configuration API: if you just define which API you want to call, the method to call it with and the body to pass along, then that's a very generic approach which would support anything.

so this could be an abstract liquibase-http and, based on that, we could then add liquibase-opensearch which in a first version just deals with the management aspects (storing which updates were applied; it needs OpenSearch knowledge to create/update this index) and later offers dedicated change types to manage the data (and settings) rather than having to specify HTTP methods directly. this will then also allow smoother upgrades (the HTTP calls might break in major releases; the liquibase change types can deal with that if done properly and abstract away from it).
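to illustrate, a changeset in such a liquibase-http extension might look like this (the httpRequest change type and its attributes are invented here purely for illustration; no such extension exists yet):

```xml
<databaseChangeLog xmlns="http://www.liquibase.org/xml/ns/dbchangelog">
  <changeSet id="create-products-index" author="example">
    <!-- "httpRequest" is a hypothetical generic change type -->
    <httpRequest method="PUT" path="/products">
      <body>{ "mappings": { "properties": { "name": { "type": "text" } } } }</body>
    </httpRequest>
  </changeSet>
</databaseChangeLog>
```

a later liquibase-opensearch layer could then replace the raw httpRequest with dedicated change types (createIndex, reindex, etc.) while keeping the same changelog mechanics.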

something else i noticed: liquibase and its extensions sadly still target Java 8 (see also https://github.com/liquibase/liquibase/issues/1677), but opensearch-java targets Java 11. so if we use that then our liquibase-opensearch extension would also target only Java 11. i hope that this is then all still compatible (except if somebody tries to run it on Java 8).

nvoxland commented 1 year ago

Hi, I'm the creator of Liquibase and found this thread from liquibase/liquibase#4236. I did expand that ticket to be the general "Make NoSQL databases easier to support" epic, but even with the slightly hackier work-arounds I think leveraging the existing Liquibase code for everything except the OpenSearch-specific portions will be much easier, and in the end more powerful/flexible, than something built from scratch. But I'm also biased :) And I'm always available for any questions you'd have, either here or at nathan@liquibase.org.

We target Java 8 by default because there seems to be an unfortunately large percentage of people still running it and we don't want to cut them out. But extensions like OpenSearch which require Java 11 can certainly build with Java 11; that's not a problem at all.

wbeckler commented 1 year ago

I just found this implementation of a JavaScript migrations library that is not liquibase but which does start to think about repeatable and reversible schema changes: https://nathanfries.com/posts/opensearch-migrations/


Npfries commented 1 year ago

Hi! I'm the author of that post. We have been running a more sophisticated version of that implementation in production for almost 18 months. I'm not sure I have a whole lot to add here but if there are any questions about how we're using it or improvements we are making, I'd be happy to answer.

We are executing the migrations alongside deployment of microservices in k8s, and primarily for application search.

rursprung commented 1 year ago

@wbeckler / @Npfries: this looks interesting as well! however, for us it makes more sense to build on liquibase so that this is more aligned and better integrated with our applications, which already make use of liquibase.

@nvoxland: thanks for your reply! i had tried to contact you via email a while ago but never got a reply - could you please check your emails or drop me an email at ralph.ursprung@avaloq.com if you hadn't received it? we'd be interested in getting this started!

rursprung commented 10 months ago

@nvoxland: i still haven't given up on this - it'd be great if you could contact me so that we can get this started!