Open nknize opened 1 year ago
/cc @saratvemulapalli
This is an interesting topic: putting more functionality in core versus splitting it out into a separate repo. For the current ml-commons repo, we followed the second option to build a vertical experience rather than putting more and more vertical functionality into core. Both options have pros and cons. Putting it into core makes things reusable. The con is that the community will face a higher bar to tune the framework, as the core repo is much bigger.
Our goal is to make ml-commons the common layer for ML/AI. That doesn't seem to cause much fragmentation, since this is the only common place for ML/AI. As for "simplify the testing", I don't think that holds if we put ml-commons into core. ML/AI is quite a specialized, vertical thing; putting it into core will make testing much more complicated (think about model deployment, hardware support like GPUs, etc.). So I don't see moving to core as a must-have. Let's wait for more comments before making a decision.
The con is that the community will face a higher bar to tune the framework, as the core repo is much bigger.
This isn't an issue. Plugins, Modules, and Libraries are built as separate artifacts. So it's not making the consumables "bigger"; it's colocating the min distribution code into the min distribution core repo.
Our goal is to make ml-commons the common layer for ML/AI
That's exactly what this proposal is doing, but moving it to a build artifact created in the min distribution repo.
That doesn't seem to cause much fragmentation.
This proposal is the opposite of fragmentation. It's consolidation. The proposal is to provide KMeans clustering as a core feature in the core repo, built as a core artifact and packaged as part of the min distribution. The downstream repo is relegated to more specialized, fine-tuned algorithms that extend this core functionality and might take other third-party (e.g., C++ native, GPU-optimized) dependencies.
Update to provide more clarity:
Have a look at #5910 for the motivation to modularize the core OpenSearch min distribution into a set of libraries, modules, and plugins that simplify the dependency tree and provide core semver guarantees. This would provide OTB ML as part of that min distribution. KMeans clustering is a pretty "simple" foundational capability. It doesn't have to be installed by default (hence it could be a core plugin), but with a core ML library we can write MLEngine logic baked in as a foundation service provider that is part of the core Node (or written as separable calls in a serverless implementation). The only reason not to do this is if we don't want the min distribution to provide any ML capabilities at all (which, FWIW, I think would be a bad decision).
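To make the "simple foundational capability" point concrete, KMeans is small enough to sketch in plain Java with no third-party or native dependencies, which is part of the argument for shipping it in the min distribution. The class and method names below are hypothetical, not the actual ml-commons API:

```java
import java.util.Arrays;

// Illustrative 1-D k-means trainer (Lloyd's algorithm).
// Names are hypothetical; this is NOT the real ml-commons API.
final class TinyKMeans {
    // Iteratively reassigns points and recomputes centroids; returns final centroids sorted.
    static double[] train(double[] points, double[] initialCentroids, int iterations) {
        double[] centroids = initialCentroids.clone();
        for (int iter = 0; iter < iterations; iter++) {
            double[] sums = new double[centroids.length];
            int[] counts = new int[centroids.length];
            for (double p : points) {
                int c = predict(centroids, p);
                sums[c] += p;
                counts[c]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] > 0) centroids[c] = sums[c] / counts[c];
            }
        }
        Arrays.sort(centroids);
        return centroids;
    }

    // Assigns a point to the index of its nearest centroid.
    static int predict(double[] centroids, double point) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (Math.abs(point - centroids[c]) < Math.abs(point - centroids[best])) best = c;
        }
        return best;
    }
}
```

A trainer of this size arguably clears the bar for a core plugin, while GPU- or SIMD-accelerated variants would remain downstream.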
Thanks @nknize for elaborating on each concern.
This isn't an issue. Plugins, Modules, and Libraries are built as separate artifacts.
This one is confusing to me: if they are built as separate artifacts, what's the difference from the current separate repo? Is there a reason we must do this?
That's exactly what this proposal is doing, but moving it to a build artifact created in the min distribution repo.
Same question as the last one: I'm confused about why it's a must to move things to another place (the min distribution repo).
The proposal is to provide KMeans clustering as a core feature in the core repo built as a core artifact and packaged as part of the min distribution.
This would not be just moving KMeans clustering into the core repo as a core feature, because KMeans depends on the ML framework in ml-commons. That means we would have to move the whole ML framework to the core repo before we could move KMeans.
The downstream repo is relegated to more specialized, fine-tuned algorithms that extend this core functionality and might take other third-party (e.g., C++ native, GPU-optimized) dependencies.
Based on my last concern, we would have to move the whole ML framework to core. This means we would have to work in two repos going forward: the core repo for the ML framework and ml-commons for fine-tuned algorithms. That adds more work. Is it necessary to split things across two repos?
what's the difference from the current separate repo?
Because core modules and plugins can take dependencies on core libraries but they can't take dependencies on downstream repos.
This would not be just moving KMeans clustering into the core repo as a core feature, because KMeans depends on the ML framework in ml-commons. That means we would have to move the whole ML framework to the core repo before we could move KMeans.
I don't think that's a problem. If we end up moving all of ml-commons into a core library and a core module, that just might make the min distribution stronger. Do we not have any other training algorithms besides KMeans? Are there other offerings we could explore as part of a downstream ml-extras repo (e.g., native GPU- or SIMD-accelerated trainers that are so hardware- and platform-specific that they wouldn't make sense in the core open source offering)?
Based on my last concern, we have to move the whole ML framework to core.
❤️ Maybe that's the right thing to do then.
For more context, I'm thinking about this same approach for KNN and Vectors. I think, with the industry demand for vector search, we should explore a min distribution module that provides vector search as a core capability and refactor the KNNEngine as part of the core distribution. Downstream KNN can provide native accelerated offerings like FAISS and nmslib, while the core distribution provides Lucene HNSW as an OEM implementation.
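As a strawman for what a core (non-native) vector search contract might cover, exact brute-force search is the degenerate baseline that Lucene HNSW approximates and that native engines accelerate. All names below are hypothetical, not the actual k-NN plugin API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative exact nearest-neighbor search over float vectors.
// A core "KNNEngine" contract might expose something like this, with Lucene HNSW
// as the default implementation and FAISS/nmslib living downstream.
// Names are hypothetical; this is NOT the real k-NN plugin API.
final class BruteForceKnn {
    // Returns indices of the k vectors closest to the query by squared L2 distance.
    static int[] search(float[][] vectors, float[] query, int k) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < vectors.length; i++) ids.add(i);
        ids.sort(Comparator.comparingDouble(i -> l2(vectors[i], query)));
        return ids.subList(0, Math.min(k, ids.size()))
                  .stream().mapToInt(Integer::intValue).toArray();
    }

    private static double l2(float[] a, float[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```

The point of the split is that this contract stays stable in core while the scoring strategy (exact, HNSW, GPU) is swappable per implementation.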
ml-commons doesn't need to take dependencies on core libraries.
For more context, I'm thinking about this same approach for KNN and Vectors.
Thanks for sharing the big picture. To learn more: do you also plan to make Alerting, AD, and SQL/PPL part of core?
ml-commons doesn't need to take dependencies on core libraries.
Wait.. yes, I think it does? It's the only way to pull in the Settings, Action, and Plugin dependency classes. Or am I missing something here? The plugin build.gradle explicitly takes a dependency on org.opensearch core.
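For illustration, a typical OpenSearch plugin declares something along these lines in its build.gradle (exact coordinates and version properties vary by plugin; this is a sketch, not copied from ml-commons):

```groovy
// Sketch of a typical plugin dependency block; the version property name is illustrative.
dependencies {
    // pulls in Settings, ActionPlugin, etc. from the core distribution at compile time
    compileOnly "org.opensearch:opensearch:${opensearch_version}"
}
```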
To learn more, do you also have plan to make alerting , AD, SQL/PPL as part of core?
For SQL/PPL, yes (/cc @anirudha). We're discussing SQL as a default query language and PPL as the default processing language for observability provided by the min distribution (similar to painless as the default scripting language). They will likely be plugins not installed by default, but there will be a :libs:opensearch:query-language library for the community to be able to "bring their own query language" as a downstream plugin or extension. The jury is still out for Alerting and AD, but I think a foundational implementation for these may also be desirable.
It's the only way to pull in the Settings, Action, Plugin dependency classes
For now, all OpenSearch plugins have access to these dependency classes. There's no need to move to the core repo to use them.
My ideal mental model for core is to 1) only include constructs that are immutable or generic and are unlikely to have multiple implementations, 2) be essential for the most minimal version of the search engine, 3) must ship together, i.e., be tightly coupled.
An easy-to-grok example is the interface for a repository: it should be in core because without storage the engine can't work, but the implementations for S3 vs. Azure should live outside of the repo because they are different, because there might be multiple implementations (e.g., S3 storage looks very different from Azure blob storage), and because there's no reason why repository-s3 can't ship separately from discovery-ec2, less often if there are no changes, or much more often than core if there are.
Following this model, I would put the concept of training into core, but I would not put the actual training into core.
My ideal mental model for core is to 1) only include constructs that are immutable or generic, and are unlikely to have multiple implementations, 2) be essential for the most minimal version of the search engine, 3) must ship together, ie. tightly coupled.
Thanks @dblock. I think that's a good high-level guideline for what should go into core and what should not.
Has anyone seen or received any community user asking to put training into core? To me it seems only the subset of customers who need to run ML need this, which doesn't seem to meet points 2 and 3: 2) be essential for the most minimal version of the search engine, and 3) must ship together, i.e., tightly coupled.
BTW, I removed the untriaged label, as this is more of a discussion and a long-term topic. Keeping it untriaged would distract our and the community's attention.
Is your feature request related to a problem? ML training and prediction (e.g., KMeans) should be a core feature of the OpenSearch minimum distribution. The API developed in ml-commons (e.g., MLEngine methods like train, deploy, predict) would be useful for top-of-funnel developers to explore other native training models that do not require sucking data out of the OpenSearch Node. This is a form of OpenSearch Node-level compute, which is very useful for on-the-fly exploratory data analysis (EDA).

What solution would you like? Move the MLCommons foundation API to a core :libs:opensearch-ml top-level library and the concrete implementations (e.g., KMeans trainers and predictors) to a :plugins:core-ml plugin that is available in the core distribution. Ideally this will relegate the entire ml-commons repository to more advanced compute-heavy trainers that we don't want as part of the core (e.g., non-linear, parametric). Or it would remove the need for that repo altogether, which would simplify testing and reduce core fragmentation across the ecosystem.

What alternatives have you considered? Leave as is.
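The proposed split might look roughly like this: an abstract contract in :libs:opensearch-ml with concrete algorithms in :plugins:core-ml. Everything below is a hypothetical sketch (interface name, type parameters, and the toy MeanPredictor are all invented), not the actual ml-commons signatures:

```java
// Hypothetical core ML contract (would live in :libs:opensearch-ml);
// concrete algorithms like a KMeans trainer would live in :plugins:core-ml.
// Names and signatures are invented for illustration.
interface MLAlgorithm<M, I, O> {
    M train(I input);             // fit a model on node-local data
    O predict(M model, I input);  // run inference against a trained model
}

// Trivial concrete example: fits the mean of the training data and
// always predicts it, just to show the train/predict shape.
final class MeanPredictor implements MLAlgorithm<Double, double[], Double> {
    public Double train(double[] input) {
        double sum = 0;
        for (double v : input) sum += v;
        return sum / input.length;
    }
    public Double predict(Double model, double[] input) {
        return model; // degenerate predictor: returns the fitted mean
    }
}
```

Downstream ml-commons trainers (GPU-accelerated, non-linear, parametric) would then implement the same contract rather than defining their own.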