opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means, or linear regression, to help developers build ML related features within OpenSearch.
Apache License 2.0

[RFC] Enhancement of Multi-Tenancy Capabilities in ML-Commons #2358

Open dhrubo-os opened 7 months ago

dhrubo-os commented 7 months ago

RFC: Enhancement of Multi-Tenancy in ML-Commons Plugin

1. Introduction

This RFC details the proposed low-level design for enhancing multi-tenancy in the ML-Commons plugin, aimed at providing robust, scalable, and secure management of machine learning models for each tenant within a highly distributed environment.

2. High-Level Context

The proposed enhancements to the ML-Commons plugin are designed to support the evolving requirements of cloud architectures transitioning towards serverless models. This RFC aligns with the trend of enhancing cloud-native plugins to facilitate sophisticated multi-tenancy capabilities.

Strategic Benefits of Multi-Tenancy:

At a high level, ML-Commons support can be distributed as an application-based microservice.

[Image: Screenshot 2024-04-24 at 4 06 11 PM]

3. Purpose

The necessity for this enhancement arises from the growing demand for platform capabilities that can support multiple tenants simultaneously without compromising on performance, security, or scalability. In this context, a "tenant" refers to an individual customer or distinct operational unit, each with its own secure data and configurations. Allowing multiple tenants to coexist on the same application installation effectively means we can manage separate customers or business units independently yet within a single shared environment. Effective multi-tenancy will enable our infrastructure to:

This approach not only maximizes resource utilization but also enhances operational flexibility, making it possible to cater to a broad spectrum of customer needs efficiently.

3. Design Considerations

4. Out of Scope

The following elements are considered out of the scope of this proposal:

5. Proposed Architecture and Solutions

5.1. Resource Separation Between Tenants

Each tenant will be uniquely identified within the system by a customer-specific ID (tenant_id). This ID is used to group resources such as models, agents, and connectors, ensuring that each tenant's assets are managed independently.

5.2. Data Storage Models

Silo Model (Index per Tenant):

Pool Model (Unified Index with Tenant Identifiers): (Recommended)
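
To make the difference between the two storage models concrete, here is a minimal sketch in Python (used only for illustration; the plugin itself is not written in Python). The helper names, the per-tenant index naming scheme, and the exact query shape are assumptions and are not part of this RFC. The only names taken from the RFC are the .plugins-ml-model index and the tenant_id field.

```python
# Hypothetical helpers contrasting the silo and pool storage models.
# The index suffix scheme and the query wrapper are assumptions.

BASE_INDEX = ".plugins-ml-model"


def silo_index_name(tenant_id: str) -> str:
    """Silo model: each tenant gets its own index (assumed naming scheme)."""
    return f"{BASE_INDEX}-{tenant_id}"


def pool_query(tenant_id: str, user_query: dict) -> dict:
    """Pool model: one shared index; every query is wrapped in a tenant filter."""
    return {
        "query": {
            "bool": {
                "must": [user_query],
                "filter": [{"term": {"tenant_id": tenant_id}}],
            }
        }
    }


if __name__ == "__main__":
    print(silo_index_name("app-12345"))
    print(pool_query("app-12345", {"match": {"name": "openAI-gpt-3.5-turbo"}}))
```

In the pool model the isolation guarantee rests on the tenant filter being applied to every read and write, which is why a single shared helper (or a DAO layer) is a natural place to enforce it.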

5.3. Remote Data Storage

The ml-commons plugin uses the .plugins-ml-model index to store ML model metadata. With multi-tenancy, plugin metadata for all tenants will be persisted in a common data store. We will also implement a basic DAO layer that can connect to any kind of remote data storage layer. Details are in this RFC.
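
As a rough illustration of what such a DAO layer could look like, the sketch below defines a storage-agnostic interface and an in-memory stand-in for a remote store. It is written in Python for brevity. The class and method names (ModelMetadataDao, put, get) are hypothetical and do not come from the RFC or from the ml-commons code base.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class ModelMetadataDao(ABC):
    """Hypothetical storage-agnostic interface for model metadata."""

    @abstractmethod
    def put(self, tenant_id: str, model_id: str, metadata: dict) -> None:
        """Store metadata for a model owned by the given tenant."""

    @abstractmethod
    def get(self, tenant_id: str, model_id: str) -> Optional[dict]:
        """Return the metadata, or None if this tenant has no such model."""


class InMemoryModelMetadataDao(ModelMetadataDao):
    """In-memory stand-in for a remote data store (for illustration only)."""

    def __init__(self) -> None:
        # Keyed by (tenant_id, model_id) so tenants cannot see each other's data.
        self._store: Dict[Tuple[str, str], dict] = {}

    def put(self, tenant_id: str, model_id: str, metadata: dict) -> None:
        # Stamp the document with its owner, as in the pool model.
        self._store[(tenant_id, model_id)] = dict(metadata, tenant_id=tenant_id)

    def get(self, tenant_id: str, model_id: str) -> Optional[dict]:
        doc = self._store.get((tenant_id, model_id))
        return dict(doc) if doc is not None else None


if __name__ == "__main__":
    dao = InMemoryModelMetadataDao()
    dao.put("app-12345", "model-1", {"name": "openAI-gpt-3.5-turbo"})
    print(dao.get("app-12345", "model-1"))
    print(dao.get("another-tenant", "model-1"))
```

Because callers only depend on the interface, a backend for an OpenSearch index or for a remote store could be swapped in without touching the calling code.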

5.4. API Design and Interaction

REST Layer:

POST /_plugins/_ml/models/_register?tenant_id=app-12345
{
    "name": "openAI-gpt-3.5-turbo",
    "function_name": "remote",
    "description": "test model",
    "connector_id": "a1eMb4kBJ1eYAeTMAljY"
}

Transport Layer:

{
    "name": "openAI-gpt-3.5-turbo",
    "function_name": "remote",
    "model_group_id": "1jriBYsBq7EKuKzZX131",
    "description": "test model",
    "connector_id": "a1eMb4kBJ1eYAeTMAljY",
    "tenant_id": "<Example application id>"
}
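
The relationship between the two layers shown above can be sketched as follows: the REST layer receives tenant_id as a query parameter, and the transport layer carries it as a field in the request body. The function name to_transport_payload is hypothetical, and the sketch is in Python for illustration only. It shows the mapping, not the plugin's actual handler code.

```python
from urllib.parse import parse_qs, urlsplit


def to_transport_payload(rest_path: str, body: dict) -> dict:
    """Copy tenant_id from the REST query string into the transport body."""
    params = parse_qs(urlsplit(rest_path).query)
    tenant_ids = params.get("tenant_id", [])
    payload = dict(body)  # do not mutate the caller's body
    if tenant_ids:
        payload["tenant_id"] = tenant_ids[0]
    return payload


if __name__ == "__main__":
    path = "/_plugins/_ml/models/_register?tenant_id=app-12345"
    body = {
        "name": "openAI-gpt-3.5-turbo",
        "function_name": "remote",
        "description": "test model",
        "connector_id": "a1eMb4kBJ1eYAeTMAljY",
    }
    print(to_transport_payload(path, body))
```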

5.5. Multi-tenancy use case identifier

We need to send tenant_id in both layers (REST and transport). For multi-tenancy this field should be mandatory, while for regular OpenSearch it needs to be optional. To distinguish the two cases, we can introduce a setting, plugins.ml_commons.independent_node, which will be true for multi-tenancy and false for regular OpenSearch. Depending on this setting, tenant_id is treated as mandatory or optional. This check is needed in every transport and REST action.
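
A minimal sketch of that check, in Python for illustration only. The setting key plugins.ml_commons.independent_node comes from this RFC. The function name resolve_tenant_id, the use of a plain dict for settings, and the error type are assumptions.

```python
from typing import Optional

SETTING_KEY = "plugins.ml_commons.independent_node"


def resolve_tenant_id(settings: dict, tenant_id: Optional[str]) -> Optional[str]:
    """Require tenant_id when the multi-tenancy setting is true.

    With the setting false or absent (regular OpenSearch), tenant_id is
    optional and passed through unchanged.
    """
    multi_tenant = bool(settings.get(SETTING_KEY, False))
    if multi_tenant and not tenant_id:
        raise ValueError(f"tenant_id is required when {SETTING_KEY} is true")
    return tenant_id


if __name__ == "__main__":
    print(resolve_tenant_id({SETTING_KEY: False}, None))
    print(resolve_tenant_id({SETTING_KEY: True}, "app-12345"))
```

Keeping the check in one shared function would let every REST and transport action call it the same way instead of repeating the condition.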

6. API Support and Functionality

The following API functionalities will be supported:

7. Backward Compatibility (BWC) and Impact on the Current Single-Cluster Use Case

7.1 Enhanced Flexibility with Backward Compatibility

The introduction of multi-tenancy capabilities is designed to be fully backward compatible, ensuring that existing single-cluster setups can be upgraded without disruption. This seamless integration is facilitated by the plugins.ml_commons.independent_node setting, which distinguishes between single-tenant and multi-tenant environments. This approach ensures that all current functionalities remain intact while providing the option to leverage advanced multi-tenancy features.

7.2 Adaptive Access Control for Multi-Tenant Environments

In multi-tenant configurations, traditional model access controls are temporarily disabled to pave the way for a more robust, service-based authentication and authorization mechanism tailored for complex multi-tenant dynamics. This transitional phase is crucial for developing a secure, scalable multi-tenant architecture that can support diverse and dynamic tenant requirements without compromising security.

7.3 Utilizing Tenant ID in Single Tenant Clusters

For single-tenant environments, the tenant_id parameter offers a way to segment and manage resources on a project-by-project basis, even within a single-tenant setup. This capability allows customers to organize and isolate resources effectively, providing an added layer of flexibility and control. Customers manage the tenant_id themselves, which lets them tailor the setup to their specific operational needs.

8. System Settings and Configuration

The system will use a mix of static and dynamic settings to manage operational policies and thresholds, ensuring that each tenant's environment is optimized for performance and security while maintaining the flexibility to adjust to specific needs.

9. Conclusion

This RFC proposes a strategic enhancement to the ML-Commons plugin to support robust multi-tenancy. Feedback on this proposal is encouraged to refine the approach and ensure the successful implementation of multi-tenancy in the ML-Commons ecosystem.

saratvemulapalli commented 6 months ago

@dhrubo-os thanks for the RFC. It absolutely makes sense for auxiliary plugins to explore multi-tenancy pathways. The proposal seems to be specific to ML Commons, but I believe this is a common problem for other plugins. Do you have thoughts on solving it generically? For example, plugins like Alerting, Anomaly Detection, Flow Framework, and SQL-Spark would have similar use cases.

reta commented 6 months ago

During the meeting we have discussed if it would make sense to change ml-commons from plugin to extension [1] since it gives more freedom regarding the deployment / scalability / implementation aspects. Alternatively, there is a general discussion regarding multi-tenancy support by OpenSearch Core [2], please feel free to contribute.

[1] https://github.com/opensearch-project/opensearch-sdk-java
[2] https://github.com/opensearch-project/OpenSearch/issues/13516

ansjcy commented 6 months ago

This looks like a very valid use case of the generic multi-tenancy support in OpenSearch we are discussing (as reta mentioned)!

To distinguish that, we can open a setting field: plugins.ml_commons.independent_node which will be true for Multi-tenancy and false for Opensearch. Depending on this field we will mark tenant_id as mandatory/optional. We need this check for every transport and rest layer action.

This can be done by simply adding a rule-based tenancy labeller. We have a draft PR for a similar use case that attaches the tenancy label based on the authenticated user: https://github.com/opensearch-project/OpenSearch/pull/13374/files#diff-b4d03a88895891abd177d233d20dec21a8c87ec97b7042d158afeb9729f7b300 . Please take a look at the meta issue mentioned by reta and at this draft PR, and share any feedback!

dhrubo-os commented 6 months ago

During the meeting we have discussed if it would make sense to change ml-commons from plugin to extension [1] since it gives more freedom regarding the deployment / scalability / implementation aspects.

@dbwiddis Do you have any input here?

dbwiddis commented 6 months ago

During the meeting we have discussed if it would make sense to change ml-commons from plugin to extension [1] since it gives more freedom regarding the deployment / scalability / implementation aspects.

That's a potential end game, yes! However, the extensions SDK is not in a mature enough state to make that transition right now. We did get as far with extensions as identifying NamedRoutes and integrating with the security framework's token generation to pass around authenticated user information, so there's definitely some application here of the work we previously did on extensions.

More generally, I think this is part of an overall long term plan to separate data from code and logically separate things where we can continue to use them in a cluster environment while simultaneously enabling a faster/easier transition of data manipulation (compute, memory, storage, etc.) to other environments, both ones we can think of now and ones that may not yet exist...