dhrubo-os commented 7 months ago

RFC: Enhancement of Multi-Tenancy in ML-Commons Plugin

1. Introduction

This RFC details the proposed low-level design for enhancing multi-tenancy in the ML-Commons plugin, aimed at providing robust, scalable, and secure management of machine learning models for each tenant within a highly distributed environment.

2. High-Level Context

The proposed enhancements to the ML-Commons plugin are designed to support the evolving requirements of cloud architectures transitioning towards serverless models. This RFC aligns with the trend of enhancing cloud-native plugins to facilitate sophisticated multi-tenancy capabilities.

Strategic Benefits of Multi-Tenancy:

Cost Efficiency: Multi-tenancy allows for resource consolidation across multiple tenants, reducing costs through economies of scale.
Reduced Management Overhead: Administrative tasks are centralized, significantly reducing management overhead and enhancing operational efficiency.
Scalability and Flexibility: ML-Commons support can be distributed as an application-based microservice, enhancing scalability and flexibility.

At a high level, ML-Commons support can be distributed as an application-based microservice:

[Image: Screenshot 2024-04-24 at 4 06 11 PM]

3. Purpose

The necessity for this enhancement arises from the growing demand for platform capabilities that can support multiple tenants simultaneously without compromising on performance, security, or scalability. In this context, a "tenant" refers to an individual customer or distinct operational unit, each with its own secure data and configurations. Allowing multiple tenants to coexist on the same application installation effectively means we can manage separate customers or business units independently yet within a single shared environment. Effective multi-tenancy will enable our infrastructure to:

Securely Segregate Tenant Data: Ensure that each tenant's data is completely isolated from others, thereby protecting customer information and adhering to data privacy regulations.
Efficiently Allocate and Manage Computational Resources: Dynamically distribute system resources among tenants based on demand, ensuring optimal performance and cost efficiency.
Scale Operations Dynamically: Adjust resource allocation and service capacity in real-time as tenant demands change, thus supporting fluctuating workloads without affecting the quality of service.

This approach not only maximizes resource utilization but also enhances operational flexibility, making it possible to cater to a broad spectrum of customer needs efficiently.

3. Design Considerations

Data Isolation: Guarantee strict separation and security of each tenant's data.
Efficient Resource Management: Optimize resource use across multiple tenants to ensure scalability and performance.
Seamless Integration: Ensure smooth integration with existing data processing frameworks.
Codebase Stability: Minimize changes to the existing codebase to simplify maintenance.
Backward Compatibility: Maintain compatibility with existing installations to ensure a smooth transition.

4. Out of Scope

The following elements are considered out of the scope of this proposal:

Connectivity to specific data stores.
Authentication services.
Gateway services for request routing.
Remote metadata storage management.

5. Proposed Architecture and Solutions

5.1. Resource Separation Between Tenants

Each tenant will be identified uniquely within the system using a customer-specific ID (tenant_id). This ID will be pivotal in clustering resources such as models, agents, and connectors, ensuring that each tenant's assets are managed independently.
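
As a rough illustration of how tenant_id could scope resources, the Python sketch below checks that a stored resource belongs to the requesting tenant before returning it. The function name, the error type, and the convention that each stored document carries a `tenant_id` field are assumptions made for this example, not details specified in the RFC.

```python
class TenantMismatchError(PermissionError):
    """Raised when a resource is requested by a tenant that does not own it."""


def ensure_same_tenant(resource: dict, request_tenant_id: str) -> dict:
    """Return the resource only if its tenant_id matches the caller's tenant.

    Assumption: models, agents, and connectors are stored as documents
    that each carry a 'tenant_id' field.
    """
    owner = resource.get("tenant_id")
    if owner != request_tenant_id:
        # Keep the message generic so it does not reveal the owning tenant.
        raise TenantMismatchError("resource is not accessible for this tenant")
    return resource
```

A get-by-ID handler for a model, agent, or connector could call such a check after loading the document and before returning it.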

5.2. Data Storage Models

Silo Model (Index per Tenant):

Description: Each tenant's data is stored in separate indices, named by appending the tenant identifier to a base index name.
Pros: Ensures excellent data isolation, enhances security, and improves performance by reducing query load.
Cons: Increases the complexity of managing a large number of indices, complicates cross-tenant queries, and may lead to higher operational costs.

Pool Model (Unified Index with Tenant Identifiers): (Recommended)

Description: A single index stores all tenant data, with documents tagged with tenant identifiers.
Pros: Simplifies index management, enhances scalability, and reduces overhead.
Cons: May cause performance variability and complicate the enforcement of stringent security measures.
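
To make the two storage models concrete, here is a minimal Python sketch. The helper names, the `-` separator in the silo index name, and the assumption that `tenant_id` is stored as an exact-match field on each document are illustrative choices, not part of the proposal. The query shape is the standard OpenSearch `bool` query with a `term` filter.

```python
def silo_index_name(base_index: str, tenant_id: str) -> str:
    """Silo model: one index per tenant, named by appending the tenant ID.

    The '-' separator is an assumption; the RFC only says the tenant
    identifier is appended to a base index name.
    """
    if not tenant_id:
        raise ValueError("tenant_id is required in the silo model")
    return f"{base_index}-{tenant_id}"


def pool_query(user_query: dict, tenant_id: str) -> dict:
    """Pool model: wrap a query so it only matches one tenant's documents.

    Assumes every document in the shared index has a 'tenant_id' field.
    """
    if not tenant_id:
        raise ValueError("tenant_id is required in the pool model")
    return {
        "query": {
            "bool": {
                "must": [user_query],
                # The filter clause restricts results without affecting scoring.
                "filter": [{"term": {"tenant_id": tenant_id}}],
            }
        }
    }
```

In the pool model, isolation depends on every read path applying this filter, which is one way to read the security concern listed under its cons.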

5.3. Remote Data Storage

The ml-commons plugin uses the index .plugins-ml-model to store ML model metadata. With multi-tenancy, plugin metadata for all tenants will be persisted in a common data store. We will also implement a basic DAO layer that can connect to any kind of remote data storage layer. Details in this RFC.
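
The RFC does not define the DAO interface, so the following Python sketch is only one possible shape for it. The names `ModelMetadataStore` and `InMemoryModelMetadataStore` and the method signatures are hypothetical. The point it illustrates is that every operation takes a tenant_id, so any backend plugged in behind the interface has to scope its reads and writes by tenant.

```python
from typing import Dict, Optional, Protocol, Tuple


class ModelMetadataStore(Protocol):
    """Hypothetical DAO interface; every operation is scoped by tenant_id."""

    def put(self, tenant_id: str, model_id: str, metadata: dict) -> None: ...

    def get(self, tenant_id: str, model_id: str) -> Optional[dict]: ...

    def delete(self, tenant_id: str, model_id: str) -> bool: ...


class InMemoryModelMetadataStore:
    """Stand-in backend for illustration; a real one would call a remote store."""

    def __init__(self) -> None:
        # Keyed by (tenant_id, model_id) so tenants cannot see each other's data.
        self._docs: Dict[Tuple[str, str], dict] = {}

    def put(self, tenant_id: str, model_id: str, metadata: dict) -> None:
        # Store a copy tagged with the owning tenant.
        self._docs[(tenant_id, model_id)] = {**metadata, "tenant_id": tenant_id}

    def get(self, tenant_id: str, model_id: str) -> Optional[dict]:
        doc = self._docs.get((tenant_id, model_id))
        return dict(doc) if doc is not None else None

    def delete(self, tenant_id: str, model_id: str) -> bool:
        # True if a document was removed, False if nothing matched.
        return self._docs.pop((tenant_id, model_id), None) is not None
```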

5.4. API Design and Interaction

REST Layer:

Tenant IDs will be included in API requests, within URL parameters, to ensure proper routing and handling without exposing tenant details externally.
To further abstract tenant identification from users, tenant IDs will primarily be passed via HTTP headers, managed internally by the system.

POST /_plugins/_ml/models/_register?tenant_id=app-12345
{
    "name": "openAI-gpt-3.5-turbo",
    "function_name": "remote",
    "description": "test model",
    "connector_id": "a1eMb4kBJ1eYAeTMAljY"
}
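
One way a REST handler might resolve the tenant ID from either source is sketched below in Python. The header name `x-tenant-id` and the rule that a header takes precedence over the `tenant_id` URL parameter are assumptions for illustration; the RFC names neither.

```python
from typing import Mapping, Optional

TENANT_HEADER = "x-tenant-id"  # hypothetical header name, not defined in the RFC
TENANT_PARAM = "tenant_id"     # URL parameter used in the example request


def resolve_tenant_id(
    headers: Mapping[str, str], params: Mapping[str, str]
) -> Optional[str]:
    """Return the tenant ID from the header if present, else from the URL."""
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in headers.items():
        if name.lower() == TENANT_HEADER and value.strip():
            return value.strip()
    value = params.get(TENANT_PARAM, "").strip()
    return value or None
```

For the example request, `resolve_tenant_id({}, {"tenant_id": "app-12345"})` returns `"app-12345"`.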

Transport Layer:

The transport layer will handle tenant IDs explicitly within API requests to ensure secure and isolated internal communication between services. Example API payload:

{
    "name": "openAI-gpt-3.5-turbo",
    "function_name": "remote",
    "model_group_id": "1jriBYsBq7EKuKzZX131",
    "description": "test model",
    "connector_id": "a1eMb4kBJ1eYAeTMAljY",
    "tenant_id": "<Example application id>"
}

5.5. Multi-tenancy use case identifier

We need to send tenant_id in both layers (REST and transport). For multi-tenancy this field should be mandatory, and for regular OpenSearch it should be optional. To distinguish the two cases, we can introduce a setting, plugins.ml_commons.independent_node, which will be true for multi-tenancy and false for OpenSearch. Depending on this setting, tenant_id is treated as mandatory or optional. This check is needed in every transport and REST layer action.
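
The check described above can be sketched in Python as follows. The setting name plugins.ml_commons.independent_node comes from this RFC. The function name, the error type, and passing the setting in as a boolean are assumptions for the example.

```python
from typing import Optional


class TenantIdRequiredError(ValueError):
    """Raised when tenant_id is missing in multi-tenant mode."""


def validate_tenant_id(
    tenant_id: Optional[str], independent_node: bool
) -> Optional[str]:
    """Enforce tenant_id according to plugins.ml_commons.independent_node.

    independent_node=True  -> multi-tenancy: tenant_id is mandatory.
    independent_node=False -> regular OpenSearch: tenant_id is optional.
    """
    if independent_node and not tenant_id:
        raise TenantIdRequiredError(
            "tenant_id is required when "
            "plugins.ml_commons.independent_node is true"
        )
    # Normalize an empty string to None for the optional case.
    return tenant_id or None
```

Because the same rule applies to every REST and transport action, keeping it in one shared helper avoids repeating the conditional in each handler.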

6. API Support and Functionality

The following API functionalities will be supported:

CRUD operations for connectors.
Management of remote models.
Task management APIs.

7. BWC & Impact to current Single cluster use case

7.1 Enhanced Flexibility with Backward Compatibility

The introduction of multi-tenancy capabilities is designed to be fully backward compatible, ensuring that existing single-cluster setups can be upgraded without disruption. This seamless integration is facilitated by the plugins.ml_commons.independent_node setting, which distinguishes between single-tenant and multi-tenant environments. This approach ensures that all current functionalities remain intact while providing the option to leverage advanced multi-tenancy features.

7.2 Adaptive Access Control for Multi-Tenant Environments

In multi-tenant configurations, traditional model access controls are temporarily disabled to pave the way for a more robust, service-based authentication and authorization mechanism tailored for complex multi-tenant dynamics. This transitional phase is crucial for developing a secure, scalable multi-tenant architecture that can support diverse and dynamic tenant requirements without compromising security.

7.3 Utilizing Tenant ID in Single Tenant Clusters

For single tenant environments, the tenant_id parameter offers an innovative way to segment and manage resources on a project-by-project basis, even within a single tenant framework. This capability allows customers to organize and isolate resources effectively, providing an added layer of flexibility and control. Customers manage the tenant_id themselves, which enhances their ability to customize the setup according to their specific operational needs.

8. System Settings and Configuration

The system will use a mix of static and dynamic settings to manage operational policies and thresholds, ensuring that each tenant's environment is optimized for performance and security while maintaining the flexibility to adjust to specific needs.

9. Conclusion

This RFC proposes a strategic enhancement to the ML-Commons plugin to support robust multi-tenancy. Feedback on this proposal is encouraged to refine the approach and ensure the successful implementation of multi-tenancy in the ML-Commons ecosystem.

saratvemulapalli commented 6 months ago

@dhrubo-os thanks for the RFC. It absolutely makes sense for auxiliary plugins to explore multi-tenancy pathways. The proposal seems to be specific to ML Commons, but I believe this is a common problem for other plugins. Do you have thoughts on solving it generically? For example, plugins like Alerting, Anomaly Detection, Flow Framework, and SQL-Spark would have similar use cases.

reta commented 6 months ago

During the meeting we have discussed if it would make sense to change ml-commons from plugin to extension [1] since it gives more freedom regarding the deployment / scalability / implementation aspects. Alternatively, there is a general discussion regarding multi-tenancy support by OpenSearch Core [2], please feel free to contribute.

[1] https://github.com/opensearch-project/opensearch-sdk-java
[2] https://github.com/opensearch-project/OpenSearch/issues/13516

ansjcy commented 6 months ago

This looks like a very valid use case of the generic multi-tenancy support in OpenSearch we are discussing (as reta mentioned)!

> To distinguish that, we can open a setting field: plugins.ml_commons.independent_node which will be true for Multi-tenancy and false for Opensearch. Depending on this field we will mark tenant_id as mandatory/optional. We need this check for every transport and rest layer action.

This can be done by simply adding a rule-based tenancy labeller. We have a draft PR for a similar use case that attaches the tenancy label based on the authenticated user: https://github.com/opensearch-project/OpenSearch/pull/13374/files#diff-b4d03a88895891abd177d233d20dec21a8c87ec97b7042d158afeb9729f7b300 . Please take a look at the meta issue mentioned by reta and also this draft PR to provide any feedback!

dhrubo-os commented 6 months ago

> During the meeting we have discussed if it would make sense to change ml-commons from plugin to extension [1] since it gives more freedom regarding the deployment / scalability / implementation aspects.

@dbwiddis Do you have any input here?

dbwiddis commented 6 months ago

> During the meeting we have discussed if it would make sense to change ml-commons from plugin to extension [1] since it gives more freedom regarding the deployment / scalability / implementation aspects.

That's a potential end game, yes! However, the extensions SDK is not in a mature enough state to make that transition right now. We did get as far with extensions as identifying NamedRoutes and integrating with the security framework's token generation to pass around authenticated user information, so there's definitely some application of the work we previously did on extensions here.

More generally, I think this is part of an overall long term plan to separate data from code and logically separate things where we can continue to use them in a cluster environment while simultaneously enabling a faster/easier transition of data manipulation (compute, memory, storage, etc.) to other environments, both ones we can think of now and ones that may not yet exist...

opensearch-project / ml-commons

[RFC] Enhancement of Multi-Tenancy Capabilities in ML-Commons #2358
