opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means or linear regression, to help developers build ML-related features within OpenSearch.
Apache License 2.0

[RFC] Model Inference Caches #3055

Open jngz-es opened 2 weeks ago

jngz-es commented 2 weeks ago

Problem statement

Usually model inference is expensive, especially for large models. Without caching, repeated requests with the same input pay that cost every time.

Motivation

With the caching feature, we can

  1. reduce the latency of model inference.
  2. save the cost of model inference

Proposed Design

Phase 0

We allow users to enable the cache feature for models.

Enable cache

POST /_plugins/_ml/models/_register?deploy=true
{
    "name": "anthropic.claude-v3",
    "function_name": "remote",
    "model_group_id": "<group id>",
    "description": "claude v3 model",
    "connector_id": "<connector id>",
    "cache_enabled": true,
    "cache_config": {
        "eviction_policy": "lru",
        "ttl": 600, # 600s
        "capacity": 1000
    }
}

All cache parameters are optional. By default the cache is disabled. If the cache is enabled, the system reads the cache-related config; if no config is present, the system falls back to default values.
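For example, a model can enable the cache without any cache_config, in which case the system uses the defaults (the default values themselves are not fixed by this RFC):

POST /_plugins/_ml/models/_register?deploy=true
{
    "name": "anthropic.claude-v3",
    "function_name": "remote",
    "model_group_id": "<group id>",
    "description": "claude v3 model",
    "connector_id": "<connector id>",
    "cache_enabled": true
}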

Config parameters

eviction_policy - determines how eviction is performed when the capacity limit is reached.
ttl - determines how long an item is held in the cache.
capacity - the soft limit on the cache volume; it is overridden by the hard limit if it exceeds the hard limit.

Disable cache

PUT /_plugins/_ml/models/<model_id>
{
  "cache_enabled": false
}

Disabling the cache for a model removes all data associated with that model from the cache.

Update cache

PUT /_plugins/_ml/models/<model_id>
{
    "cache_config": {
        "eviction_policy": "lru",
        "ttl": 600, # 600s
        "capacity": 1000
    }
}

Storage

We leverage an OpenSearch index to store the cached data.
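As a sketch of what the backing index could look like (the index name and field names are illustrative assumptions, not part of this RFC), each cached entry would hold the cache key, the model response, and a creation timestamp for TTL checks:

PUT /<cache index name>
{
    "mappings": {
        "properties": {
            "model_id": { "type": "keyword" },
            "cache_key": { "type": "keyword" },
            "response": { "type": "object", "enabled": false },
            "create_time": { "type": "date" } # used for ttl expiration checks
        }
    }
}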

Cache key

Model id + model config + user input
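For illustration only (the exact layout and hashing scheme are assumptions, not part of this RFC), a cached entry built from that key could look like:

# hypothetical cached entry
{
    "model_id": "<model id>",
    "cache_key": "<hash of model id + model config + user input>",
    "response": "<model response>",
    "create_time": 1730000000000 # epoch millis, for ttl checks
}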

Cleanup

Security

We leverage the existing model permission control for cache access permission control.

Phase 1

We introduce new cache APIs that expose caching as a standalone service, so a cache can be created once and then referenced by models (see the use case example below).

API

Create cache

POST /_plugins/_ml/cache/_create
{
    "type": "Local/Remote",
    "name": "test cache",
    "description": "test cache",
    "connector": "connector_id" # required by remote type like Elasticache
    "config": {
        "eviction_policy": "lru",
        "ttl": 600,
        "capacity": 1000
    }
}

#Response
{
    "cache_id": "gW8Aa40BfUsSoeNTvOKI"
}

Get cache meta

# Get single cache meta data
GET /_plugins/_ml/cache/<cache_id>

#Response
{
    "cache_id": "gW8Aa40BfUsSoeNTvOKI",
    "type": "Local/Remote",
    "name": "test cache",
    "description": "test cache",
    "connector": "connector_id"
}

# Get all caches
GET /_plugins/_ml/cache

#Response
{
    "caches": [
        {
            "cache_id": "gW8Aa40BfUsSoeNTvOKI",
            "type": "Local/Remote",
            "name": "test cache",
            "description": "test cache",
            "connector": "connector_id"
        }
    ]
}

Delete cache

DELETE /_plugins/_ml/cache/<cache_id>

Cache set

PUT /_plugins/_ml/cache/<cache_id>/_set?ttl=600
{
    "key": "value (ex. model response)"
}

# multiple set
PUT /_plugins/_ml/cache/<cache_id>/_mset?ttl=600
{
    "key1": "value1",
    "key2": "value2",
    ...
}

The set operation automatically stores a create_time field for TTL calculation.

Cache get

GET /_plugins/_ml/cache/<cache_id>/_get
{
    "key": "key name"
}

# multiple get
GET /_plugins/_ml/cache/<cache_id>/_mget
{
    "keys": [key1, key2, ...]
}

If the TTL has expired, the cache returns null and removes the key.
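The response shape is not spelled out in this RFC; purely as an illustration, a hit and a miss could look like:

# hypothetical response for a cache hit
{
    "key name": "cached value"
}

# hypothetical response for a miss or an expired ttl
{
    "key name": null
}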

Cache delete

DELETE /_plugins/_ml/cache/<cache_id>/_delete
{
    "key": "key name"
}

# multiple delete
DELETE /_plugins/_ml/cache/<cache_id>/_mdelete
{
    "keys": [key1, key2, ...]
}

Cache types

Local cache

We build the local cache on top of OpenSearch index functionality. To simplify the design, we don't introduce a new distributed cache such as Redis or Memcached into the cluster; instead, we use an OpenSearch index as the cache store.

Remote cache

We leverage the existing connector mechanism to access a remote cache service such as Elasticache and build a remote cache for customers. This requires a new connector type for caches rather than models: instead of a predict action, the connector exposes get/set actions.

An example connector

POST /_plugins/_ml/connectors/_create
{
    "name": "Elasticache Connector",
    "description": "The connector to Elasticache",
    "version": 1,
    "protocol": "http",
    "parameters": {
        "host": "xxx.yyy.clustercfg.zzz1.cache.amazonaws.com",
        "port": "6379"
    },
    "credential": {
        "key": "..."
    },
    "actions": [
        {
            "action_type": "cache",
            "method": "get/put"
        }
    ]
}
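With a connector like the one above, a remote cache can then be created through the Phase 1 create API by referencing it (ids and names below are placeholders):

POST /_plugins/_ml/cache/_create
{
    "type": "Remote",
    "name": "elasticache cache",
    "description": "remote cache backed by Elasticache",
    "connector": "<connector id>",
    "config": {
        "eviction_policy": "lru",
        "ttl": 600,
        "capacity": 1000
    }
}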

Use case example

Cache for models

POST /_plugins/_ml/models/_register?deploy=true
{
    "name": "anthropic.claude-v3",
    "function_name": "remote",
    "model_group_id": "<group id>",
    "description": "claude v3 model",
    "connector_id": "<connector id>",
    "cache_id": "<cache id>"
}
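Once registered this way, repeated predict calls with the same input could be answered from the cache rather than invoking the model again; a sketch (the request body depends on the model's connector parameters):

POST /_plugins/_ml/models/<model_id>/_predict
{
    "parameters": {
        "prompt": "What is OpenSearch?"
    }
}
# the first call invokes the model and stores the response in the cache;
# an identical follow-up call can be served from the cache until the ttl expires or the entry is evicted.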
brianf-aws commented 2 weeks ago

Hey Jing, this feature sounds amazing. I think it would help if you could provide an example of how this could be used, like caching an embedding.

Besides LRU, what other eviction policies do you plan to implement?

zane-neo commented 1 week ago

@jngz-es, this feature looks good. Several questions:

  1. Are we going to support exact match or semantic match? If only exact match is supported, do we have an expected hit rate for it?
  2. Do we need to support enabling/disabling cache reads on the fly? E.g. I might not want cached data for a question because I'm seeking a different answer.
  3. Do we need to add user_id (if a user_id is present) to the cache key to avoid leaking private data?