opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means, or linear regression, to help developers build ML related features within OpenSearch.
Apache License 2.0

[BUG] Cohere Blueprint issue #1344

Closed by dtaivpp 1 year ago

dtaivpp commented 1 year ago

What is the bug? When ingesting data using Cohere's blueprint with neural search, the ingestion pipeline returns:

{
  "took": 0,
  "ingest_took": 2,
  "errors": true,
  "items": [
    {
      "create": {
        "_index": "cohere-index",
        "_id": "1",
        "status": 400,
        "error": {
          "type": "illegal_argument_exception",
          "reason": "Must provide pre_process_function for predict action to process text docs input."
        }
      }
    }
  ]
}

How can one reproduce the bug? Steps to reproduce (DevTools):

# Create Connector: 
POST /_plugins/_ml/connectors/_create
{
   "name": "Cohere Connector",
   "description": "External connector for connections into Cohere",
   "version": "1.0",
   "protocol": "http",
   "credential": {
           "cohere_key": "<INSERT KEY>"
       },
    "parameters": {
      "model": "embed-english-v2.0",
      "truncate": "END"
    },
   "actions": [{
       "action_type": "predict",
       "method": "POST",
       "url": "https://api.cohere.ai/v1/embed",
       "headers": {
               "Authorization": "Bearer ${credential.cohere_key}"
           },
       "request_body": "{ \"texts\": ${parameters.texts}, \"truncate\": \"${parameters.truncate}\", \"model\": \"${parameters.model}\" }"
       }]
}

# Register a remote model that uses the connector
POST /_plugins/_ml/models/_register
{
    "name": "embed-english-v2.0",
    "function_name": "remote",
    "description": "test model",
    "connector_id": "9vxqm4oBYFYMFuLi9oOR"
}

# Deploy the model
POST /_plugins/_ml/models/_Pxrm4oBYFYMFuLiG4PH/_deploy

PUT _ingest/pipeline/cohere-ingest-pipeline
{
  "description": "Cohere Neural Search Pipeline",
  "processors" : [
    {
      "text_embedding": {
        "model_id": "_Pxrm4oBYFYMFuLiG4PH",
        "field_map": {
          "content": "content_embedding"
        }
      }
    }
  ]
}

# Create KNN index. Note: the space_type must match the model's recommended space, e.g. embed-english-v2.0 recommends cosine similarity
PUT /cohere-index
{
    "settings": {
        "index.knn": true,
        "default_pipeline": "cohere-ingest-pipeline"
    },
    "mappings": {
        "properties": {
            "content_embedding": {
                "type": "knn_vector",
                "dimension": 4096,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib"
                }
            },
            "content": {
                "type": "text"
            }
        }
    }
}

POST /_bulk
{ "create" : { "_index" : "cohere-index", "_id" : "1" } }
{ "content":"Testing neural search"}
{ "create" : { "_index" : "cohere-index", "_id" : "2" } }
{ "content":"What are we doing"}
{ "create" : { "_index" : "cohere-index", "_id" : "3" } }
{ "content":"This should exist"}

GET /cohere-index/_search
{
  "query": {
    "bool" : {
      "should" : [
        {
          "script_score": {
              "neural": {
                "content_embedding": {
                  "query_text": "How do I ingest to opensearch",
                  "k": 10
              }
            },
            "script": {
              "source": "_score * 1.5"
            }
          }
        }
        ,
        {
          "script_score": {
            "query": {
              "match": { "content": "I want information about the new compression algorithems in OpenSearch" }
            },
            "script": {
              "source": "_score * 1.7"
            }
          }
        }
      ]
    }
  }
}

What is the expected behavior? Expect data to be ingested.

What is your host/environment?

ylwu-amzn commented 1 year ago

I tested this on 2.10, and it works.

Step 1. Create model group

POST /_plugins/_ml/model_groups/_register
{
  "name": "my_remote_model_group_cohere",
  "description": "This is a test group"
}

response

{
    "model_group_id": "wySNm4oBRiMywALe-EvK",
    "status": "CREATED"
}

Step 2. Create connector

POST /_plugins/_ml/connectors/_create
{
    "name": "Cohere enbmedding",
    "description": "my test connector",
    "version": "1.0",
    "protocol": "http",
    "credential": {
        "cohere_key": "<your_cohere_key>"
    },
    "parameters": {
        "model": "embed-english-v2.0",
        "truncate": "END"
    },
    "actions": [
        {
            "action_type": "predict",
            "method": "POST",
            "url": "https://api.cohere.ai/v1/embed",
            "headers": {
                "Authorization": "Bearer ${credential.cohere_key}"
            },
            "request_body": "{ \"texts\": ${parameters.prompt}, \"truncate\": \"${parameters.truncate}\", \"model\": \"${parameters.model}\" }",
            "pre_process_function": "connector.pre_process.cohere.embedding",
            "post_process_function": "connector.post_process.cohere.embedding"
        }
    ]
}

The key part, which is missing from the blueprint:

"pre_process_function": "connector.pre_process.cohere.embedding",
"post_process_function": "connector.post_process.cohere.embedding"

These two process functions are mandatory if you want the Cohere model to work with neural-search. Because remote model inputs and outputs vary a lot, the pre-process function transforms the input to fit the remote model, and the post-process function transforms the output to fit neural-search. Find more details in this doc.
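To make the role of these functions concrete, here is a rough Python sketch of the transformations the built-in `connector.pre_process.cohere.embedding` and `connector.post_process.cohere.embedding` functions perform. The field names (`texts`, `embeddings`, `sentence_embedding`) come from the requests and responses shown in this thread; the actual implementations live inside ml-commons and may differ in detail.

```python
# Illustrative sketch only: models the input/output mapping that the
# built-in Cohere pre/post process functions perform for neural-search.

def pre_process(text_docs):
    """Map neural-search text docs to the shape the Cohere /v1/embed
    request_body template expects (a `texts` parameter)."""
    return {"parameters": {"texts": text_docs}}

def post_process(cohere_response):
    """Map Cohere's `embeddings` list of vectors to the
    sentence_embedding output shape neural-search consumes."""
    return [
        {
            "name": "sentence_embedding",
            "data_type": "FLOAT32",
            "shape": [len(vec)],
            "data": vec,
        }
        for vec in cohere_response["embeddings"]
    ]
```

Without the pre-process step, the predict action receives raw text docs it cannot map onto `${parameters.texts}`, which is exactly the "Must provide pre_process_function" error above.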

Response

{
    "connector_id": "2SSRm4oBRiMywALecEuh"
}

Step 3. Register model

POST /_plugins/_ml/models/_register?deploy=true
{
    "name": "Cohere embedding model",
    "function_name": "remote",
    "model_group_id": "wySNm4oBRiMywALe-EvK",
    "description": "test model",
    "connector_id": "2SSRm4oBRiMywALecEuh"
}

Response

{
    "task_id": "4CSSm4oBRiMywALeRktD",
    "status": "CREATED"
}

Then use the Get Task API to find the model ID:

GET /_plugins/_ml/tasks/4CSSm4oBRiMywALeRktD

response

{
    "model_id": "4SSSm4oBRiMywALeRktd",
    "task_type": "REGISTER_MODEL",
    "function_name": "REMOTE",
    "state": "COMPLETED",
    "worker_node": [
        "lkN3LiY3SfmR6DRUO7SR3Q"
    ],
    "create_time": 1694827169347,
    "last_update_time": 1694827169384,
    "is_async": false
}
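Since registration is asynchronous, a client typically polls the task until it reaches COMPLETED before using the model ID. A minimal polling sketch, assuming an injected `get_task` callable that stands in for `GET /_plugins/_ml/tasks/<task_id>` (the wrapper and its names are illustrative, not an ml-commons API):

```python
import time

def wait_for_model_id(get_task, task_id, timeout=60.0, interval=1.0):
    """Poll the injected task-fetch callable until the registration task
    completes, then return the model_id from the task document."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        task = get_task(task_id)
        state = task.get("state")
        if state == "COMPLETED":
            return task["model_id"]
        if state == "FAILED":
            raise RuntimeError(f"model registration failed: {task}")
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not complete in {timeout}s")
```

With the response above, a completed task document yields the model ID used in the predict call that follows.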

Step 4. Predict

POST /_plugins/_ml/models/4SSSm4oBRiMywALeRktd/_predict
{
  "parameters": {
    "texts": ["Say this is a test"]
  }
}

Response

{
    "inference_results": [
        {
            "output": [
                {
                    "name": "sentence_embedding",
                    "data_type": "FLOAT32",
                    "shape": [
                        4096
                    ],
                    "data": [
                        -0.77246094,
                        -0.12927246,
                        -0.52490234,
            ...
                    ]
                }
            ]
        }
    ]
}
dtaivpp commented 1 year ago

@ylwu-amzn okay, this is working in 2.9 with the predict endpoint. There still seems to be an issue with ingestion, as I am getting the following when using _bulk with the ingestion pipeline. Does the text_embedding step of the ingestion pipeline get created as part of ML Commons, or does that live elsewhere?

    {
      "create": {
        "_index": "cohere-index",
        "_id": "1",
        "status": 400,
        "error": {
          "type": "illegal_argument_exception",
          "reason": "Invalid JSON in payload"
        }
      }
    }
ylwu-amzn commented 1 year ago

The ingestion pipeline is in the neural-search plugin. Can you share your k-NN index and ingestion pipeline settings? I need to reproduce this issue to debug.

dtaivpp commented 1 year ago

Thanks for the debugging, @ylwu-amzn! For future readers who find this thread: the issues have been resolved in the blueprint. There was a parameter that should have been $parameters.texts but was incorrectly put into the Java code as $parameters.prompt. I've updated the blueprint to reflect $parameters.prompt until the Java code can be fixed.
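The parameter-name mismatch explains the "Invalid JSON in payload" error: if the request_body template references a parameter the caller never supplies, the placeholder survives substitution and the resulting payload is not valid JSON. A hedged illustration using Python's `string.Template` (placeholders simplified to `${texts}` etc., since the real templates use `${parameters.texts}`; `render` and `REQUEST_BODY` are illustrative names, and ml-commons' actual substitution logic differs in detail):

```python
import json
from string import Template

# Simplified stand-in for the connector's request_body template.
REQUEST_BODY = '{ "texts": ${texts}, "truncate": "${truncate}", "model": "${model}" }'

def render(template, params):
    """Substitute parameters into the template, JSON-encoding list
    values, then validate that the result parses as JSON."""
    body = Template(template).safe_substitute(
        {k: json.dumps(v) if isinstance(v, list) else v for k, v in params.items()}
    )
    json.loads(body)  # raises if an unsubstituted placeholder remains
    return body
```

Supplying `texts` renders valid JSON; supplying only `prompt` (as the mismatched Java code expected) leaves `${texts}` in the body, and parsing fails.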

The second issue was the pre/post process functions that were missing from the templates. These have been added in PR #1351.