opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means, or linear regression, to help developers build ML related features within OpenSearch.
Apache License 2.0

[BUG] Registering Pretrained Opensearch Model Fails Due to Null Model Config #2981

mohamey commented 1 week ago

What is the bug?

On OpenSearch 2.11.0, attempting to register OpenSearch pretrained models like amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill fails with the error below:

{
  "task_type": "REGISTER_MODEL",
  "function_name": "SPARSE_TOKENIZE",
  "state": "FAILED",
  "worker_node": [
    "<redacted>"
  ],
  "create_time": 1727095445453,
  "last_update_time": 1727095445784,
  "error": "model config is null",
  "is_async": true
}

The request used is:

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill",
  "function_name": "SPARSE_TOKENIZE",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT",
  "model_content_hash_value": "86bab435d031edb2a6d921fd9ac317a7541d5d95666f642b606e7d0ebfb84358",
  "description": "This is a neural sparse encoding model: It transfers text into sparse vector, and then extract nonzero index and value to entry and weights. It serves only in ingestion and customer should use tokenizer model in query."
}

`model config is null` appears in only two places in the codebase, and I suspect it's coming from this class. However, I don't think the above request matches the conditions for this error, which is what leads me to believe this might be a bug.

Can you confirm? If so, could you provide some context on how it should work? I'm happy to submit a fix.
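For illustration only, here is a sketch of the kind of null check that could produce this error. This is a hypothetical Python reconstruction of the suspected logic, not the actual ml-commons Java code; all names in it are assumptions:

```python
# Hypothetical sketch of a registration-time validation step that could
# emit "model config is null". NOT the real ml-commons implementation.

def validate_register_request(request: dict) -> None:
    """Reject locally uploaded models that omit a model_config section.

    The bug report suggests a check like this may also be firing for
    pretrained models, which provide neither a url nor a model_config.
    """
    has_url = "url" in request       # user-supplied model artifact
    is_pretrained = not has_url      # pretrained models omit the URL

    if request.get("model_config") is None and not is_pretrained:
        raise ValueError("model config is null")


# The request from the bug report: pretrained, no url, no model_config.
request = {
    "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill",
    "function_name": "SPARSE_TOKENIZE",
    "version": "1.0.0",
    "model_format": "TORCH_SCRIPT",
}

# Under this hypothetical logic the request should pass validation; the
# reported failure suggests the real check does not take this branch.
validate_register_request(request)
```

If the real check resembles this, the fix would be to skip the `model_config` requirement when the model resolves to a built-in pretrained artifact.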

How can one reproduce the bug? Steps to reproduce the behavior:

  1. Provision a new OpenSearch 2.11 server. I verified this bug myself on both the Docker image and a managed OpenSearch cluster provided by AWS.

  2. Configure the ML Plugin using the following:

    {
      "persistent": {
        "plugins.ml_commons.only_run_on_ml_node": false,
        "plugins.ml_commons.native_memory_threshold": "99",
        "plugins.ml_commons.model_access_control_enabled": "true"
      }
    }
  3. Make the following request to register a pretrained OpenSearch model (which, per the docs, should not require a URL):

    POST /_plugins/_ml/models/_register?deploy=true
    {
      "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill",
      "function_name": "SPARSE_TOKENIZE",
      "version": "1.0.0",
      "model_format": "TORCH_SCRIPT",
      "model_content_hash_value": "86bab435d031edb2a6d921fd9ac317a7541d5d95666f642b606e7d0ebfb84358",
      "description": "This is a neural sparse encoding model: It transfers text into sparse vector, and then extract nonzero index and value to entry and weights. It serves only in ingestion and customer should use tokenizer model in query."
    }
  4. Get the task ID from the response in Step 3, and check its status using `GET /_plugins/_ml/tasks/:task_id`

  5. Observe the error, which states `model config is null`
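The steps above can be scripted end to end. A minimal sketch using only the Python standard library, which renders each step as a curl command (the `localhost:9200` endpoint is an assumption; adjust host and auth for your cluster):

```python
import json

# Payloads from the reproduction steps above. Actually sending them
# against a 2.11 cluster is left to the reader.

CLUSTER_SETTINGS = {
    "persistent": {
        "plugins.ml_commons.only_run_on_ml_node": False,
        "plugins.ml_commons.native_memory_threshold": "99",
        "plugins.ml_commons.model_access_control_enabled": "true",
    }
}

REGISTER_MODEL = {
    "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill",
    "function_name": "SPARSE_TOKENIZE",
    "version": "1.0.0",
    "model_format": "TORCH_SCRIPT",
    "model_content_hash_value": "86bab435d031edb2a6d921fd9ac317a7541d5d95666f642b606e7d0ebfb84358",
    # description omitted for brevity; see the request body above
}

def as_curl(method: str, path: str, body: dict,
            host: str = "http://localhost:9200") -> str:
    """Render one reproduction step as a curl command (host is an assumption)."""
    return (
        f"curl -X {method} '{host}{path}' "
        f"-H 'Content-Type: application/json' -d '{json.dumps(body)}'"
    )

# Step 2: configure the ML plugin.
print(as_curl("PUT", "/_cluster/settings", CLUSTER_SETTINGS))
# Step 3: register (and deploy) the pretrained model.
print(as_curl("POST", "/_plugins/_ml/models/_register?deploy=true", REGISTER_MODEL))
# Step 4: poll the task returned by step 3 (substitute the real task id).
print("curl 'http://localhost:9200/_plugins/_ml/tasks/<task_id>'")
```

On a buggy cluster, the task polled in step 4 reports `"state": "FAILED"` with `"error": "model config is null"`, as shown at the top of this report.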

What is the expected behavior? The requested model should register and deploy successfully.

What is your host/environment? OpenSearch 2.11.0 (both the Docker image and an AWS-managed cluster, as noted above).

Do you have any screenshots? N/A

Do you have any additional context? I've largely been following this tutorial from the OpenSearch docs.

dhrubo-os commented 1 week ago

I'm not sure why you need to provide model_content_hash_value in the request. Can you try this example?

@xinyual could you please look into this issue?