opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means or linear regression, to help developers build ML-related features within OpenSearch.
Apache License 2.0

[BUG]-(flaky tests) RestMLInferenceSearchResponseProcessorIT Model NOT Found Exception #3228

Open mingshl opened 3 days ago

mingshl commented 3 days ago

What is the bug? We need to confirm that the model ID is available after the model creation task finishes. Sometimes model creation has not completed, so later tests that use the model ID cannot find the model.
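One way to make that confirmation explicit is to poll the registration task until it reports a model ID, and to fail with a clear message otherwise. The following is only a minimal sketch: the class `ModelIdPolling`, the method `waitForModelId`, and the `fetchTask` supplier are hypothetical names, and the `state` and `model_id` keys and the `FAILED` value are assumptions about the shape of the task response, not the test suite's actual helper.

```java
import java.util.Map;
import java.util.function.Supplier;

public class ModelIdPolling {

    /**
     * Polls a task-status source until it reports a model_id.
     * fetchTask is a hypothetical stand-in for "GET the task and parse the JSON body".
     * The "state" and "model_id" keys are assumed, not confirmed by the issue.
     */
    public static String waitForModelId(Supplier<Map<String, Object>> fetchTask,
                                        int maxAttempts,
                                        long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Map<String, Object> task = fetchTask.get();
            // Stop early if the task itself reports a failure (assumed state value).
            if ("FAILED".equals(task.get("state"))) {
                throw new IllegalStateException("Model registration failed: " + task);
            }
            Object modelId = task.get("model_id");
            if (modelId != null) {
                return modelId.toString();
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for model_id", e);
            }
        }
        // Fail here instead of letting a null ID reach /models/null/_deploy.
        throw new IllegalStateException("No model_id after " + maxAttempts + " attempts");
    }
}
```

With a guard like this, a slow registration would surface as a timeout in setup rather than as a 404 in a later deploy call.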

How can one reproduce the bug?

RestMLInferenceSearchResponseProcessorIT > testMLInferenceProcessorRemoteModelStringField STANDARD_ERROR
    REPRODUCE WITH: ./gradlew ':opensearch-ml-plugin:integTest' --tests "org.opensearch.ml.rest.RestMLInferenceSearchResponseProcessorIT.testMLInferenceProcessorRemoteModelStringField" -Dtests.seed=9E7BCE94AFC0318E -Dtests.security.manager=false -Dtests.locale=luy-KE -Dtests.timezone=Asia/Dubai -Druntime.java=21

RestMLInferenceSearchResponseProcessorIT > testMLInferenceProcessorRemoteModelStringField FAILED
    org.opensearch.client.ResponseException: method [POST], host [http://127.0.0.1:33169/], URI [/_plugins/_ml/models/null/_deploy], status line [HTTP/1.1 404 Not Found]
    {"error":{"root_cause":[{"type":"status_exception","reason":"Failed to find model"}],"type":"status_exception","reason":"Failed to find model"},"status":404}

brianf-aws commented 3 days ago

I was trying to take a look at this, but I can't find a scenario where this doesn't work. It calls the following:

// In setup()
this.bedrockEmbeddingModelId = registerRemoteModel(bedrockEmbeddingModelConnectorEntity, bedrockEmbeddingModelName, true);

The method above also tries to deploy the model. That is where the failed deploy call comes from, because model_id is null at that point. My only guess as to why this happens is that the shape of the response is different: instead of model_id being a top-level key, perhaps it is nested on a different machine? https://github.com/opensearch-project/ml-commons/blob/05f78aff15dfc0955537fe150923af2e7ff3f4c3/plugin/src/test/java/org/opensearch/ml/rest/MLCommonsRestTestCase.java#L1009
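To help tell these cases apart, the register step could reject a missing model_id before building the deploy URI, so the failure message includes the full response body. This is a rough sketch only: `ModelIdGuard`, `requireModelId`, and `deployPath` are hypothetical names, and the assumption that the parsed response is a `Map` with a top-level `model_id` key comes from the speculation above, not from the linked helper.

```java
import java.util.Map;

public class ModelIdGuard {

    /**
     * Returns the top-level model_id from a parsed response map,
     * or throws with the whole response so its actual shape is visible in the log.
     */
    public static String requireModelId(Map<String, Object> response) {
        Object modelId = response.get("model_id");
        if (modelId == null || modelId.toString().isEmpty()) {
            throw new IllegalStateException("Register response has no model_id: " + response);
        }
        return modelId.toString();
    }

    /** Builds the deploy URI seen in the failing request, but never with a null ID. */
    public static String deployPath(Map<String, Object> response) {
        return "/_plugins/_ml/models/" + requireModelId(response) + "/_deploy";
    }
}
```

If the flaky run hits this check, the exception text would show whether model_id was absent, nested, or simply not yet populated.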