opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means or linear regression, to help developers build ML-related features within OpenSearch.
Apache License 2.0

[BUG] model deployment fails -- Could not initialize class ai.djl.onnxruntime.engine.OrtNDManager #3207

Open jovanovic-milos opened 1 week ago

jovanovic-milos commented 1 week ago

What is the bug? Model deployment fails with what appears to be an exception thrown inside ml-commons.

How can one reproduce the bug? Steps to reproduce the behavior:

  1. Export the multilingual-e5-large model with optimum export (https://huggingface.co/intfloat/multilingual-e5-large)
  2. ZIP the model directory
  3. Register the model with OpenSearch via the API
  4. Deploy the model
  5. Check the OpenSearch logs (sometimes a "connection timed out" error appears as well; in that case I simply try to deploy the model again)
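
The exact register request is not shown in this report, so the following is only a sketch of what steps 3 and 4 might look like. It assumes the ML Commons register and deploy endpoints (`/_plugins/_ml/models/_register` and `/_plugins/_ml/models/<model_id>/_deploy`) and the documented fields for a custom text-embedding model. The model name, version, URL, and `model_type` value are hypothetical placeholders; the embedding dimension of 1024 is taken from the Hugging Face model card.

```python
import hashlib

REGISTER_PATH = "/_plugins/_ml/models/_register"


def deploy_path(model_id: str) -> str:
    """Path of the deploy endpoint for an already registered model."""
    return f"/_plugins/_ml/models/{model_id}/_deploy"


def build_register_body(zip_bytes: bytes, zip_url: str) -> dict:
    """Build a register-model request body for a zipped ONNX export.

    All literal values below are assumptions for illustration, not the
    reporter's actual settings.
    """
    return {
        "name": "multilingual-e5-large",          # hypothetical name
        "version": "1.0.0",                       # hypothetical version
        "model_format": "ONNX",                   # assumed from the OrtEngine frames in the trace
        "function_name": "TEXT_EMBEDDING",
        # SHA-256 of the ZIP file, which OpenSearch checks after download
        "model_content_hash_value": hashlib.sha256(zip_bytes).hexdigest(),
        "model_config": {
            "model_type": "xlm-roberta",          # assumption
            "embedding_dimension": 1024,          # per the model card
            "framework_type": "sentence_transformers",  # assumption
        },
        "url": zip_url,                           # where OpenSearch fetches the ZIP
    }


if __name__ == "__main__":
    body = build_register_body(b"fake zip bytes", "https://example.invalid/model.zip")
    print("POST", REGISTER_PATH)
    print(body)
    print("POST", deploy_path("w1BJEpMBbOORGaoAR7h5"))
```

Sharing the real request body in this shape (with any secrets removed) would make the issue much easier to reproduce.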

What is the expected behavior? Successful deployment of the model

What is your host/environment? OpenSearch 2.18 running in Docker

Do you have any additional context?

org.opensearch.ml.common.exception.MLException: Failed to deploy model w1BJEpMBbOORGaoAR7h5
2024-11-09T19:29:46.547698532Z at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:300) ~[?:?]
2024-11-09T19:29:46.547704056Z at java.base/java.security.AccessController.doPrivileged(AccessController.java:571) ~[?:?]
2024-11-09T19:29:46.547708040Z at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:252) ~[?:?]
2024-11-09T19:29:46.547723453Z at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:142) ~[?:?]
2024-11-09T19:29:46.547727230Z at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) ~[?:?]
2024-11-09T19:29:46.547730758Z at org.opensearch.ml.model.MLModelManager.lambda$deployModel$52(MLModelManager.java:1083) ~[?:?]
2024-11-09T19:29:46.547734525Z at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.17.0.jar:2.17.0]
2024-11-09T19:29:46.547738193Z at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$73(MLModelManager.java:1703) [opensearch-ml-2.17.0.0.jar:2.17.0.0]
2024-11-09T19:29:46.547741754Z at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.17.0.jar:2.17.0]
2024-11-09T19:29:46.547745270Z at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.17.0.jar:2.17.0]
2024-11-09T19:29:46.547748852Z at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005) [opensearch-2.17.0.jar:2.17.0]
2024-11-09T19:29:46.547752467Z at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.17.0.jar:2.17.0]
2024-11-09T19:29:46.547755951Z at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
2024-11-09T19:29:46.547759414Z at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
2024-11-09T19:29:46.547762898Z at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
2024-11-09T19:29:46.547766339Z Caused by: java.lang.NoClassDefFoundError: Could not initialize class ai.djl.onnxruntime.engine.OrtNDManager
2024-11-09T19:29:46.547769823Z at ai.djl.onnxruntime.engine.OrtEngine.newBaseManager(OrtEngine.java:134) ~[?:?]
2024-11-09T19:29:46.547773286Z at ai.djl.onnxruntime.engine.OrtEngine.newModel(OrtEngine.java:122) ~[?:?]
2024-11-09T19:29:46.547779006Z at ai.djl.Model.newInstance(Model.java:99) ~[?:?]
2024-11-09T19:29:46.547782609Z at ai.djl.repository.zoo.BaseModelLoader.createModel(BaseModelLoader.java:196) ~[?:?]
2024-11-09T19:29:46.547786115Z at ai.djl.repository.zoo.BaseModelLoader.loadModel(BaseModelLoader.java:159) ~[?:?]
2024-11-09T19:29:46.547789621Z at ai.djl.repository.zoo.Criteria.loadModel(Criteria.java:174) ~[?:?]
2024-11-09T19:29:46.547795624Z at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:217) ~[?:?]
2024-11-09T19:29:46.547801105Z at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:286) ~[?:?]
2024-11-09T19:29:46.547804633Z ... 14 more
2024-11-09T19:29:46.547808106Z Caused by: java.lang.ExceptionInInitializerError: Exception ai.djl.engine.EngineException: Failed to save pytorch index file [in thread "opensearch[opensearch-node][opensearch_ml_deploy][T#7]"]
2024-11-09T19:29:46.547813577Z at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:429) ~[?:?]
2024-11-09T19:29:46.547822391Z at ai.djl.pytorch.jni.LibUtils.findNativeLibrary(LibUtils.java:314) ~[?:?]
2024-11-09T19:29:46.547826200Z at ai.djl.pytorch.jni.LibUtils.getLibTorch(LibUtils.java:93) ~[?:?]
2024-11-09T19:29:46.547829717Z at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:81) ~[?:?]
2024-11-09T19:29:46.547833234Z at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53) ~[?:?]
2024-11-09T19:29:46.547836783Z at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:41) ~[?:?]
2024-11-09T19:29:46.547840279Z at ai.djl.engine.Engine.getEngine(Engine.java:190) ~[?:?]
2024-11-09T19:29:46.547843698Z at ai.djl.engine.Engine.getInstance(Engine.java:145) ~[?:?]
2024-11-09T19:29:46.547847149Z at ai.djl.onnxruntime.engine.OrtEngine.getAlternativeEngine(OrtEngine.java:75) ~[?:?]
2024-11-09T19:29:46.547850623Z at ai.djl.ndarray.BaseNDManager.<init>(BaseNDManager.java:64) ~[?:?]
2024-11-09T19:29:46.547854324Z at ai.djl.onnxruntime.engine.OrtNDManager.<init>(OrtNDManager.java:42) ~[?:?]
2024-11-09T19:29:46.547858210Z at ai.djl.onnxruntime.engine.OrtNDManager.<init>(OrtNDManager.java:35) ~[?:?]
2024-11-09T19:29:46.547861911Z at ai.djl.onnxruntime.engine.OrtNDManager$SystemManager.<init>(OrtNDManager.java:177) ~[?:?]
2024-11-09T19:29:46.547865450Z at ai.djl.onnxruntime.engine.OrtNDManager.<clinit>(OrtNDManager.java:37) ~[?:?]
2024-11-09T19:29:46.547869043Z at ai.djl.onnxruntime.engine.OrtEngine.newBaseManager(OrtEngine.java:134) ~[?:?]
2024-11-09T19:29:46.547872635Z at ai.djl.onnxruntime.engine.OrtEngine.newModel(OrtEngine.java:122) ~[?:?]
2024-11-09T19:29:46.547876120Z at ai.djl.Model.newInstance(Model.java:99) ~[?:?]
2024-11-09T19:29:46.547879582Z at ai.djl.repository.zoo.BaseModelLoader.createModel(BaseModelLoader.java:196) ~[?:?]
2024-11-09T19:29:46.547884022Z at ai.djl.repository.zoo.BaseModelLoader.loadModel(BaseModelLoader.java:159) ~[?:?]
2024-11-09T19:29:46.547887604Z at ai.djl.repository.zoo.Criteria.loadModel(Criteria.java:174) ~[?:?]
2024-11-09T19:29:46.547891131Z at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:217) ~[?:?]
2024-11-09T19:29:46.547894789Z at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:286) ~[?:?]
2024-11-09T19:29:46.547898415Z ... 14 more
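
The trace above is easier to read once reduced to its exception chain. The helper below is a minimal sketch, not part of ml-commons, that assumes every log entry carries a Docker-style RFC 3339 timestamp as in this report and keeps only the exception headers. The function name `exception_chain` and the abbreviated sample are made up for illustration.

```python
import re

# Docker-style timestamp that prefixes each log entry in the report
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z")
CAUSED_BY = "Caused by: "


def exception_chain(log_text: str) -> list:
    """Return the exception headers of a Java stack trace, outermost first.

    Stack frames ("at ...") and elision markers ("... 14 more") are dropped.
    Works whether the entries are on separate lines or run together.
    """
    chain = []
    for entry in TIMESTAMP.split(log_text):
        entry = entry.strip()
        if not entry or entry.startswith("at ") or entry.startswith("..."):
            continue
        if entry.startswith(CAUSED_BY):
            entry = entry[len(CAUSED_BY):]
        chain.append(entry)
    return chain


if __name__ == "__main__":
    # Abbreviated excerpt of the trace in this issue
    sample = (
        "org.opensearch.ml.common.exception.MLException: Failed to deploy model w1BJEpMBbOORGaoAR7h5 "
        "2024-11-09T19:29:46.547708040Z at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:252) ~[?:?] "
        "2024-11-09T19:29:46.547766339Z Caused by: java.lang.NoClassDefFoundError: "
        "Could not initialize class ai.djl.onnxruntime.engine.OrtNDManager "
        "2024-11-09T19:29:46.547804633Z ... 14 more "
        "2024-11-09T19:29:46.547808106Z Caused by: java.lang.ExceptionInInitializerError: "
        "Exception ai.djl.engine.EngineException: Failed to save pytorch index file "
        "2024-11-09T19:29:46.547813577Z at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:429) ~[?:?]"
    )
    for depth, header in enumerate(exception_chain(sample)):
        print("  " * depth + header)
```

Read this way, the innermost cause is the `EngineException` raised in `LibUtils.downloadPyTorch`: the frames suggest that initializing `OrtNDManager` looks up an alternative engine, which in turn tries to fetch the PyTorch native library and fails to save its index file.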

mingshl commented 3 days ago

@jovanovic-milos Could you please share the command you used to register the model? We need it to reproduce the issue. Please also let us know which model type you used. Thanks.