ml-commons provides a set of common machine learning algorithms, e.g. k-means or linear regression, to help developers build ML-related features within OpenSearch.
Apache License 2.0
[BUG] model deployment fails -- Could not initialize class ai.djl.onnxruntime.engine.OrtNDManager #3207
Check the OpenSearch logs (sometimes a "connection timed out" error appears too; in that case I just try to deploy the model again)
What is the expected behavior?
Successful deployment of the model
What is your host/environment?
OpenSearch 2.18 running in Docker
Do you have any additional context?
org.opensearch.ml.common.exception.MLException: Failed to deploy model w1BJEpMBbOORGaoAR7h5
2024-11-09T19:29:46.547698532Z at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:300) ~[?:?]
2024-11-09T19:29:46.547704056Z at java.base/java.security.AccessController.doPrivileged(AccessController.java:571) ~[?:?]
2024-11-09T19:29:46.547708040Z at org.opensearch.ml.engine.algorithms.DLModel.loadModel(DLModel.java:252) ~[?:?]
2024-11-09T19:29:46.547723453Z at org.opensearch.ml.engine.algorithms.DLModel.initModel(DLModel.java:142) ~[?:?]
2024-11-09T19:29:46.547727230Z at org.opensearch.ml.engine.MLEngine.deploy(MLEngine.java:125) ~[?:?]
2024-11-09T19:29:46.547730758Z at org.opensearch.ml.model.MLModelManager.lambda$deployModel$52(MLModelManager.java:1083) ~[?:?]
2024-11-09T19:29:46.547734525Z at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.17.0.jar:2.17.0]
2024-11-09T19:29:46.547738193Z at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$73(MLModelManager.java:1703) [opensearch-ml-2.17.0.0.jar:2.17.0.0]
2024-11-09T19:29:46.547741754Z at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.17.0.jar:2.17.0]
2024-11-09T19:29:46.547745270Z at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.17.0.jar:2.17.0]
2024-11-09T19:29:46.547748852Z at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005) [opensearch-2.17.0.jar:2.17.0]
2024-11-09T19:29:46.547752467Z at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.17.0.jar:2.17.0]
2024-11-09T19:29:46.547755951Z at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
2024-11-09T19:29:46.547759414Z at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
2024-11-09T19:29:46.547762898Z at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
2024-11-09T19:29:46.547766339Z Caused by: java.lang.NoClassDefFoundError: Could not initialize class ai.djl.onnxruntime.engine.OrtNDManager
2024-11-09T19:29:46.547769823Z at ai.djl.onnxruntime.engine.OrtEngine.newBaseManager(OrtEngine.java:134) ~[?:?]
2024-11-09T19:29:46.547773286Z at ai.djl.onnxruntime.engine.OrtEngine.newModel(OrtEngine.java:122) ~[?:?]
2024-11-09T19:29:46.547779006Z at ai.djl.Model.newInstance(Model.java:99) ~[?:?]
2024-11-09T19:29:46.547782609Z at ai.djl.repository.zoo.BaseModelLoader.createModel(BaseModelLoader.java:196) ~[?:?]
2024-11-09T19:29:46.547786115Z at ai.djl.repository.zoo.BaseModelLoader.loadModel(BaseModelLoader.java:159) ~[?:?]
2024-11-09T19:29:46.547789621Z at ai.djl.repository.zoo.Criteria.loadModel(Criteria.java:174) ~[?:?]
2024-11-09T19:29:46.547795624Z at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:217) ~[?:?]
2024-11-09T19:29:46.547801105Z at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:286) ~[?:?]
2024-11-09T19:29:46.547804633Z ... 14 more
2024-11-09T19:29:46.547808106Z Caused by: java.lang.ExceptionInInitializerError: Exception ai.djl.engine.EngineException: Failed to save pytorch index file [in thread "opensearch[opensearch-node][opensearch_ml_deploy][T#7]"]
2024-11-09T19:29:46.547813577Z at ai.djl.pytorch.jni.LibUtils.downloadPyTorch(LibUtils.java:429) ~[?:?]
2024-11-09T19:29:46.547822391Z at ai.djl.pytorch.jni.LibUtils.findNativeLibrary(LibUtils.java:314) ~[?:?]
2024-11-09T19:29:46.547826200Z at ai.djl.pytorch.jni.LibUtils.getLibTorch(LibUtils.java:93) ~[?:?]
2024-11-09T19:29:46.547829717Z at ai.djl.pytorch.jni.LibUtils.loadLibrary(LibUtils.java:81) ~[?:?]
2024-11-09T19:29:46.547833234Z at ai.djl.pytorch.engine.PtEngine.newInstance(PtEngine.java:53) ~[?:?]
2024-11-09T19:29:46.547836783Z at ai.djl.pytorch.engine.PtEngineProvider.getEngine(PtEngineProvider.java:41) ~[?:?]
2024-11-09T19:29:46.547840279Z at ai.djl.engine.Engine.getEngine(Engine.java:190) ~[?:?]
2024-11-09T19:29:46.547843698Z at ai.djl.engine.Engine.getInstance(Engine.java:145) ~[?:?]
2024-11-09T19:29:46.547847149Z at ai.djl.onnxruntime.engine.OrtEngine.getAlternativeEngine(OrtEngine.java:75) ~[?:?]
2024-11-09T19:29:46.547850623Z at ai.djl.ndarray.BaseNDManager.<init>(BaseNDManager.java:64) ~[?:?]
2024-11-09T19:29:46.547854324Z at ai.djl.onnxruntime.engine.OrtNDManager.<init>(OrtNDManager.java:42) ~[?:?]
2024-11-09T19:29:46.547858210Z at ai.djl.onnxruntime.engine.OrtNDManager.<init>(OrtNDManager.java:35) ~[?:?]
2024-11-09T19:29:46.547861911Z at ai.djl.onnxruntime.engine.OrtNDManager$SystemManager.<init>(OrtNDManager.java:177) ~[?:?]
2024-11-09T19:29:46.547865450Z at ai.djl.onnxruntime.engine.OrtNDManager.<clinit>(OrtNDManager.java:37) ~[?:?]
2024-11-09T19:29:46.547869043Z at ai.djl.onnxruntime.engine.OrtEngine.newBaseManager(OrtEngine.java:134) ~[?:?]
2024-11-09T19:29:46.547872635Z at ai.djl.onnxruntime.engine.OrtEngine.newModel(OrtEngine.java:122) ~[?:?]
2024-11-09T19:29:46.547876120Z at ai.djl.Model.newInstance(Model.java:99) ~[?:?]
2024-11-09T19:29:46.547879582Z at ai.djl.repository.zoo.BaseModelLoader.createModel(BaseModelLoader.java:196) ~[?:?]
2024-11-09T19:29:46.547884022Z at ai.djl.repository.zoo.BaseModelLoader.loadModel(BaseModelLoader.java:159) ~[?:?]
2024-11-09T19:29:46.547887604Z at ai.djl.repository.zoo.Criteria.loadModel(Criteria.java:174) ~[?:?]
2024-11-09T19:29:46.547891131Z at org.opensearch.ml.engine.algorithms.DLModel.doLoadModel(DLModel.java:217) ~[?:?]
2024-11-09T19:29:46.547894789Z at org.opensearch.ml.engine.algorithms.DLModel.lambda$loadModel$1(DLModel.java:286) ~[?:?]
2024-11-09T19:29:46.547898415Z ... 14 more
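The trace above nests three exceptions, and the innermost one (`Failed to save pytorch index file`, raised from `LibUtils.downloadPyTorch`) is the most likely starting point for triage. That would be consistent with the node being unable to download or write the PyTorch native library files, although the trace alone does not confirm this. As a rough aid for reading logs like this, the following Python sketch pulls out only the exception headers, outermost first. The function name `cause_chain` and the assumption that every log entry is prefixed with a Docker-style nanosecond timestamp are illustrative, not part of ml-commons or OpenSearch.

```python
import re

# Assumption: each log entry starts with a Docker-style RFC 3339 timestamp
# with fractional seconds, e.g. 2024-11-09T19:29:46.547698532Z
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z")


def cause_chain(log_text):
    """Return the exception headers of a Java stack trace, outermost first.

    Hypothetical helper: works whether the trace is one entry per line
    or flattened into a single line, because it splits on timestamps.
    """
    entries = [part.strip() for part in TIMESTAMP.split(log_text)]
    entries = [entry for entry in entries if entry]

    chain = []
    for index, entry in enumerate(entries):
        if entry.startswith("Caused by:"):
            # Nested exception: keep everything after the "Caused by:" marker.
            chain.append(entry[len("Caused by:"):].strip())
        elif index == 0 and not entry.startswith("at "):
            # The first entry is the top-level exception unless it is a frame.
            chain.append(entry)
    return chain


def root_cause(log_text):
    """Return the innermost exception header, or None if none was found."""
    chain = cause_chain(log_text)
    return chain[-1] if chain else None
```

Calling `root_cause` on the log above returns the `ExceptionInInitializerError` header, which names the failing PyTorch index file step directly instead of the outer `MLException`.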
@jovanovic-milos, can you please share the command you used to register the model? We need it to reproduce the issue. Please also let us know which model type you used. Thanks.
What is the bug? Deployment of model is failing because of what seems to be an exception in ml-commons.
How can one reproduce the bug? Steps to reproduce the behavior: