Open ArranDengate-Netapp opened 7 months ago
Thanks for opening the issue, @ArranDengate-Netapp. I was poking around with the demo [1], and I see it's using a pre-trained model downloaded from artifacts.opensearch.org. Evidently it just downloads the model, which does not include an engine (e.g. pytorch-engine).
I see 3 feature enhancements:
Under this circumstance, GET /_plugins/_ml/models/ tells us the deploy failed, but does not provide a reason. (Not sure if the task API would provide more info - I couldn't see how to get opensearch-py-ml to give me the task ID.)
It should be fairly straightforward to get the task ID [4] from deploy model: the deploy model API returns a task ID, which you can then query through the Task API [5].
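To make that concrete, here is a minimal Python sketch of the two calls involved. It assumes the deploy response carries a `task_id` field and that a failed task reports `state: FAILED` together with an `error` field, as the ML Commons docs appear to describe; `BASE_URL` and all helper names are hypothetical, and authentication and TLS setup are left out.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://localhost:9200"  # hypothetical cluster address


def deploy_path(model_id: str) -> str:
    """REST path of the deploy model API for one model."""
    return f"/_plugins/_ml/models/{model_id}/_deploy"


def task_path(task_id: str) -> str:
    """REST path of the Task API for one task."""
    return f"/_plugins/_ml/tasks/{task_id}"


def extract_task_id(deploy_response: dict) -> str:
    """Pull the task ID out of a deploy response (assumed field: task_id)."""
    return deploy_response["task_id"]


def failure_reason(task: dict) -> Optional[str]:
    """Return the error text of a failed task, or None if it did not fail."""
    if task.get("state") == "FAILED":
        return task.get("error", "no error message reported")
    return None


def fetch_json(method: str, path: str) -> dict:
    """Send a request to the cluster and decode the JSON reply (no auth shown)."""
    request = urllib.request.Request(BASE_URL + path, method=method)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Against a live cluster the flow would be `task_id = extract_task_id(fetch_json("POST", deploy_path(model_id)))` followed by `failure_reason(fetch_json("GET", task_path(task_id)))`.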
That said, I am fairly new to this repo. I'd like to hear thoughts from other maintainers who are pretty active @ylwu-amzn @austintlee @HenryL27 .
[1] https://opensearch-project.github.io/opensearch-py-ml/examples/demo_ml_commons_integration.html
[2] https://github.com/opensearch-project/ml-commons/tree/main/docs
[3] https://opensearch.org/docs/latest/ml-commons-plugin/
[4] https://opensearch-project.github.io/opensearch-py-ml/examples/demo_ml_commons_integration.html#Step-2:-Load-Model
[5] https://opensearch.org/docs/latest/ml-commons-plugin/api/tasks-apis/index/
@ArranDengate-Netapp, thanks for cutting this issue.
We thought about this use case (the cluster has no network access) when we built the feature. One option we considered was bundling the dependencies into the OpenSearch release. The challenge is that we would need to cover different hardware and different versions, and it would make the OpenSearch distribution much bigger. We didn't find other good options, so we did not prioritize this use case. We can pick this topic up again and discuss it further; any comments or suggestions are welcome.
One workaround:
@ylwu-amzn I see, that's a difficult tradeoff.
That workaround sounds good! I would like to check: when you say "copy dependency from the test cluster to the production cluster", would that just be the contents of the ml_cache directory? (e.g., for the RPM install of OpenSearch: /var/lib/opensearch/ml_cache)
(Oops, didn't mean to close...)
@ArranDengate-Netapp Try adding a proxy and see if it works
Step 1: Edit /etc/sysconfig/opensearch
Step 2: Add the line OPENSEARCH_JAVA_OPTS="-Dhttp.proxyHost=YOURPROXY -Dhttp.proxyPort=YOURPORT -Dhttps.proxyHost=YOURPROXY -Dhttps.proxyPort=YOURPORT -Dhttp.nonProxyHosts=localhost|127.0.0.1|10...|.local"
Step 3: Restart the cluster
Let me know if it worked for you.
Hi @brunowcs ,
Wow, I didn't realise Java had built-in proxy support!
I don't think this approach will work for us, but this could be a useful workaround for other people affected by this issue. I am involved with two use-cases:
@ylwu-amzn I see, that's a difficult tradeoff.
That workaround sounds good! I would like to check:
- when you say "copy dependency from the test cluster to the production cluster", would that just be the contents of the ml_cache directory? (e.g., for the RPM install of OpenSearch: /var/lib/opensearch/ml_cache)
- once the model has been uploaded and deployed, nothing else will need to be downloaded later, right? (That would make sense, I just want to confirm)
- does the ML cache ever get cleared, in such a way that we would need to re-download the model?
For question 1, yes, just copy the whole ml_cache directory.
For question 2, correct, nothing else.
For question 3, no, unless you manually delete the local cached file.
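As a rough illustration of "copy the whole ml_cache directory", the sketch below copies one directory tree onto another with Python's standard library. The function name is hypothetical, the paths are whatever your install uses (the RPM path mentioned above is only one example), and how the files travel between the two clusters (removable media, an internal file share, and so on) is outside what the snippet covers.

```python
import shutil
from pathlib import Path


def copy_ml_cache(source: Path, destination: Path) -> int:
    """Copy the whole ml_cache tree from source to destination.

    Returns the number of regular files present under destination afterwards.
    """
    if not source.is_dir():
        raise FileNotFoundError(f"ml_cache directory not found: {source}")
    # dirs_exist_ok lets the copy merge into an ml_cache that already exists.
    shutil.copytree(source, destination, dirs_exist_ok=True)
    return sum(1 for path in destination.rglob("*") if path.is_file())
```

File ownership and permissions are not adjusted here; on a packaged install the copied files would presumably need to be readable by the user that runs OpenSearch.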
Hey, following up and merging #2165 into it. So running a node completely without internet is possible.
We are running our server in AWS in a public/private VPC setup. OpenSearch (OS) is in the private one and has no internet access. We register a model group and upload our model into it.
This way OS needs no access at all.
What we see is that even if we set the offline flags for the underlying libraries, they still try to download certain models (every time).
Only after all of these attempts have failed does the whole task fail, and a second later the model switches to deployed.
Quite often on our AWS instance the model gets dropped. We don't know why yet. Sometimes when I trigger a redeploy (our app now kicks off a redeploy whenever the model is gone), it goes through about 10 minutes of failed download attempts and then gets back to deployed.
Sometimes I have to clean the model cache and upload it again.
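The "if the model is gone, kick a redeploy" check could look roughly like the sketch below. It is not the code from our app: the function and constant names are hypothetical, the two callables stand in for whatever client you use to call the get model and deploy model APIs, and the `model_state` field with values such as `DEPLOYED` and `DEPLOYING` is assumed from the ML Commons docs.

```python
from typing import Callable

# States in which we leave the model alone (assumed state names).
HEALTHY_STATES = {"DEPLOYED", "DEPLOYING"}


def ensure_deployed(get_model: Callable[[], dict],
                    redeploy: Callable[[], None]) -> bool:
    """Trigger a redeploy unless the model is deployed or already deploying.

    get_model returns the model document (expected to contain model_state);
    redeploy issues the deploy request. Returns True if a redeploy was started.
    """
    state = get_model().get("model_state")
    if state in HEALTHY_STATES:
        return False
    redeploy()
    return True
```

Treating `DEPLOYING` as healthy avoids stacking a second deploy request on top of one that is still working through its download attempts.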
We are on 2.12 right now. Trying to get to 2.14.
So the use case in general:
The last approach is what we do with vllm. We mount the model into the container, so it won't download it.
Btw, I just registered a local model: I uploaded a zip to the cluster's filesystem and then registered the model with the local file URL. It worked nicely, but as before I am seeing things like:
[2024-06-11T11:26:53,365][WARN ][a.d.h.z.HfModelZoo ] [ip-10-0-150-245.eu-central-1.compute.internal] Failed to download Huggingface model zoo index: NLP.FILL_MASK
[2024-06-11T11:29:04,437][WARN ][a.d.h.z.HfModelZoo ] [ip-10-0-150-245.eu-central-1.compute.internal] Failed to download Huggingface model zoo index: NLP.QUESTION_ANSWER
Only after these downloads have failed does the model get deployed.
11 June 2024 at 13:35 (UTC+2:00) | [2024-06-11T11:35:50,299][INFO ][o.o.m.c.MLSyncUpCron ] [***.compute.internal] Refresh model state: {ZBQEB5ABZYX4tphUswwV=DEPLOYED} | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
---|---|---|---|
11 June 2024 at 13:35 (UTC+2:00) | [2024-06-11T11:35:42,098][ERROR][o.o.m.a.d.TransportDeployModelOnNodeAction] [***.compute.internal] Deploy model task failed: ZRQKB5ABZYX4tphUpwzG | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | org.opensearch.transport.RemoteTransportException: [.compute.internal][][cluster:admin/opensearch/mlinternal/forward] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | Caused by: java.lang.NullPointerException: Cannot invoke "org.opensearch.ml.task.MLTaskCache.getMlTask()" because "mlTaskCache" is null | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.ml.action.forward.TransportForwardAction.doExecute(TransportForwardAction.java:121) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:218) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.indexmanagement.rollup.actionfilter.FieldCapsFilter.apply(FieldCapsFilter.kt:118) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:216) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.indexmanagement.controlcenter.notification.filter.IndexOperationActionFilter.apply(IndexOperationActionFilter.kt:39) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:216) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionFilter.apply(PerformanceAnalyzerActionFilter.java:77) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:216) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.action.support.TransportAction.execute(TransportAction.java:188) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:102) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:98) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor$interceptHandler$1.messageReceived(RollupInterceptor.kt:114) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.performanceanalyzer.transport.PerformanceAnalyzerTransportRequestHandler.messageReceived(PerformanceAnalyzerTransportRequestHandler.java:43) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.transport.TransportService.sendLocalRequest(TransportService.java:1053) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.transport.TransportService$3.sendRequest(TransportService.java:161) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.transport.TransportService.sendRequestInternal(TransportService.java:989) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.transport.TransportService.sendRequestAsync(TransportService.java:1746) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.transport.TransportService.sendRequest(TransportService.java:885) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.transport.TransportService.sendRequest(TransportService.java:844) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.ml.action.deploy.TransportDeployModelOnNodeAction.lambda$createDeployModelNodeResponse$2(TransportDeployModelOnNodeAction.java:167) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.core.action.ActionListener$6.onResponse(ActionListener.java:301) ~[opensearch-core-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.core.action.ActionListener$6.onResponse(ActionListener.java:301) ~[opensearch-core-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.ml.model.MLModelManager.lambda$deployModel$51(MLModelManager.java:1030) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$72(MLModelManager.java:1553) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913) [opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at java.base/java.lang.Thread.run(Thread.java:840) [?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.performanceanalyzer.transport.PerformanceAnalyzerTransportRequestHandler.messageReceived(PerformanceAnalyzerTransportRequestHandler.java:43) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.transport.TransportService.sendLocalRequest(TransportService.java:1053) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.transport.TransportService$3.sendRequest(TransportService.java:161) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.transport.TransportService.sendRequestInternal(TransportService.java:989) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.transport.TransportService.sendRequestAsync(TransportService.java:1746) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.transport.TransportService.sendRequest(TransportService.java:885) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.transport.TransportService.sendRequest(TransportService.java:844) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.ml.action.deploy.TransportDeployModelOnNodeAction.lambda$createDeployModelNodeResponse$2(TransportDeployModelOnNodeAction.java:167) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.core.action.ActionListener$6.onResponse(ActionListener.java:301) ~[opensearch-core-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.core.action.ActionListener$6.onResponse(ActionListener.java:301) ~[opensearch-core-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.ml.model.MLModelManager.lambda$deployModel$51(MLModelManager.java:1030) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) ~[opensearch-core-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.ml.model.MLModelManager.lambda$retrieveModelChunks$72(MLModelManager.java:1553) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.action.support.ThreadedActionListener$1.doRun(ThreadedActionListener.java:78) [opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913) [opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at java.base/java.lang.Thread.run(Thread.java:840) [?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | [2024-06-11T11:35:42,096][ERROR][o.o.m.a.f.TransportForwardAction] [***.compute.internal] Failed to execute forward action DEPLOY_MODEL_DONE | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | java.lang.NullPointerException: Cannot invoke "org.opensearch.ml.task.MLTaskCache.getMlTask()" because "mlTaskCache" is null | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.ml.action.forward.TransportForwardAction.doExecute(TransportForwardAction.java:121) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:218) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.indexmanagement.rollup.actionfilter.FieldCapsFilter.apply(FieldCapsFilter.kt:118) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:216) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.indexmanagement.controlcenter.notification.filter.IndexOperationActionFilter.apply(IndexOperationActionFilter.kt:39) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:216) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionFilter.apply(PerformanceAnalyzerActionFilter.java:77) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:216) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.action.support.TransportAction.execute(TransportAction.java:188) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:102) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:98) ~[opensearch-2.12.0.jar:2.12.0] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor$interceptHandler$1.messageReceived(RollupInterceptor.kt:114) ~[?:?] | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | [2024-06-11T11:35:42,036][INFO ][o.o.m.e.a.DLModel ] [***.compute.internal] Model ZBQEB5ABZYX4tphUswwV is successfully deployed on 1 devices | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | [2024-06-11T11:35:39,057][INFO ][a.d.p.e.PtEngine ] [***.compute.internal] Number of inter-op threads is 1 | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | [2024-06-11T11:35:39,058][INFO ][a.d.p.e.PtEngine ] [***.compute.internal] Number of intra-op threads is 1 | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
11 June 2024 at 13:35 (UTC+2:00) | [2024-06-11T11:35:37,657][WARN ][a.d.h.z.HfModelZoo ] [***.compute.internal] Failed to download Huggingface model zoo index: NLP.TOKEN_CLASSIFICATION | a3b0c66178584897bdc5bd8d170235a9 | ragtime-opensearch |
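For the local-file registration mentioned before the logs, a request body could be assembled roughly as follows. This is a sketch under assumptions, not the exact request used above: the field names (`model_format`, `model_content_hash_value`, `model_config`, `url`) follow the ML Commons register model API as documented, `TORCH_SCRIPT` is only a placeholder format, the helper names are hypothetical, and the cluster presumably also has to permit registration from local files.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model zips do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_register_body(zip_path: Path, name: str, version: str,
                        model_group_id: str, model_config: dict) -> dict:
    """Build a register request body that points at a zip on the local disk."""
    return {
        "name": name,
        "version": version,
        "model_group_id": model_group_id,
        "model_format": "TORCH_SCRIPT",  # placeholder; use your model's format
        "model_content_hash_value": sha256_of_file(zip_path),
        "model_config": model_config,
        # file:// URL of the zip as seen from the node doing the registration
        "url": zip_path.resolve().as_uri(),
    }
```

The resulting dict would be sent as JSON to `POST /_plugins/_ml/models/_register`; the hash lets the plugin verify the zip it reads from disk.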
a3b0c66178584897bdc5bd8d170235a9 ragtime-opensearch 11 June 2024 at 13:35 (UTC+2:00) at org.opensearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:216) ~[opensearch-2.12.0.jar:2.12.0] a3b0c66178584897bdc5bd8d170235a9 ragtime-opensearch 11 June 2024 at 13:35 (UTC+2:00) at org.opensearch.action.support.TransportAction.execute(TransportAction.java:188) ~[opensearch-2.12.0.jar:2.12.0] a3b0c66178584897bdc5bd8d170235a9 ragtime-opensearch 11 June 2024 at 13:35 (UTC+2:00) at org.opensearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:102) ~[opensearch-2.12.0.jar:2.12.0] a3b0c66178584897bdc5bd8d170235a9 ragtime-opensearch 11 June 2024 at 13:35 (UTC+2:00) at org.opensearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:98) ~[opensearch-2.12.0.jar:2.12.0] a3b0c66178584897bdc5bd8d170235a9 ragtime-opensearch 11 June 2024 at 13:35 (UTC+2:00) at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor$interceptHandler$1.messageReceived(RollupInterceptor.kt:114) ~[?:?] 
a3b0c66178584897bdc5bd8d170235a9 ragtime-opensearch 11 June 2024 at 13:35 (UTC+2:00) [2024-06-11T11:35:42,036][INFO ][o.o.m.e.a.DLModel ] [.compute.internal] Model ZBQEB5ABZYX4tphUswwV is successfully deployed on 1 devices [a3b0c66178584897bdc5bd8d170235a9]() ragtime-opensearch 11 June 2024 at 13:35 (UTC+2:00) [2024-06-11T11:35:39,057][INFO ][a.d.p.e.PtEngine ] [.compute.internal] Number of inter-op threads is 1 [a3b0c66178584897bdc5bd8d170235a9]() ragtime-opensearch 11 June 2024 at 13:35 (UTC+2:00) [2024-06-11T11:35:39,058][INFO ][a.d.p.e.PtEngine ] [.compute.internal] Number of intra-op threads is 1 [a3b0c66178584897bdc5bd8d170235a9]() ragtime-opensearch 11 June 2024 at 13:35 (UTC+2:00) [2024-06-11T11:35:37,657][WARN ][a.d.h.z.HfModelZoo ] [.compute.internal] Failed to download Huggingface model zoo index: NLP.TOKEN_CLASSIFICATION [a3b0c66178584897bdc5bd8d170235a9]() ragtime-opensearch`
For clusters in a corporate setting, internet access is often restricted with an egress firewall.
However, the ML commons plugin needs internet access to download dependencies, even when using a local model.
It would be good to improve the user experience in this situation. Some ideas:
I see this behaviour when using the `all-MiniLM-L12-v2` model locally on OpenSearch 2.11.1, using the TorchScript model file and config from the list of pre-trained models, and deploying from a local zip file with the steps from `opensearch-py-ml`'s demo notebook. I have made some suggestions below based on my experience, but I'm not sure whether the ONNX model would have different dependencies than the TorchScript model, or whether other models have different dependencies (e.g., whether `all-mpnet-base-v2` has different dependencies than `all-MiniLM-L12-v2`).
Packaging
When using a local Torch model on a server with restricted internet access, deploying the model fails if the server cannot access `publish.djl.ai`. In the ml-commons code, this URL is referenced by the `pytorch-engine` library.
It might be possible to package a fat jar with dependencies to avoid this issue. This was previously discussed in the OpenSearch forums.
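Until something like that exists, a quick pre-flight check from the OpenSearch node can at least show whether the host is reachable before attempting a deploy. The following is only a rough sketch, not part of ml-commons or opensearch-py-ml: the function name `check_egress` and the choice of port 443 are my own assumptions, and the only hostname taken from this issue is `publish.djl.ai`. It separates a DNS failure from a blocked or dropped connection, since those two cases produce very different errors.

```python
import socket
from typing import Callable, Dict, Iterable


def check_egress(
    hosts: Iterable[str],
    port: int = 443,  # assumption: dependencies are fetched over HTTPS
    timeout: float = 5.0,
    resolve: Callable = socket.gethostbyname,
    connect: Callable = socket.create_connection,
) -> Dict[str, str]:
    """Report, per host, whether DNS resolution and a TCP connect succeed.

    `resolve` and `connect` are injectable so the function can be exercised
    without real network access; by default they use the standard library.
    """
    results: Dict[str, str] = {}
    for host in hosts:
        # Step 1: DNS. A firewall that blocks DNS fails here, with the name visible.
        try:
            address = resolve(host)
        except socket.gaierror as exc:
            results[host] = f"dns-failed: {exc}"
            continue
        # Step 2: TCP connect. A firewall that silently drops packets
        # typically shows up here as a timeout.
        try:
            conn = connect((address, port), timeout=timeout)
        except OSError as exc:
            results[host] = f"connect-failed: {exc}"
            continue
        conn.close()
        results[host] = "ok"
    return results
```

Calling `check_egress(["publish.djl.ai"])` on the node returns a dictionary with one status string per host. Any value other than `"ok"` suggests the deploy would probably hit the same restriction.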
Documentation
It would be useful to document:
Currently, the plugin appears to need network access to the following URLs when deploying, even when using a local model:
`[WARN ][a.d.h.z.HfModelZoo ] [ip-172-31-58-14.ec2.internal] Failed to download Huggingface model zoo index: NLP.FILL_MASK`; not sure if this has consequences later)
Logging
Another way to improve this experience would be to log more information when there is a failure downloading dependencies.
When deploying a local model, if an egress firewall is configured to drop packets to destinations that are not explicitly permitted, the error does not say which destination the plugin was trying to reach, so it is not obvious which address needs to be whitelisted. Here are the OpenSearch logs when deploying a local model under these circumstances:
Under this circumstance, `GET /_plugins/_ml/models/<model-id>` tells us the deploy failed, but does not provide a reason. (Not sure if the task API would provide more info; I couldn't see how to get opensearch-py-ml to give me the task ID.)
Please note, the above assumes that DNS is permitted. If the egress firewall also blocks DNS, the error is more useful and does contain the domain that needs to be whitelisted: