jlibx opened this issue 1 month ago
When an error occurs, should the zip file be retained for easy comparison? Also, the network between the ML node and the data node is currently not very good; could that affect the content of the individual chunks retrieved, and ultimately produce a broken merged zip file?
```java
private void retrieveModelChunks(MLModel mlModelMeta, ActionListener<File> listener) throws InterruptedException {
    String modelId = mlModelMeta.getModelId();
    String modelName = mlModelMeta.getName();
    Integer totalChunks = mlModelMeta.getTotalChunks();
    GetRequest getRequest = new GetRequest();
    getRequest.index(ML_MODEL_INDEX);
    getRequest.id();
    Semaphore semaphore = new Semaphore(1);
    AtomicBoolean stopNow = new AtomicBoolean(false);
    String modelZip = mlEngine.getDeployModelZipPath(modelId, modelName);
    ConcurrentLinkedDeque<File> chunkFiles = new ConcurrentLinkedDeque();
    AtomicInteger retrievedChunks = new AtomicInteger(0);
    for (int i = 0; i < totalChunks; i++) {
        semaphore.tryAcquire(10, TimeUnit.SECONDS);
        if (stopNow.get()) {
            throw new MLException("Failed to deploy model");
        }
        String modelChunkId = this.getModelChunkId(modelId, i);
        int currentChunk = i;
        this.getModel(modelChunkId, threadedActionListener(DEPLOY_THREAD_POOL, ActionListener.wrap(model -> {
            Path chunkPath = mlEngine.getDeployModelChunkPath(modelId, currentChunk);
            FileUtils.write(Base64.getDecoder().decode(model.getContent()), chunkPath.toString());
            chunkFiles.add(new File(chunkPath.toUri()));
            retrievedChunks.getAndIncrement();
            if (retrievedChunks.get() == totalChunks) {
                File modelZipFile = new File(modelZip);
                FileUtils.mergeFiles(chunkFiles, modelZipFile);
                listener.onResponse(modelZipFile);
            }
            semaphore.release();
        }, e -> {
            stopNow.set(true);
            semaphore.release();
            log.error("Failed to retrieve model chunk " + modelChunkId, e);
            if (retrievedChunks.get() == totalChunks - 1) {
                listener.onFailure(new MLResourceNotFoundException("Fail to find model chunk " + modelChunkId));
            }
        })));
    }
}
```
The problem is this line: `semaphore.tryAcquire(10, TimeUnit.SECONDS);`. Its boolean return value is ignored, so on a slow network the acquire times out after 10 seconds and the loop simply continues, firing the next chunk request while the previous one is still in flight. Chunk files are then appended to `chunkFiles` in completion order rather than chunk order, and `FileUtils.mergeFiles` can assemble them in the wrong sequence. I think this is why the code causes chunk confusion when the network is not good.
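A minimal sketch of the fix I have in mind (not the actual ml-commons patch; the class and exception names here are just placeholders): check the result of `tryAcquire` and fail fast instead of racing ahead, so at most one chunk download is ever in flight.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of the gating pattern, not ml-commons code.
public class ChunkGate {
    // Permits only one in-flight chunk request at a time.
    private final Semaphore semaphore = new Semaphore(1);

    // What the current code does: the boolean result of tryAcquire is
    // silently dropped, so after a 10s timeout the caller proceeds anyway
    // and a second chunk download starts while the first is still running.
    void acquireUnsafe() throws InterruptedException {
        semaphore.tryAcquire(10, TimeUnit.SECONDS);
    }

    // Checked variant: abort instead of racing ahead, so chunk files are
    // always written and appended in chunk order.
    void acquireChecked() throws InterruptedException {
        if (!semaphore.tryAcquire(10, TimeUnit.SECONDS)) {
            throw new IllegalStateException("timed out waiting for previous model chunk");
        }
    }

    void release() {
        semaphore.release();
    }
}
```

With the checked variant (or a plain blocking `acquire()`), two chunk downloads can never overlap, so `chunkFiles` keeps chunk order and the merged zip stays intact.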
What operating system are you running this on?
Can you share your exact Operating System name? @jlibx
CentOS Linux release 7.9.2009 (Core), but I run it in Docker; the base image is opensearchproject/opensearch:2.16.0.

```
Client: Docker Engine - Community
 Version:           26.1.4
 API version:       1.45
 Go version:        go1.21.11
 Git commit:        5650f9b
 Built:             Wed Jun  5 11:32:04 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          26.1.4
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.11
  Git commit:       de5c9cf
  Built:            Wed Jun  5 11:31:02 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.33
  GitCommit:        d2d58213f83a351ca8f528a95fbd145f5654e957
 nvidia:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
```
Are you not going to fix this bug? Model deployment now depends entirely on luck. Or should I submit a PR?
> In addition, the current network between the ml node and the data node is not very good

So, each of your nodes runs in a Docker container, and I am guessing from your comment that maybe you have a separate ML node that runs on a host with a GPU while your data node runs on a different host?

> Or should I submit a PR

If you have a fix that works, of course, please submit a PR.
Yes, my data nodes are on other hosts, and the ML node is on a new GPU machine. They are connected through a VPN tunnel, so the probability of a successful deployment is quite low.
Thanks @jlibx for fixing this issue. Have you tested that the issue is gone with the fix?
What is the bug?

```
opensearch-ml-gpu | [2024-08-09T07:17:54,585][DEBUG][o.o.m.e.u.FileUtils      ] [opensearch-ml-gpu] merge 61 files into /usr/share/opensearch/data/ml_cache/models_cache/deploy/HdPgNZEBkGu7typLkQJX/cre_pt_v0_2_0_test2.zip
opensearch-ml-gpu | [2024-08-09T07:17:54,782][DEBUG][o.o.t.TransportService   ] [opensearch-ml-gpu] Action: internal:coordination/fault_detection/leader_check
opensearch-ml-gpu | [2024-08-09T07:17:54,961][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 25%
opensearch-ml-gpu | [2024-08-09T07:17:54,990][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 0%
opensearch-ml-gpu | [2024-08-09T07:17:55,461][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 26%
opensearch-ml-gpu | [2024-08-09T07:17:55,491][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 6%
opensearch-ml-gpu | [2024-08-09T07:17:55,898][DEBUG][o.o.t.TransportService   ] [opensearch-ml-gpu] Action: internal:coordination/fault_detection/leader_check
opensearch-ml-gpu | [2024-08-09T07:17:55,962][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 26%
opensearch-ml-gpu | [2024-08-09T07:17:55,991][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 0%
opensearch-ml-gpu | [2024-08-09T07:17:56,055][ERROR][o.o.m.m.MLModelManager   ] [opensearch-ml-gpu] Model content hash can't match original hash value
opensearch-ml-gpu | [2024-08-09T07:17:56,055][DEBUG][o.o.m.m.MLModelCacheHelper] [opensearch-ml-gpu] removing model HdPgNZEBkGu7typLkQJX from cache
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.m.m.MLModelCacheHelper] [opensearch-ml-gpu] Setting the auto deploying flag for Model HdPgNZEBkGu7typLkQJX
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.m.t.MLTaskManager    ] [opensearch-ml-gpu] remove ML task from cache aYjwNZEBtLXkNmkPz_gQ
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.t.TransportService   ] [opensearch-ml-gpu] Action: cluster:admin/opensearch/mlinternal/forward
opensearch-ml-gpu | [2024-08-09T07:17:56,279][INFO ][o.o.m.a.d.TransportDeployModelOnNodeAction] [opensearch-ml-gpu] deploy model task done aYjwNZEBtLXkNmkPz_gQ
```
How can one reproduce the bug? Steps to reproduce the behavior:
_register

```
POST /_plugins/_ml/models/_register
{
  "name": "cre_pt_v0_2_0_test2",
  "version": "0.2.0",
  "model_format": "TORCH_SCRIPT",
  "function_name": "TEXT_EMBEDDING",
  "description": "huggingface_cre_v0_2_0_snapshot_norm_pt model 2024.4.26",
  "url": "xxx.zip",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 1024,
    "framework_type": "SENTENCE_TRANSFORMERS"
  },
  "model_content_hash_value": "197916cdbbeb40903393a3f74c215a6c4cb7e3201a2e0e826ef2b93728e4bf6b"
}
```
_deploy

```
POST /_plugins/_ml/models/HdPgNZEBkGu7typLkQJX/_deploy
```
result

```
GET /_plugins/_ml/tasks/aYjwNZEBtLXkNmkPz_gQ
```

```json
{
  "model_id": "HdPgNZEBkGu7typLkQJX",
  "task_type": "DEPLOY_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "FAILED",
  "worker_node": ["Tv4342EeTaydOgMRthFtrg"],
  "create_time": 1723186859791,
  "last_update_time": 1723187876166,
  "error": """{"Tv4342EeTaydOgMRthFtrg":"model content changed"}""",
  "is_async": true
}
```
What is your host/environment?
My hash value is completely correct; I computed it with the official method:

```
shasum -a 256 sentence-transformers_paraphrase-mpnet-base-v2-1.0.0-onnx.zip
```

There is no problem when calling _register, but the error above occurred after _deploy.
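Note that _register validates the hash of the original zip, while the deploy-time check hashes the zip that was re-merged from chunks on the ML node. A correct registered hash plus a "Model content hash can't match" failure therefore points at corruption during chunk retrieval or merging, not at a wrong hash. If the merged file is still on disk (it may be cleaned up on failure, which is what my opening question about retaining the zip was getting at), you can hash it yourself and compare. A minimal sketch, assuming the merged-zip path from the DEBUG log above:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

// Hypothetical helper, not part of ml-commons: hash a merged model zip so it
// can be compared with the model_content_hash_value sent at _register time.
public class ZipHashCheck {
    public static void main(String[] args) throws Exception {
        // Default path is an assumption taken from the DEBUG log above;
        // pass your own path as the first argument.
        Path zip = Path.of(args.length > 0
            ? args[0]
            : "/usr/share/opensearch/data/ml_cache/models_cache/deploy/HdPgNZEBkGu7typLkQJX/cre_pt_v0_2_0_test2.zip");
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(zip)) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                sha256.update(buf, 0, n);
            }
        }
        // Should print the same value as `shasum -a 256 <zip>` does.
        System.out.println(HexFormat.of().formatHex(sha256.digest()));
    }
}
```

If this prints a value different from the one that _register accepted, the merged zip on the ML node really was assembled incorrectly, consistent with the semaphore issue described at the top of this thread.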