opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means and linear regression, to help developers build ML-related features within OpenSearch.
Apache License 2.0

[BUG] Model content hash can't match original hash value #2819

Open jlibx opened 1 month ago

jlibx commented 1 month ago

What is the bug?

opensearch-ml-gpu | [2024-08-09T07:17:54,585][DEBUG][o.o.m.e.u.FileUtils] [opensearch-ml-gpu] merge 61 files into /usr/share/opensearch/data/ml_cache/models_cache/deploy/HdPgNZEBkGu7typLkQJX/cre_pt_v0_2_0_test2.zip
opensearch-ml-gpu | [2024-08-09T07:17:54,782][DEBUG][o.o.t.TransportService] [opensearch-ml-gpu] Action: internal:coordination/fault_detection/leader_check
opensearch-ml-gpu | [2024-08-09T07:17:54,961][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 25%
opensearch-ml-gpu | [2024-08-09T07:17:54,990][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 0%
opensearch-ml-gpu | [2024-08-09T07:17:55,461][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 26%
opensearch-ml-gpu | [2024-08-09T07:17:55,491][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 6%
opensearch-ml-gpu | [2024-08-09T07:17:55,898][DEBUG][o.o.t.TransportService] [opensearch-ml-gpu] Action: internal:coordination/fault_detection/leader_check
opensearch-ml-gpu | [2024-08-09T07:17:55,962][DEBUG][o.o.n.r.t.AverageMemoryUsageTracker] [opensearch-ml-gpu] Recording memory usage: 26%
opensearch-ml-gpu | [2024-08-09T07:17:55,991][DEBUG][o.o.n.r.t.AverageCpuUsageTracker] [opensearch-ml-gpu] Recording cpu usage: 0%
opensearch-ml-gpu | [2024-08-09T07:17:56,055][ERROR][o.o.m.m.MLModelManager] [opensearch-ml-gpu] Model content hash can't match original hash value
opensearch-ml-gpu | [2024-08-09T07:17:56,055][DEBUG][o.o.m.m.MLModelCacheHelper] [opensearch-ml-gpu] removing model HdPgNZEBkGu7typLkQJX from cache
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.m.m.MLModelCacheHelper] [opensearch-ml-gpu] Setting the auto deploying flag for Model HdPgNZEBkGu7typLkQJX
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.m.t.MLTaskManager] [opensearch-ml-gpu] remove ML task from cache aYjwNZEBtLXkNmkPz_gQ
opensearch-ml-gpu | [2024-08-09T07:17:56,148][DEBUG][o.o.t.TransportService] [opensearch-ml-gpu] Action: cluster:admin/opensearch/mlinternal/forward
opensearch-ml-gpu | [2024-08-09T07:17:56,279][INFO ][o.o.m.a.d.TransportDeployModelOnNodeAction] [opensearch-ml-gpu] deploy model task done aYjwNZEBtLXkNmkPz_gQ

How can one reproduce the bug? Steps to reproduce the behavior:

  1. _register

     POST /_plugins/_ml/models/_register
     {
       "name": "cre_pt_v0_2_0_test2",
       "version": "0.2.0",
       "model_format": "TORCH_SCRIPT",
       "function_name": "TEXT_EMBEDDING",
       "description": "huggingface_cre_v0_2_0_snapshot_norm_pt model 2024.4.26",
       "url": "xxx.zip",
       "model_config": {
         "model_type": "bert",
         "embedding_dimension": 1024,
         "framework_type": "SENTENCE_TRANSFORMERS"
       },
       "model_content_hash_value": "197916cdbbeb40903393a3f74c215a6c4cb7e3201a2e0e826ef2b93728e4bf6b"
     }

  2. _deploy POST /_plugins/_ml/models/HdPgNZEBkGu7typLkQJX/_deploy

  3. result GET /_plugins/_ml/tasks/aYjwNZEBtLXkNmkPz_gQ

{ "model_id": "HdPgNZEBkGu7typLkQJX", "task_type": "DEPLOY_MODEL", "function_name": "TEXT_EMBEDDING", "state": "FAILED", "worker_node": [ "Tv4342EeTaydOgMRthFtrg" ], "create_time": 1723186859791, "last_update_time": 1723187876166, "error": """{"Tv4342EeTaydOgMRthFtrg":"model content changed"}""", "is_async": true }

What is your host/environment?

My hash value is completely correct; it was computed with the official method, shasum -a 256 sentence-transformers_paraphrase-mpnet-base-v2-1.0.0-onnx.zip. There is no problem when calling _register, but the error above occurred after _deploy.

jlibx commented 1 month ago

When an error like this occurs, could the zip file be retained so it is easier to compare against the original? Also, the network between the ML node and the data node is currently not very good. Could that affect the content of the individual chunks that are retrieved, and ultimately corrupt the merged zip file?
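For anyone who wants to check a retained merged zip by hand, here is a minimal, self-contained sketch (not part of the plugin) that computes the SHA-256 of a file so it can be compared with the model_content_hash_value supplied at _register time. The path below is only an example taken from the log above.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

public final class ZipHashCheck {
    public static void main(String[] args) throws Exception {
        // Example path from the deploy log above; pass a different path as the first argument.
        Path zip = Path.of(args.length > 0
            ? args[0]
            : "/usr/share/opensearch/data/ml_cache/models_cache/deploy/HdPgNZEBkGu7typLkQJX/cre_pt_v0_2_0_test2.zip");
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(zip)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        // Should match the model_content_hash_value used at _register time.
        System.out.println(hex);
    }
}

The same value can be produced on the command line with shasum -a 256 on the file.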

jlibx commented 1 month ago
private void retrieveModelChunks(MLModel mlModelMeta, ActionListener<File> listener) throws InterruptedException {
        String modelId = mlModelMeta.getModelId();
        String modelName = mlModelMeta.getName();
        Integer totalChunks = mlModelMeta.getTotalChunks();
        GetRequest getRequest = new GetRequest();
        getRequest.index(ML_MODEL_INDEX);
        getRequest.id();
        Semaphore semaphore = new Semaphore(1);
        AtomicBoolean stopNow = new AtomicBoolean(false);
        String modelZip = mlEngine.getDeployModelZipPath(modelId, modelName);
        ConcurrentLinkedDeque<File> chunkFiles = new ConcurrentLinkedDeque();
        AtomicInteger retrievedChunks = new AtomicInteger(0);
        for (int i = 0; i < totalChunks; i++) {
            semaphore.tryAcquire(10, TimeUnit.SECONDS);
            if (stopNow.get()) {
                throw new MLException("Failed to deploy model");
            }
            String modelChunkId = this.getModelChunkId(modelId, i);
            int currentChunk = i;
            this.getModel(modelChunkId, threadedActionListener(DEPLOY_THREAD_POOL, ActionListener.wrap(model -> {
                Path chunkPath = mlEngine.getDeployModelChunkPath(modelId, currentChunk);
                FileUtils.write(Base64.getDecoder().decode(model.getContent()), chunkPath.toString());
                chunkFiles.add(new File(chunkPath.toUri()));
                retrievedChunks.getAndIncrement();
                if (retrievedChunks.get() == totalChunks) {
                    File modelZipFile = new File(modelZip);
                    FileUtils.mergeFiles(chunkFiles, modelZipFile);
                    listener.onResponse(modelZipFile);
                }
                semaphore.release();
            }, e -> {
                stopNow.set(true);
                semaphore.release();
                log.error("Failed to retrieve model chunk " + modelChunkId, e);
                if (retrievedChunks.get() == totalChunks - 1) {
                    listener.onFailure(new MLResourceNotFoundException("Fail to find model chunk " + modelChunkId));
                }
            })));
        }
    }

semaphore.tryAcquire(10, TimeUnit.SECONDS); I think this line causes chunk confusion when the network is poor: the boolean return value is ignored, so if a chunk is not retrieved and written within 10 seconds, the loop simply moves on to the next chunk anyway, and the merged zip can end up with missing or out-of-order chunks, which makes the hash check fail.
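If it helps, here is a small, self-contained sketch (not a patch against ml-commons, just an illustration of the pattern) showing how the acquire step could fail fast instead of silently continuing when a chunk does not arrive within the timeout:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Standalone sketch: the boolean returned by tryAcquire is checked, so a slow
// "chunk fetch" aborts the loop instead of letting the next iteration start
// while the previous chunk is still in flight.
public final class ChunkFetchSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Semaphore semaphore = new Semaphore(1);
        AtomicBoolean stopNow = new AtomicBoolean(false);
        int totalChunks = 5;
        try {
            for (int i = 0; i < totalChunks; i++) {
                // Abort when the previous fetch did not release the permit in time.
                if (!semaphore.tryAcquire(10, TimeUnit.SECONDS) || stopNow.get()) {
                    throw new IllegalStateException("Failed to retrieve chunk " + i + " in time");
                }
                int currentChunk = i;
                pool.submit(() -> {
                    try {
                        // Simulate downloading and writing chunk `currentChunk` to disk.
                        TimeUnit.MILLISECONDS.sleep(100);
                        System.out.println("fetched chunk " + currentChunk);
                    } catch (InterruptedException e) {
                        stopNow.set(true);
                        Thread.currentThread().interrupt();
                    } finally {
                        semaphore.release();
                    }
                });
            }
        } finally {
            pool.shutdown();
        }
    }
}

A fix inside retrieveModelChunks itself would also want to make sure the chunks are merged in their original order, since chunkFiles is filled in completion order.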

austintlee commented 1 month ago

What operating system are you running this on?

Zhangxunmt commented 1 month ago

Can you share your exact operating system name? @jlibx

jlibx commented 1 month ago

CentOS Linux release 7.9.2009 (Core). But I run it in Docker; the base image is opensearchproject/opensearch:2.16.0.

Client: Docker Engine - Community
 Version:      26.1.4
 API version:  1.45
 Go version:   go1.21.11
 Git commit:   5650f9b
 Built:        Wed Jun 5 11:32:04 2024
 OS/Arch:      linux/amd64
 Context:      default

Server: Docker Engine - Community
 Engine:
  Version:      26.1.4
  API version:  1.45 (minimum version 1.24)
  Go version:   go1.21.11
  Git commit:   de5c9cf
  Built:        Wed Jun 5 11:31:02 2024
  OS/Arch:      linux/amd64
  Experimental: false
 containerd:
  Version:      1.6.33
  GitCommit:    d2d58213f83a351ca8f528a95fbd145f5654e957
 nvidia:
  Version:      1.1.12
  GitCommit:    v1.1.12-0-g51d5e94
 docker-init:
  Version:      0.19.0
  GitCommit:    de40ad0

jlibx commented 1 month ago

(screenshot attached)

jlibx commented 3 weeks ago

Are you not going to fix this bug? Model deployment now depends entirely on luck. Or should I submit a PR?

austintlee commented 2 weeks ago

> In addition, the current network between the ml node and the data node is not very good

So, each of your nodes runs in a Docker container and I am guessing from your comment that maybe you have a separate ML node that runs on a host with a GPU and your data node runs on a different host?

> Or should I submit a PR

If you have a fix that works, of course, please submit a PR.

jlibx commented 2 weeks ago

> > In addition, the current network between the ml node and the data node is not very good

> So, each of your nodes runs in a Docker container and I am guessing from your comment that maybe you have a separate ML node that runs on a host with a GPU and your data node runs on a different host?

> > Or should I submit a PR

> If you have a fix that works, of course, please submit a PR.

Yes, my data nodes are on other hosts, and the ML node is on a new GPU machine. They are connected through a VPN tunnel, so the probability of a successful deployment is quite low.

ylwu-amzn commented 1 week ago

Thanks @jlibx for fixing this issue. Have you tested that the issue is gone with the fix?