triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Can't Load Models Using Distributed MinIO #2403

dpsommer closed 3 years ago

dpsommer commented 3 years ago

Description

Shortly after upload, Triton will fail to load models from distributed MinIO.

While I believe the fundamental error lies with MinIO or the underlying S3 implementation, Triton could work around the issue by consistently parsing object paths so as not to make requests with leading / characters in the prefix parameter.
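A minimal sketch of the kind of normalization suggested above, written in Python for brevity rather than in Triton's own C++. The function names `split_s3_url` and `list_prefix` are hypothetical and are not part of Triton or any AWS SDK. It assumes the `s3://host:port/bucket/path` URL form used in this report.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_url(url: str) -> Tuple[str, str, str]:
    """Split s3://host:port/bucket/path into (endpoint, bucket, key).

    Assumes the host:port form used in this report; the returned key
    never starts or ends with '/'.
    """
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"not an s3:// URL: {url}")
    # Dropping empty segments removes leading, trailing and doubled slashes.
    segments = [s for s in parsed.path.split("/") if s]
    if not segments:
        raise ValueError(f"no bucket in URL: {url}")
    return parsed.netloc, segments[0], "/".join(segments[1:])


def list_prefix(key: str) -> str:
    """Build a ListObjects prefix for a 'directory' key.

    The result never starts with '/' and ends with exactly one '/',
    except for the bucket root, which maps to the empty prefix.
    """
    cleaned = "/".join(s for s in key.split("/") if s)
    return cleaned + "/" if cleaned else ""


endpoint, bucket, key = split_s3_url("s3://localhost:9000/testbucket/googlenet")
print(endpoint, bucket, list_prefix(key))
```

With this approach every listing request for the model directory would use the prefix `googlenet/`, whichever way the path was assembled internally.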

Link to the MinIO issue: https://github.com/minio/minio/issues/11265

Triton Information

What version of Triton are you using?
20.12

Are you using the Triton container or did you build it yourself?
Triton container - tritonserver:20.12-py3

Steps to Reproduce

Set up a distributed MinIO cluster (detailed steps can be found here):

$ docker-compose pull
Pulling minio1 ... done
Pulling minio2 ... done
Pulling minio3 ... done
Pulling minio4 ... done
Pulling nginx  ... done
$ docker-compose up
Creating network "minio_default" with the default driver
Creating volume "minio_data1-1" with default driver
Creating volume "minio_data1-2" with default driver
Creating volume "minio_data2-1" with default driver
Creating volume "minio_data2-2" with default driver
Creating volume "minio_data3-1" with default driver
Creating volume "minio_data3-2" with default driver
Creating volume "minio_data4-1" with default driver
Creating volume "minio_data4-2" with default driver
Creating minio_minio3_1 ... done
Creating minio_minio4_1 ... done
Creating minio_minio2_1 ... done
Creating minio_minio1_1 ... done
Creating minio_nginx_1  ... done
Attaching to minio_minio2_1, minio_minio1_1, minio_minio4_1, minio_minio3_1, minio_nginx_1

Create a bucket (here using the MinIO mc CLI):

$ mc alias set localhost http://localhost:9000 minio minio123
$ mc mb localhost/testbucket

Upload any model (I used GoogleNet from the ONNX model repository in testing), laid out as follows:

testbucket
    /googlenet
        config.pbtxt
        /1
            model.onnx

config.pbtxt

name: "googlenet"
platform: "onnxruntime_onnx"
input [
  {
    name: "data_0"
    data_type: TYPE_FP32
    dims: [ 1, 3, 224, 224 ]
  }
]
output [
  {
    name: "prob_1"
    data_type: TYPE_FP32
    dims: [ 1, 1000 ]
  }
]
version_policy: { all { }}

Upload to the bucket:

$ mc cp -r googlenet localhost/testbucket

Wait a few minutes (my current assumption is that the failure is triggered by MinIO cluster sync/replication).

Note that running Triton immediately after upload will often load the model successfully.

Run Triton:

$ tritonserver --model-repository=s3://localhost:9000/testbucket

Triton fails with the following error:

E0111 16:01:38.440724 221 model_repository_manager.cc:145] Failed to determine modification time for 's3://localhost:9000/testbucket/googlenet': Internal: Failed to get modification time for object at s3://localhost:9000/testbucket/googlenet
E0111 16:01:38.463213 221 model_repository_manager.cc:1189] failed to load model 'googlenet': at least one version must be available under the version policy of model 'googlenet'

On the MinIO side, following the trace during startup shows that Triton makes the following requests:

16:01:38.396 [200 OK] s3.HeadBucket localhost:9000/testbucket 172.21.0.7        2.239ms      ↑ 174 B ↓ 218 B
16:01:38.400 [200 OK] s3.ListObjectsV1 localhost:9000/testbucket?prefix=  172.21.0.7        4.7ms        ↑ 208 B ↓ 1.1 KiB
16:01:38.400 [200 OK] s3.ListObjectsV1 localhost:9000/testbucket?prefix=  172.21.0.7        5.727ms      ↑ 174 B ↓ 1.2 KiB
16:01:38.407 [200 OK] s3.HeadBucket localhost:9000/testbucket 172.21.0.7        1.037ms      ↑ 174 B ↓ 218 B
16:01:38.409 [200 OK] s3.ListObjectsV1 localhost:9000/testbucket?prefix=googlenet%2F  172.21.0.7        5.651ms      ↑ 174 B ↓ 1.2 KiB
16:01:38.416 [200 OK] s3.HeadBucket localhost:9000/testbucket 172.21.0.7        1.106ms      ↑ 174 B ↓ 218 B
16:01:38.419 [200 OK] s3.ListObjectsV1 localhost:9000/testbucket?prefix=%2Fgooglenet%2F  172.21.0.7        4.677ms      ↑ 208 B ↓ 520 B
16:01:38.418 [200 OK] s3.ListObjectsV1 localhost:9000/testbucket?prefix=%2Fgooglenet%2F  172.21.0.7        5.261ms      ↑ 174 B ↓ 556 B
16:01:38.425 [200 OK] s3.HeadBucket localhost:9000/testbucket 172.21.0.7        1.164ms      ↑ 174 B ↓ 218 B
16:01:38.427 [200 OK] s3.ListObjectsV1 localhost:9000/testbucket?prefix=%2Fgooglenet%2F  172.21.0.7        5.877ms      ↑ 174 B ↓ 520 B
16:01:38.436 [404 Not Found] s3.HeadObject localhost:9000/testbucket/googlenet 172.21.0.7        3.622ms      ↑ 174 B ↓ 225 B
16:01:38.441 [200 OK] s3.HeadObject localhost:9000/testbucket/googlenet/config.pbtxt 172.21.0.7        3.433ms      ↑ 174 B ↓ 345 B
16:01:38.446 [200 OK] s3.HeadObject localhost:9000/testbucket/googlenet/config.pbtxt 172.21.0.7        3.717ms      ↑ 174 B ↓ 345 B
16:01:38.451 [200 OK] s3.GetObject localhost:9000/testbucket/googlenet/config.pbtxt 172.21.0.7        4.152ms      ↑ 174 B ↓ 595 B
16:01:38.458 [200 OK] s3.ListObjectsV1 localhost:9000/testbucket?prefix=%2Fgooglenet%2F  172.21.0.7        4.611ms      ↑ 208 B ↓ 520 B
16:01:38.457 [200 OK] s3.ListObjectsV1 localhost:9000/testbucket?prefix=%2Fgooglenet%2F  172.21.0.7        5.21ms       ↑ 174 B ↓ 556 B
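The pattern in this trace can be checked mechanically. The following is a rough sketch, assuming trace lines in exactly the format shown above; `leading_slash_prefixes` is a hypothetical helper name, not part of the mc CLI or MinIO.

```python
import re
from urllib.parse import unquote

# Captures the raw (still percent-encoded) prefix= value of a
# ListObjectsV1 trace line; the value may be empty.
LIST_RE = re.compile(r"s3\.ListObjectsV1\s+\S+\?prefix=(\S*)")


def leading_slash_prefixes(trace_lines):
    """Return the decoded prefixes that start with '/'."""
    found = []
    for line in trace_lines:
        match = LIST_RE.search(line)
        if match is None:
            continue  # not a ListObjectsV1 request
        prefix = unquote(match.group(1))
        if prefix.startswith("/"):
            found.append(prefix)
    return found


sample = [
    "16:01:38.400 [200 OK] s3.ListObjectsV1 localhost:9000/testbucket?prefix=  172.21.0.7",
    "16:01:38.409 [200 OK] s3.ListObjectsV1 localhost:9000/testbucket?prefix=googlenet%2F  172.21.0.7",
    "16:01:38.419 [200 OK] s3.ListObjectsV1 localhost:9000/testbucket?prefix=%2Fgooglenet%2F  172.21.0.7",
]
print(leading_slash_prefixes(sample))
```

Only the third sample line is reported, because `%2Fgooglenet%2F` decodes to `/googlenet/`.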

The requests with a prefix value containing a leading slash return empty result sets:

minio4 [REQUEST s3.ListObjectsV1] 16:03:27.910
minio4 GET /testbucket?prefix=%2Fgooglenet%2F
minio4 Proto: HTTP/1.1
minio4 Host: localhost:9000
minio4 Content-Type: application/xml
minio4 Http2-Settings: AAMAAABkAARAAAAAAAIAAAAA
minio4 User-Agent: aws-sdk-cpp/1.7.129 Linux/4.19.76-linuxkit x86_64 GCC/9.3.0
minio4 X-Amz-Api-Version: 2006-03-01
minio4 X-Amz-Content-Sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
minio4 X-Amz-Date: 20210111T160327Z
minio4 X-Forwarded-For: 172.21.0.7
minio4 X-Forwarded-Proto: http
minio4 Accept: */*
minio4 Authorization: AWS4-HMAC-SHA256 Credential=minio/20210111/us-east-1/s3/aws4_request, SignedHeaders=content-type;host;x-amz-api-version;x-amz-content-sha256;x-amz-date, Signature=6144af315da19643293b6888155f13732c4156a63e3d062b2b92479e56fa6c09
minio4 Content-Length: 0
minio4 X-Real-Ip: 172.21.0.7
minio4 
minio4 [RESPONSE] [16:03:27.917] [ Duration 7.146ms  ↑ 174 B  ↓ 556 B ]
minio4 200 OK
minio4 Vary: Origin
minio4 Content-Length: 270
minio4 Content-Security-Policy: block-all-mixed-content
minio4 Content-Type: application/xml
minio4 Date: Mon, 11 Jan 2021 16:03:27 GMT
minio4 Server: MinIO/RELEASE.2020-11-13T20-10-18Z
minio4 X-Amz-Request-Id: 165938FE9992F230
minio4 X-Xss-Protection: 1; mode=block
minio4 Accept-Ranges: bytes
minio4 <?xml version="1.0" encoding="UTF-8"?>
<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Name>testbucket</Name><Prefix>/googlenet/</Prefix><Marker></Marker><MaxKeys>4500</MaxKeys><Delimiter></Delimiter><IsTruncated>false</IsTruncated></ListBucketResult>

This causes Triton to fail to find the model and error out.
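To make the empty result explicit, here is a small sketch that parses a ListBucketResult body such as the one above and reports the echoed prefix and the keys it contains. It uses only the Python standard library; `summarize_listing` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the ListBucketResult element in the response above.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def summarize_listing(xml_body: str):
    """Return (prefix, keys) from a ListBucketResult XML document."""
    root = ET.fromstring(xml_body.encode("utf-8"))
    prefix = root.findtext("s3:Prefix", default="", namespaces=S3_NS)
    # Each returned object appears as a <Contents> element with a <Key> child.
    keys = [
        contents.findtext("s3:Key", default="", namespaces=S3_NS)
        for contents in root.findall("s3:Contents", S3_NS)
    ]
    return prefix, keys


body = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">'
    "<Name>testbucket</Name><Prefix>/googlenet/</Prefix><Marker></Marker>"
    "<MaxKeys>4500</MaxKeys><Delimiter></Delimiter>"
    "<IsTruncated>false</IsTruncated></ListBucketResult>"
)
prefix, keys = summarize_listing(body)
print(prefix, keys)
```

For the response captured above, the key list is empty: the body has no `Contents` elements, even though objects exist under `googlenet/`.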

Expected Behavior

Triton starts up without error when run against a distributed MinIO cluster containing valid models and model configuration.

CoderHam commented 3 years ago

@dpsommer Can you repeat the same experiment with polling turned off?

dpsommer commented 3 years ago

@CoderHam with --model-control-mode=none I get the same error, and with --model-control-mode=explicit --load-model=googlenet Triton starts but can't load the model (No model version was found).

CoderHam commented 3 years ago

@dpsommer When you use a single MinIO instance, do you not see this issue? We might need to add some additional handling for distributed MinIO S3.

dpsommer commented 3 years ago

@CoderHam Single-instance MinIO has worked fine so far

AlexanderEkdahl commented 3 years ago

MinIO has since been updated to consistently handle a leading slash in the request (https://github.com/minio/minio/pull/11268).

As far as I understand, requests to S3 keys with a leading slash do not make sense, since no key in S3 can have a leading slash. Other S3 clients handle this by cleaning the path before sending the request to AWS (aws-sdk-go), and I believe Triton should be doing the same.

This issue is preventing Triton from being used with on-prem storage solutions such as Dell EMC and MinIO.

CoderHam commented 3 years ago

@dpsommer To clarify, this ticket was closed because @AlexanderEkdahl's feature request regarding handling extra slashes was resolved. We still need to test and verify support for multiple S3 repositories in the same instance of tritonserver.

SSJDiVaD commented 3 years ago

Hi @CoderHam,

I'm a colleague of @AlexanderEkdahl and @dpsommer. I just tested this problem using (mostly) @dpsommer's procedure in the newest Triton image (21.02) and I found a similar (but slightly different) error message.

The old error message was:

NVIDIA Release 20.12 (build 18156940)
...
E0309 18:49:15.852366 1 model_repository_manager.cc:145] Failed to determine modification time for 's3://nginx:9000/testbucket/googlenet': Internal: Failed to get modification time for object at s3://nginx:9000/testbucket/googlenet
E0309 18:49:15.883318 1 model_repository_manager.cc:1189] failed to load model 'googlenet': at least one version must be available under the version policy of model 'googlenet'

And the new one is:

NVIDIA Release 21.02 (build 20174689)
...
E0309 18:50:26.199354 1 model_repository_manager.cc:145] Failed to determine modification time for 's3://nginx:9000/testbucket/googlenet': Internal: Failed to get modification time for object at s3://nginx:9000/testbucket/googlenet
I0309 18:50:26.264021 1 model_repository_manager.cc:787] loading: googlenet:1
E0309 18:50:26.382249 1 model_repository_manager.cc:963] failed to load 'googlenet' version 1: Internal: directory does not exist at s3://nginx:9000/testbucket/googlenet

Can I ask you to re-open this ticket? Thank you!

CoderHam commented 3 years ago

@dpsommer @SSJDiVaD can you share your email addresses with me so I can share the latest master build for you to test on? It contains verbose logging for S3 Storage and could help you understand the cause of the issue better.

SSJDiVaD commented 3 years ago

Thanks CoderHam! I'll be the one working on this issue from our end. My email address is david.szeto@borealisai.com

CoderHam commented 3 years ago

@SSJDiVaD did you get a chance to test the docker image? I am waiting on you to share the verbose log errors you see to be able to triage this issue.

CoderHam commented 3 years ago

We have a couple of fixes (#2879 and #2880) that should solve your issue if it isn't resolved already. Feel free to try a build with these patches and get back to us with your feedback. Closing this ticket due to inactivity. Please let us know if you still face the problem.