Closed (dpsommer closed 3 years ago)
@dpsommer Can you repeat the same experiment with polling turned off?
@CoderHam With --model-control-mode=none I get the same error, and with --model-control-mode=explicit --load-model=googlenet Triton starts but can't load the model (No model version was found).
@dpsommer When you use a single MinIO instance, do you not see this issue? We might need to add some additional handling for distributed MinIO S3.
@CoderHam Single-instance MinIO has worked fine so far
MinIO has since been updated to consistently handle a leading slash in the request (https://github.com/minio/minio/pull/11268).
As far as I understand, requests for S3 keys with a leading slash do not make sense, since no key in S3 can have a leading slash. Other S3 clients handle this by cleaning the path before sending the request to AWS (aws-sdk-go), and I believe Triton should be doing the same.
This issue is preventing Triton from being used with on-prem storage solutions such as Dell EMC and MinIO.
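As an illustration of the path cleaning suggested above, here is a minimal Python sketch. It is not Triton's or aws-sdk-go's actual implementation; the helper names `clean_s3_key` and `split_s3_url` are hypothetical, and the sketch assumes the `s3://host:port/bucket/key` form that appears in the logs in this thread.

```python
import re
from typing import Tuple


def clean_s3_key(key: str) -> str:
    """Collapse repeated slashes and drop any leading slash from a key or prefix.

    A trailing slash is kept, since it is meaningful when the value is used
    as a listing prefix.
    """
    return re.sub(r"/{2,}", "/", key).lstrip("/")


def split_s3_url(url: str) -> Tuple[str, str, str]:
    """Split 's3://host:port/bucket/key' into (endpoint, bucket, cleaned key).

    Assumption: the endpoint (host:port) always comes first, as in this
    thread. Plain 's3://bucket/key' URLs are not handled by this sketch.
    """
    if not url.startswith("s3://"):
        raise ValueError(f"not an S3 URL: {url!r}")
    endpoint, _, rest = url[len("s3://"):].partition("/")
    bucket, _, key = rest.lstrip("/").partition("/")
    return endpoint, bucket, clean_s3_key(key)


print(split_s3_url("s3://nginx:9000/testbucket//googlenet/"))
```

With cleaning applied at this one point, every later list or get request would be built from a key that never starts with a slash, regardless of how the repository URL was written.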
@dpsommer To clarify, this ticket was closed because @AlexanderEkdahl's feature request regarding handling extra slashes was resolved. We still need to test and verify support for multiple S3 repositories in the same instance of tritonserver.
Hi @CoderHam,
I'm a colleague of @AlexanderEkdahl and @dpsommer. I just tested this problem using (mostly) @dpsommer's procedure in the newest Triton image (21.02) and found a similar but slightly different error message.
The old error message was:
NVIDIA Release 20.12 (build 18156940)
...
E0309 18:49:15.852366 1 model_repository_manager.cc:145] Failed to determine modification time for 's3://nginx:9000/testbucket/googlenet': Internal: Failed to get modification time for object at s3://nginx:9000/testbucket/googlenet
E0309 18:49:15.883318 1 model_repository_manager.cc:1189] failed to load model 'googlenet': at least one version must be available under the version policy of model 'googlenet'
And the new one is:
NVIDIA Release 21.02 (build 20174689)
...
E0309 18:50:26.199354 1 model_repository_manager.cc:145] Failed to determine modification time for 's3://nginx:9000/testbucket/googlenet': Internal: Failed to get modification time for object at s3://nginx:9000/testbucket/googlenet
I0309 18:50:26.264021 1 model_repository_manager.cc:787] loading: googlenet:1
E0309 18:50:26.382249 1 model_repository_manager.cc:963] failed to load 'googlenet' version 1: Internal: directory does not exist at s3://nginx:9000/testbucket/googlenet
Can I ask you to re-open this ticket? Thank you!
@dpsommer @SSJDiVaD can you share your email addresses with me so I can share the latest master build for you to test on? It contains verbose logging for S3 Storage and could help you understand the cause of the issue better.
Thanks CoderHam! I'll be the one working on this issue from our end. My email address is david.szeto@borealisai.com
@SSJDiVaD did you get a chance to test the docker image? I am waiting on you to share the verbose log errors you see to be able to triage this issue.
We have a couple of fixes (#2879 and #2880) that should solve your issue if it isn't solved already. Feel free to try out a build with these patches and get back to us with your feedback. Closing this ticket due to inactivity. Please let us know if you still face the problem.
Description
Shortly after upload, Triton will fail to load models from distributed MinIO.
While I believe the fundamental error lies with MinIO or the underlying S3 implementation, Triton could work around the issue by consistently parsing object paths so as not to make requests with leading / characters in the prefix parameter.
Link to the MinIO issue: https://github.com/minio/minio/issues/11265
Triton Information
What version of Triton are you using?
20.12
Are you using the Triton container or did you build it yourself?
Triton container - tritonserver:20.12-py3
Steps to Reproduce
Set up a distributed MinIO cluster (detailed steps can be found here):
Create a bucket (here using the MinIO mc CLI):
Upload any model (GoogleNet from the ONNX model repository was used in testing)
config.pbtxt
Upload to the bucket:
Wait a few minutes (current assumption is that this is due to MinIO cluster sync/replication).
Run Triton:
Triton fails with the following error:
On the MinIO side, following the trace during startup shows that Triton makes the following requests:
The requests with a prefix value containing a leading slash return empty result sets:
This causes Triton to fail to find the model and error out.
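To illustrate why a leading slash in the prefix yields an empty listing, here is a small in-memory Python sketch. It assumes the object store filters keys by literal string prefix, which is how S3 prefix listing is generally described. It does not model the single-instance versus distributed MinIO difference reported in this issue. The `list_objects` helper and the object keys are hypothetical, chosen to resemble a Triton model repository layout.

```python
from typing import List


def list_objects(keys: List[str], prefix: str) -> List[str]:
    """Mimic prefix-filtered listing: keep keys that start with the prefix."""
    return [k for k in keys if k.startswith(prefix)]


# Assumed bucket contents; stored keys never begin with a slash.
bucket_keys = [
    "googlenet/config.pbtxt",
    "googlenet/1/model.onnx",
]

# A prefix with a leading slash cannot match any stored key.
with_slash = list_objects(bucket_keys, "/googlenet/")

# The same prefix without the leading slash matches both objects.
without_slash = list_objects(bucket_keys, "googlenet/")

print(with_slash)
print(without_slash)
```

Under that assumption, the first call returns an empty list and the second returns both keys, which matches the empty result sets seen in the MinIO trace.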
Expected Behavior
Triton starts up without error when run against a distributed MinIO cluster containing valid models and model configuration.