pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Handling of subsequent RegisterModel calls to Management gRPC endpoint with same model & version #3199

Open mihaidusmanu opened 1 week ago

mihaidusmanu commented 1 week ago

🐛 Describe the bug

Subsequent (or concurrent) RegisterModel calls to the Management gRPC endpoint with the same model and version raise a ConflictStatusException that is not handled. The gRPC client request fails with status UNKNOWN (and an empty message) instead of, e.g., ALREADY_EXISTS along with a human-readable message such as "Same model and version is already registered", which would allow this case to be handled gracefully in multi-threaded / multi-client setups.
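To illustrate the "graceful handling" meant above, here is a minimal client-side sketch. It assumes the server has been changed to return ALREADY_EXISTS, which it currently does not (today the status is UNKNOWN). `register_idempotently` and `register_call` are hypothetical names, not part of TorchServe or its gRPC client script. The status is compared by name so that the sketch does not need grpcio installed.

```python
def register_idempotently(register_call):
    """Run register_call(); treat an ALREADY_EXISTS failure as success.

    register_call is any zero-argument callable that issues the
    RegisterModel RPC (hypothetical wrapper around a gRPC stub call).
    """
    try:
        register_call()
        return "registered"
    except Exception as err:
        # Errors raised by grpc stub calls expose code(), which returns a
        # grpc.StatusCode enum member; its .name is e.g. "ALREADY_EXISTS".
        code = getattr(err, "code", None)
        status = code() if callable(code) else None
        if getattr(status, "name", None) == "ALREADY_EXISTS":
            # Another thread or client registered the same model and version.
            return "already-registered"
        # Anything else (including today's UNKNOWN) is still a real failure.
        raise
```

With the current behavior, the duplicate call takes the final `raise` branch, because UNKNOWN with an empty message cannot be told apart from any other server-side failure.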

Error logs

2024-06-20T16:05:54,602 [DEBUG] grpc-default-executor-0 org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model densenet161
Jun 20, 2024 4:05:54 PM io.grpc.internal.SerializingExecutor run
SEVERE: Exception while executing runnable io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed@a7a4113
org.pytorch.serve.http.ConflictStatusException: Model version 1.0 is already registered for model densenet161
        at org.pytorch.serve.wlm.ModelVersionedRefs.addVersionModel(ModelVersionedRefs.java:44)
        at org.pytorch.serve.wlm.ModelManager.createVersionedModel(ModelManager.java:481)
        at org.pytorch.serve.wlm.ModelManager.registerModel(ModelManager.java:151)
        at org.pytorch.serve.util.ApiUtils.handleRegister(ApiUtils.java:173)
        at org.pytorch.serve.util.ApiUtils.registerModel(ApiUtils.java:140)
        at org.pytorch.serve.grpcimpl.ManagementImpl.registerModel(ManagementImpl.java:120)
        at org.pytorch.serve.grpc.management.ManagementAPIsServiceGrpc$MethodHandlers.invoke(ManagementAPIsServiceGrpc.java:630)
        at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
        at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:355)
        at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:867)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)

Installation instructions

Installed from pip. Not using docker.

Model Packaging

Tutorial densenet161 mar file https://torchserve.s3.amazonaws.com/mar_files/densenet161.mar with default handler

config.properties

No response

Versions

------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch:

torchserve==0.11.0
torch-model-archiver==0.11.0

Python version: 3.10 (64-bit runtime)
Python executable: /home/mihaidusmanu/.venvs/torchserve/bin/python

Versions of relevant python libraries:
captum==0.7.0
numpy==2.0.0
pillow==10.3.0
psutil==6.0.0
torch==2.3.1
torch-model-archiver==0.11.0
torchserve==0.11.0
torchvision==0.18.1
wheel==0.43.0
torch==2.3.1
**Warning: torchtext not present ..
torchvision==0.18.1
**Warning: torchaudio not present ..

Java Version:

OS: Ubuntu 22.04.4 LTS
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: N/A
CMake version: version 3.28.3

Environment:
library_path (LD_/DYLD_):

Repro instructions

Console 1

mkdir /tmp/model-store-dbg
wget https://torchserve.s3.amazonaws.com/mar_files/densenet161.mar -O /tmp/model-store-dbg/densenet161.mar
torchserve --start --model-store /tmp/model-store-dbg/ --foreground --no-config-snapshots

Console 2

(Explicit mar_set to make sure it does not get re-downloaded)

> python ts_scripts/torchserve_grpc_client.py register densenet161 densenet161.mar

> outputs
## Check densenet161.mar in mar_set : {'densenet161.mar'}
## Register marfile: densenet161.mar

Model densenet161 registered successfully

followed by

> python ts_scripts/torchserve_grpc_client.py register densenet161 densenet161.mar

> outputs
## Check densenet161.mar in mar_set : {'densenet161.mar'}
## Register marfile: densenet161.mar

Failed to register model densenet161.

Possible Solution

After reading through the code a bit, it seems that ApiUtils always calls ModelManager.registerModel with ignoreDuplicate=false. This means that registerModel can throw a ConflictStatusException, which does not seem to be handled anywhere on the gRPC path (see the stack trace above). There are two options:

If either of these options sounds good, I can try to implement a fix and send a PR.
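For illustration only, the failure mode and the desired status mapping can be modelled with a toy registry. This is not TorchServe code: `ToyRegistry` and `handle_register` are hypothetical stand-ins, the `ignore_duplicate` semantics are an assumption, and the actual fix would live in the Java server. Only the exception name and the conflict message are taken from the log above.

```python
import threading


class ConflictStatusException(Exception):
    """Stand-in for org.pytorch.serve.http.ConflictStatusException."""


class ToyRegistry:
    """Toy model of versioned model registration (assumption, not TorchServe)."""

    def __init__(self):
        self._versions = set()
        self._lock = threading.Lock()

    def register(self, name, version, ignore_duplicate=False):
        with self._lock:
            key = (name, version)
            if key in self._versions:
                if ignore_duplicate:
                    return False  # duplicate silently accepted
                raise ConflictStatusException(
                    f"Model version {version} is already registered "
                    f"for model {name}"
                )
            self._versions.add(key)
            return True


def handle_register(registry, name, version):
    """Hypothetical RPC wrapper: map the conflict to a status name and message."""
    try:
        registry.register(name, version)
        return ("OK", f"Model {name} registered")
    except ConflictStatusException as err:
        # Without this branch the exception escapes the handler, which is
        # what surfaces to the client as UNKNOWN with an empty message.
        return ("ALREADY_EXISTS", str(err))
```

In this model, calling `handle_register(registry, "densenet161", "1.0")` twice yields an `OK` status first and an `ALREADY_EXISTS` status with the conflict message second, which is the behavior requested in this issue.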