Subsequent (or concurrent) RegisterModel calls to the Management gRPC endpoint with the same model and version raise a ConflictStatusException, which is not handled. The gRPC client request fails with status UNKNOWN (and an empty message) instead of, e.g., ALREADY_EXISTS along with a human-readable message such as "Same model and version is already registered", which would allow this case to be handled gracefully in multi-threaded / multi-client setups.
Error logs
2024-06-20T16:05:54,602 [DEBUG] grpc-default-executor-0 org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model densenet161
Jun 20, 2024 4:05:54 PM io.grpc.internal.SerializingExecutor run
SEVERE: Exception while executing runnable io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed@a7a4113
org.pytorch.serve.http.ConflictStatusException: Model version 1.0 is already registered for model densenet161
at org.pytorch.serve.wlm.ModelVersionedRefs.addVersionModel(ModelVersionedRefs.java:44)
at org.pytorch.serve.wlm.ModelManager.createVersionedModel(ModelManager.java:481)
at org.pytorch.serve.wlm.ModelManager.registerModel(ModelManager.java:151)
at org.pytorch.serve.util.ApiUtils.handleRegister(ApiUtils.java:173)
at org.pytorch.serve.util.ApiUtils.registerModel(ApiUtils.java:140)
at org.pytorch.serve.grpcimpl.ManagementImpl.registerModel(ManagementImpl.java:120)
at org.pytorch.serve.grpc.management.ManagementAPIsServiceGrpc$MethodHandlers.invoke(ManagementAPIsServiceGrpc.java:630)
at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:355)
at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:867)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
> python ts_scripts/torchserve_grpc_client.py register densenet161 densenet161.mar
> outputs
## Check densenet161.mar in mar_set : {'densenet161.mar'}
## Register marfile: densenet161.mar
Failed to register model densenet161.
Possible Solution
After reading through the code a bit, it seems that ApiUtils always calls ModelManager.registerModel with ignoreDuplicate=false (see the stack trace above). This means that registerModel can throw a ConflictStatusException, which does not seem to be handled anywhere on the gRPC path. There are two options:
Silently ignore duplicates by setting ignoreDuplicate=true, although this could be a bit misleading.
Do not ignore duplicates, and instead handle the ConflictStatusException properly and expose it to clients as e.g. grpc.ALREADY_EXISTS.
If either of these options sounds good, I can try to implement a fix and send a PR.
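To illustrate why the second option would help clients, here is a minimal client-side sketch. It assumes the server has been changed to return ALREADY_EXISTS for duplicate registrations, which is not the current behaviour. The helper name `register_idempotently` is hypothetical. It relies only on the fact that a failed gRPC Python call raises an error exposing `code()`, which returns a `grpc.StatusCode` enum member; the status is compared by name so the sketch runs without importing grpcio.

```python
from typing import Callable


def register_idempotently(register_call: Callable[[], object]) -> bool:
    """Run a registration RPC and treat ALREADY_EXISTS as a non-fatal outcome.

    Returns True if this call registered the model, and False if the same
    model and version had already been registered by another caller.
    (Hypothetical helper, not part of TorchServe.)
    """
    try:
        register_call()
        return True
    except Exception as err:
        # Errors raised by a failed gRPC Python call expose code(), which
        # returns a grpc.StatusCode enum member. Compare by enum name so that
        # this sketch does not need grpcio installed.
        code = getattr(err, "code", None)
        status = code() if callable(code) else None
        if getattr(status, "name", None) == "ALREADY_EXISTS":
            return False
        # UNKNOWN (what the server returns today) and every other error
        # still propagate to the caller.
        raise
```

With a Management stub, this could be called as `register_idempotently(lambda: stub.RegisterModel(request))`, where `stub` and `request` are placeholders for the client's own objects. Concurrent clients racing to register the same model and version would then all proceed, with exactly one of them seeing `True`.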
Installation instructions
Installed from pip. Not using docker.
Model Packaging
Tutorial densenet161 mar file https://torchserve.s3.amazonaws.com/mar_files/densenet161.mar with default handler
config.properties
No response
Versions
Repro instructions
Console 1
Console 2
(Explicit mar_set to make sure it does not get re-downloaded)
followed by