vanvalenlab / kiosk-console

DeepCell Kiosk Distribution for Kubernetes on GKE and AWS
https://deepcell-kiosk.readthedocs.io

Deepcell.org deployment returns error on Predict console #353

Closed zsdqui closed 4 years ago

zsdqui commented 4 years ago

Describe the bug Using the online deployment of DeepCell at DeepCell.org, the Predict console has been returning an error as of 2 days ago.

To Reproduce Steps to reproduce the behavior:

  1. Navigate to the deployed kiosk-console at http://www.deepcell.org/predict
  2. Upload any image (the bug is reproducible with the HeLa_nuclear.png sample data from the website) and press Submit
  3. See the error Job Failed with the following traceback:

```
Traceback (most recent call last):
  File "/usr/src/app/redis_consumer/consumers/base_consumer.py", line 189, in consume
    status = self._consume(redis_hash)
  File "/usr/src/app/redis_consumer/consumers/image_consumer.py", line 220, in _consume
    scale = self.detect_scale(image)
  File "/usr/src/app/redis_consumer/consumers/image_consumer.py", line 112, in detect_scale
    untile=False)
  File "/usr/src/app/redis_consumer/consumers/base_consumer.py", line 533, in predict
    model_dtype, untile=untile)
  File "/usr/src/app/redis_consumer/consumers/base_consumer.py", line 408, in _predict_big_image
    in_tensor_dtype=model_dtype)
  File "/usr/src/app/redis_consumer/consumers/base_consumer.py", line 277, in grpc_image
    prediction = client.predict(req_data, settings.GRPC_TIMEOUT)
  File "/usr/src/app/redis_consumer/grpc_clients.py", line 185, in predict
    response = self._retry_grpc(request, request_timeout)
  File "/usr/src/app/redis_consumer/grpc_clients.py", line 155, in _retry_grpc
    raise err
  File "/usr/src/app/redis_consumer/grpc_clients.py", line 123, in _retry_grpc
    response = api_call(request, timeout=request_timeout)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 826, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.6/site-packages/grpc/_channel.py", line 729, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.INVALID_ARGUMENT
    details = "Task size 64 is larger than maximum batch size 32"
    debug_error_string = "{"created":"@1590453871.004536337","description":"Error received from peer ipv4:10.3.252.221:8500","file":"src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Task size 64 is larger than maximum batch size 32","grpc_status":3}"
>
```
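For context on where the "maximum batch size 32" in this error comes from: TensorFlow Serving enforces its batch limit through the batching parameters file it is started with (`--enable_batching --batching_parameters_file=...`). A minimal sketch of such a file follows; the values are illustrative and not necessarily the actual deepcell.org deployment config:

```
# batching_parameters.txt: text-format proto read by TensorFlow Serving.
# A request whose task size exceeds max_batch_size is rejected with
# StatusCode.INVALID_ARGUMENT, producing the error in the traceback above.
max_batch_size { value: 32 }
batch_timeout_micros { value: 1000 }  # illustrative
max_enqueued_batches { value: 100 }   # illustrative
num_batch_threads { value: 8 }        # illustrative
```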

Expected behavior A segmentation image is returned (this was working ~7-10 days ago).

Screenshots [Screenshot of the Job Failed error shown in the Predict console]

Desktop: Safari 13.1 and Chrome 81.0.4044.138 on macOS

Additional context I've tried with and without auto-resizing, and with images of several different sizes; the same error occurs in every case.

willgraf commented 4 years ago

Thanks for bringing this to our attention! The root cause of this issue is a mismatch between the MAX_BATCH_SIZE of TensorFlow Serving (set to 32 for T4 GPUs) and the TF_MAX_BATCH_SIZE in the segmentation consumer (defaults to 64). After changing the latter from 64 to 32, the issue is resolved.
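For anyone debugging a similar mismatch in their own deployment, here is a minimal sketch of the client side of the interaction. It assumes the consumer reads its batch cap from a TF_MAX_BATCH_SIZE environment variable and splits image tiles into batches of that size before each gRPC call; the function and variable names are illustrative, not the actual kiosk-redis-consumer code:

```python
import os

# Hypothetical mirror of the consumer-side batch cap. If this exceeds
# TensorFlow Serving's max_batch_size (32 on the T4 deployment), every
# request is rejected with StatusCode.INVALID_ARGUMENT.
TF_MAX_BATCH_SIZE = int(os.getenv('TF_MAX_BATCH_SIZE', '64'))


def batch_tiles(tiles, batch_size=TF_MAX_BATCH_SIZE):
    """Yield successive batches of at most batch_size tiles."""
    for i in range(0, len(tiles), batch_size):
        yield tiles[i:i + batch_size]


if __name__ == '__main__':
    tiles = list(range(100))  # stand-in for the tiles of a large image
    for batch in batch_tiles(tiles, batch_size=32):  # 32 matches the server
        print('sending batch of size', len(batch))
```

With the consumer's cap lowered to 32, each request stays within the server's batching limit and the INVALID_ARGUMENT error goes away.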