pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

Worker dead and yet describe_model gives me `worker.status: Ready` #3191

Open MohamedAliRashad opened 2 weeks ago

MohamedAliRashad commented 2 weeks ago

🐛 Describe the bug

Worker dead and yet describe_model gives me `worker.status: Ready`

Error logs

This is what describe_model returned to me:

{'modelName': 'myModel',
 'modelVersion': '1.0',
 'modelUrl': 'myModel.mar',
 'runtime': 'python',
 'minWorkers': 2,
 'maxWorkers': 2,
 'batchSize': 8,
 'maxBatchDelay': 50,
 'loadedAtStartup': True,
 'workers': [{'id': '9000',
              'startTime': '2024-06-13T01:46:32.808Z',
              'status': 'READY',
              'memoryUsage': 1383956480,
              'pid': 1501726,
              'gpu': True,
              'gpuUsage': {'gpuId': 0, 'utilization.gpu': 0, 'utilization.memory': 0, 'memory.used': 8481}},
             {'id': '9001',
              'startTime': '2024-06-13T01:46:32.809Z',
              'status': 'READY',
              'memoryUsage': 1363308544,
              'pid': 1501727,
              'gpu': True,
              'gpuUsage': {'gpuId': 0, 'utilization.gpu': 0, 'utilization.memory': 0, 'memory.used': 8481}}],
 'jobQueueStatus': {'remainingCapacity': 0, 'pendingRequests': 1000}}
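For reference, this is roughly how the status can be queried (a minimal sketch against the REST management API on the default port 8081; my deployment also exposes gRPC management, so treat the address as an assumption):

```python
import requests

MANAGEMENT_URL = "http://localhost:8081"  # assumption: default REST management port

def describe_model(model_name: str, version: str = "1.0") -> dict:
    """Ask TorchServe's management API for the model's metadata and worker list."""
    resp = requests.get(f"{MANAGEMENT_URL}/models/{model_name}/{version}")
    resp.raise_for_status()
    return resp.json()[0]  # the API returns a one-element JSON array

info = describe_model("myModel")
for worker in info["workers"]:
    # Still prints READY for both workers even after they have died.
    print(worker["id"], worker["status"], worker["pid"])
```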

And then when I run a prediction, it gives me this error:

        debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-06-13T14:04:08.580558509+00:00", grpc_status:13, grpc_message:"Model \"myModel\" has no worker to serve inference request. Please use scale workers API to add workers. If this is a sequence inference, please check if it is closed, or expired; or exceeds maxSequenceJobQueueSize\nInternalServerException.()"}"
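For context, the prediction goes over gRPC on port 7070. A minimal sketch of the kind of call that hits the error above, assuming Python stubs generated from TorchServe's inference.proto (the inference_pb2 / inference_pb2_grpc names come from the standard protoc output, and sample_input.json is a placeholder):

```python
import grpc

# Assumption: these modules were generated with protoc from TorchServe's inference.proto.
import inference_pb2
import inference_pb2_grpc

channel = grpc.insecure_channel("localhost:7070")  # grpc_inference_port from config.properties
stub = inference_pb2_grpc.InferenceAPIsServiceStub(channel)

with open("sample_input.json", "rb") as f:  # placeholder payload
    payload = f.read()

try:
    response = stub.Predictions(
        inference_pb2.PredictionsRequest(model_name="myModel", input={"data": payload})
    )
    print(response)
except grpc.RpcError as err:
    # Once the workers are dead this raises INTERNAL (grpc_status 13) with the
    # "has no worker to serve inference request" message shown above.
    print(err.code(), err.details())
```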

Installation instructions

Docker: torchserve 0.10.0

Model Packaging

torch-model-archiver with a custom handler

config.properties

async_logging=true
certificate_file=centralized_cert.pem
default_response_timeout=20
enable_envvars_config=true
enable_grpc_ssl=false
grpc_inference_port=7070
grpc_management_address=7071
job_queue_size=1000
model_store=/app/model-store
models={"myModel"\: {"1.0"\: {"defaultVersion"\: true, "marName"\: "myModel.mar", "minWorkers"\: "2", "maxWorkers"\: "2", "batchSize"\: "8", "maxBatchDelay"\: 50}}}
private_key_file=private_key.key

Versions

I am using the torchserve:0.10.0 Docker image, and this is the output of `pip freeze`:

aniso8601==9.0.1
ansi2html==1.9.1
anyio==4.4.0
arrow==1.3.0
astor==0.8.1
attrdict==2.0.1
Babel==2.15.0
bce-python-sdk==0.9.14
beautifulsoup4==4.12.3
blinker==1.7.0
cachetools==5.3.3
captum==0.6.0
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
contourpy==1.2.0
cssselect==1.2.0
cssutils==2.11.1
cycler==0.12.1
Cython==3.0.5
decorator==5.1.1
enum-compat==0.0.3
et-xmlfile==1.1.0
exceptiongroup==1.2.1
filelock==3.13.1
fire==0.6.0
Flask==3.0.2
flask-babel==4.0.0
Flask-RESTful==0.3.10
fonttools==4.49.0
fsspec==2024.2.0
future==1.0.0
h11==0.14.0
httpcore==1.0.5
httpx==0.27.0
huggingface-hub==0.23.3
idna==3.6
imageio==2.34.1
imgaug==0.4.0
importlib_metadata==7.0.2
importlib_resources==6.3.0
itsdangerous==2.1.2
Jinja2==3.1.3
joblib==1.4.2
jproperties==2.1.1
kiwisolver==1.4.5
lazy_loader==0.4
lmdb==1.4.1
lxml==5.2.2
MarkupSafe==2.1.5
matplotlib==3.8.3
more-itertools==10.3.0
mpmath==1.3.0
networkx==3.2.1
ninja==1.11.1.1
numpy==1.24.3
nvgpu==0.10.0
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
opencv-contrib-python==4.6.0.66
opencv-python==4.6.0.66
opencv-python-headless==4.10.0.82
openpyxl==3.1.4
opt-einsum==3.3.0
packaging==23.2
paddleocr==2.7.0.3
paddlepaddle-gpu==2.6.0
pandas==2.2.1
pdf2docx==0.5.8
pillow==10.2.0
premailer==3.10.0
protobuf==5.27.1
psutil==5.9.5
pyclipper==1.3.0.post5
pycryptodome==3.20.0
PyMuPDF==1.20.2
pynvml==11.5.0
pyparsing==3.1.2
python-dateutil==2.9.0.post0
python-docx==1.1.2
pytz==2024.1
PyYAML==6.0
rapidfuzz==3.9.3
rarfile==4.2
regex==2024.5.15
requests==2.31.0
safetensors==0.4.3
scikit-image==0.22.0
scikit-learn==1.5.0
scipy==1.13.1
sentence-transformers==2.5.1
shapely==2.0.4
six==1.16.0
sniffio==1.3.1
soupsieve==2.5
sympy==1.12
tabulate==0.9.0
termcolor==2.4.0
threadpoolctl==3.5.0
tifffile==2024.5.22
timm==0.8.17.dev0
tokenizers==0.19.1
torch==2.2.1+cu121
torch-model-archiver==0.10.0
torch-workflow-archiver==0.2.12
torchaudio==2.2.1+cu121
torchdata==0.7.1
torchserve==0.10.0
torchtext==0.17.1
torchvision==0.17.1+cu121
tqdm==4.66.2
transformers==4.41.2
triton==2.2.0
types-python-dateutil==2.8.19.20240311
typing_extensions==4.10.0
tzdata==2024.1
urllib3==2.2.1
visualdl==2.5.3
Werkzeug==3.0.1
zipp==3.18.1

Repro instructions

I stress-tested a model until it threw an IllegalStateException and all of my workers died. I then sent a management request and a ping, and both reported that everything was fine when it was not.
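A rough sketch of the kind of load I mean (this version floods the default HTTP inference port 8080 with a placeholder payload; the exact client and input do not matter, only the overload):

```python
import concurrent.futures

import requests

INFERENCE_URL = "http://localhost:8080/predictions/myModel"  # assumption: default REST inference port

def fire_request(payload: bytes):
    """Send one inference request and report its outcome."""
    try:
        return requests.post(INFERENCE_URL, data=payload, timeout=30).status_code
    except requests.RequestException as err:
        return type(err).__name__

with open("sample_input.json", "rb") as f:  # placeholder payload
    payload = f.read()

# Far more concurrent requests than two workers with job_queue_size=1000 can absorb.
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(fire_request, [payload] * 5000))

print({outcome: results.count(outcome) for outcome in set(results)})
```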

Possible Solution

The best workaround I can think of is to check the error message whenever an error like this happens and, if it tells me to scale the workers, simply call the scale-workers API. But that should not be the expected behaviour: describe_model should report the actual state of the workers.
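Concretely, the workaround I have in mind would look something like this (a sketch only, using the scale-workers call of the REST management API on the default port 8081):

```python
import requests

MANAGEMENT_URL = "http://localhost:8081"  # assumption: default REST management port

def rescue_workers(error_message: str, model_name: str = "myModel", min_worker: int = 2) -> None:
    """If the inference error says no workers are available, ask TorchServe to scale them back up."""
    if "has no worker to serve inference request" in error_message:
        resp = requests.put(
            f"{MANAGEMENT_URL}/models/{model_name}",
            params={"min_worker": min_worker, "synchronous": "true"},
        )
        resp.raise_for_status()
        print(resp.json())
```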