Open · emilwallner opened this issue 7 months ago
Hi @emilwallner, thanks for the extensive issue report.
My thoughts on this are:
That's all I have for now, but happy to continue spitballing and iterating over this until you find a solution!
Best, Matthias
Really, really appreciate your input, @mreso!
I also realized CUDA_LAUNCH_BLOCKING=1 reduces performance by about 70%, so I'll turn it off for now.
Here's my updated check:
def preprocess_and_stack_images(self, images):
    preprocessed_images = []
    for i, img in enumerate(images):
        try:
            preprocessed_img = self.resize_tensor(img)
            if preprocessed_img.shape != (3, 768, 768) or preprocessed_img.min() < 0 or preprocessed_img.max() > 1 or preprocessed_img.dtype != torch.float32:
                # Log information about the image that doesn't meet the requirements
                logger.info(f"Image {i} does not meet the requirements. Replacing with a blank image.")
                preprocessed_img = torch.zeros((3, 768, 768))
        except Exception as e:
            # Log the error message and load a blank image
            logger.error(f"Error occurred while processing Image {i}: {str(e)}. Loading a blank image.")
            preprocessed_img = torch.zeros((3, 768, 768))
        preprocessed_images.append(preprocessed_img)
    images_batch = torch.stack(preprocessed_images, dim=0)
    if len(images_batch.shape) == 3:
        images_batch = images_batch.unsqueeze(0)
    # Second test: Check if the size is (1, 3, 768, 768)
    if images_batch.shape != (1, 3, 768, 768):
        # Log information about the batch that doesn't meet the requirements
        logger.info(f"Batch size {images_batch.shape} does not match the required shape (1, 3, 768, 768). Replacing with a blank batch.")
        images_batch = torch.zeros((1, 3, 768, 768))
    return images_batch
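If it helps, here is a tiny sanity check of the fallback path (a sketch only; `handler` stands in for the actual handler instance, which isn't shown in the issue):

import torch

# Passing an input that resize_tensor cannot handle should exercise the
# except-branch and come back as a single blank (all-zero) image.
batch = handler.preprocess_and_stack_images([None])
print(batch.shape)        # expected: torch.Size([1, 3, 768, 768])
print(batch.abs().sum())  # expected: tensor(0.)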
Again, really appreciate the brainstorming - let's keep at it until we crack this!
Yeah, performance will suffer significantly from CUDA_LAUNCH_BLOCKING as kernels will not run asynchronously. So only activate it if really necessary for debugging.
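If you ever need it again, a minimal sketch for enabling it from Python; the variable has to be set before CUDA is initialized in the worker process:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # forces synchronous kernel launches; debugging only
import torch  # set the variable before importing/initializing CUDA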
You could try to run the model in a notebook with a (1,6,768,768) input and observe the memory usage compared to (2,3,768,768). Wondering why this actually seems to work in the first place.
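A minimal notebook sketch for that comparison, assuming a hypothetical `model` already loaded on the GPU:

import torch

def measure_peak_memory(model, shape, device="cuda"):
    # Reset the allocator's peak counter, run one forward pass, and report the peak.
    torch.cuda.reset_peak_memory_stats(device)
    x = torch.rand(*shape, device=device)
    with torch.no_grad():
        try:
            model(x)
        except RuntimeError as e:
            print(f"{shape}: forward pass failed with: {e}")
    torch.cuda.synchronize(device)
    peak_mb = torch.cuda.max_memory_allocated(device) / 1024**2
    print(f"{shape}: peak allocated {peak_mb:.1f} MB")

# measure_peak_memory(model, (1, 6, 768, 768))
# measure_peak_memory(model, (2, 3, 768, 768))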
I haven't tried the (1,6,768,768) input yet, but since our model is based on three channels, it should throw an error during execution.
Now I double-check the size (1,3,768,768) and the dtype, and ensure the values are in the correct range. Despite that, I'm still hitting a CUDA error: device-side assert triggered when moving the batch with images_batch = images_batch.to(self.device).detach()
Got any more suggestions on what might be causing this?
Cross-post from here with a stacktrace pointing to a real indexing error.
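Since the stacktrace points at an indexing error: one way to surface the bad value with a readable message is to validate indices on the CPU before any lookup runs on the GPU. This is only an illustrative sketch; the names (`indices`, `num_embeddings`) are hypothetical and not taken from the issue:

import torch

def check_index_bounds(indices: torch.Tensor, num_embeddings: int):
    # Out-of-range indices are a classic cause of "device-side assert triggered";
    # checked on the CPU they raise a clear Python error instead of poisoning the CUDA context.
    bad = (indices < 0) | (indices >= num_embeddings)
    if bad.any():
        raise ValueError(
            f"{int(bad.sum())} out-of-range indices "
            f"(min={int(indices.min())}, max={int(indices.max())}, "
            f"allowed [0, {num_embeddings - 1}])"
        )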
Please check {{management_address}}/models/
"jobQueueStatus": { "remainingCapacity": 100, "pendingRequests": 0 }
I found this issue appears randomly when pendingRequests does not increase.
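A small polling sketch against the management API (port and model names taken from the config.properties below; the exact response layout may differ between TorchServe versions):

import requests
import time

MANAGEMENT_ADDRESS = "http://0.0.0.0:8511"

def log_job_queue(model_name="palette_colorizer"):
    resp = requests.get(f"{MANAGEMENT_ADDRESS}/models/{model_name}")
    resp.raise_for_status()
    # Describe-model returns a list of model versions; jobQueueStatus sits on each entry.
    status = resp.json()[0].get("jobQueueStatus", {})
    print(model_name, status)

# while True:
#     log_job_queue()
#     time.sleep(60)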
🐛 Describe the bug
Hey,
First of all, thanks for creating such a fantastic open-source production server.
I'm reaching out due to an unexpected issue I can't solve. I've been running a TorchServe server in production for over a year (several million requests per week) and it's been working great; however, a few weeks ago it started crashing every 1-5 days.
I enabled export CUDA_LAUNCH_BLOCKING=1, and it gives me a CUDA error: device-side assert triggered, and CUDA out of memory when I move my data to the GPU. I also log torch.cuda.max_memory_allocated() and torch.cuda.memory_allocated().
I thought some unique edge case was causing a memory leak, mismatched shapes or NaN values when moving to the GPU, or over-allocation of memory. However, the models use 6180 MiB / 23028 MiB, and torch.cuda.max_memory_allocated logs around 366 MB.
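For reference, a sketch of the kind of logging used around the transfer (the logger name is assumed); torch.cuda.memory_reserved() may be worth adding too, since it reflects what the caching allocator has reserved from the driver and is usually closer to what nvidia-smi reports:

import torch

def log_gpu_memory(tag=""):
    allocated_mb = torch.cuda.memory_allocated() / 1024**2
    peak_mb = torch.cuda.max_memory_allocated() / 1024**2
    reserved_mb = torch.cuda.memory_reserved() / 1024**2  # allocator cache, closer to nvidia-smi
    logger.info(f"{tag} GPU memory: allocated={allocated_mb:.1f} MB, "
                f"peak={peak_mb:.1f} MB, reserved={reserved_mb:.1f} MB")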
When I SSH into an instance that has crashed it looks like this:
https://github.com/pytorch/serve/assets/12543699/a6291f5e-ff26-448e-a216-cdc90029b6ed
The memory is at 6180 MiB, the GPU utilization flickers between 0-16%, and it gives me the CUDA error: device-side assert triggered, and CUDA out of memory.
Unfortunately, I can't find a way to reproduce the error; it happens at random every 1-5 days, and I have to reset the server and allocate a new instance. I've done everything I can think of to check the data before moving it to the GPU and to reduce any memory over-allocation or potential memory leak.
Error logs
Installation instructions
torchserve==0.10.0
Docker image: nvcr.io/nvidia/pytorch:22.12-py3
Ubuntu 20.04 including Python 3.8 NVIDIA CUDA® 11.8.0 NVIDIA cuBLAS 11.11.3.6 NVIDIA cuDNN 8.7.0.84 NVIDIA NCCL 2.15.5 (optimized for NVIDIA NVLink®) NVIDIA RAPIDS™ 22.10.01 (For x86, only these libraries are included: cudf, xgboost, rmm, cuml, and cugraph.) Apex rdma-core 36.0 NVIDIA HPC-X 2.13 OpenMPI 4.1.4+ GDRCopy 2.3 TensorBoard 2.9.0 Nsight Compute 2022.3.0.0 Nsight Systems 2022.4.2.1 NVIDIA TensorRT™ 8.5.1 Torch-TensorRT 1.1.0a0 NVIDIA DALI® 1.20.0 MAGMA 2.6.2 JupyterLab 2.3.2 including Jupyter-TensorBoard TransformerEngine 0.3.0
Model Packaging
The error occurs when I move images_batch to the GPU.
config.properties
inference_address=http://0.0.0.0:8510
management_address=http://0.0.0.0:8511
metrics_address=https://0.0.0.0:8512
number_of_netty_threads=8
netty_client_threads=8
async_logging=true
enable_metrics_api=false
default_workers_per_model=1
max_request_size=20000000
max_response_size=20000000
job_queue_size=100
model_store=./model_store
load_models=all
models={\
  "palette_caption": {\
    "1.0": {\
      "defaultVersion": true,\
      "marName": "palette_caption.mar",\
      "minWorkers": 1,\
      "maxWorkers": 3,\
      "batchSize": 4,\
      "maxBatchDelay": 20,\
      "responseTimeout": 180\
    }\
  },\
  "palette_colorizer": {\
    "1.0": {\
      "defaultVersion": true,\
      "marName": "palette_colorizer.mar",\
      "minWorkers": 2,\
      "maxWorkers": 4,\
      "batchSize": 4,\
      "maxBatchDelay": 20,\
      "responseTimeout": 120\
    }\
  },\
  "palette_ref_colorizer": {\
    "1.0": {\
      "defaultVersion": true,\
      "marName": "palette_ref_colorizer.mar",\
      "minWorkers": 1,\
      "maxWorkers": 2,\
      "batchSize": 4,\
      "maxBatchDelay": 20,\
      "responseTimeout": 120\
    }\
  }\
}
Versions
Pip freeze:
absl-py==1.3.0 aiohttp==3.8.4 aiosignal==1.3.1 aniso8601==9.0.1 annoy==1.17.1 ansi2html==1.9.1 anyio==4.3.0 apex==0.1 appdirs==1.4.4 argon2-cffi==21.3.0 argon2-cffi-bindings==21.2.0 arrow==1.3.0 asttokens==2.2.1 astunparse==1.6.3 async-timeout==4.0.3 attrs==22.1.0 audioread==3.0.0 backcall==0.2.0 beautifulsoup4==4.11.1 bleach==5.0.1 blinker==1.7.0 blis==0.7.9 cachetools==5.2.0 catalogue==2.0.8 certifi==2022.12.7 cffi==1.15.1 charset-normalizer==2.1.1 click==8.1.3 cloudpickle==2.2.0 cmake==3.24.1.1 comm==0.1.2 confection==0.0.3 contourpy==1.0.6 cuda-python @ file:///rapids/cuda_python-11.7.0%2B0.g95a2041.dirty-cp38-cp38-linux_x86_64.whl cudf @ file:///rapids/cudf-22.10.0a0%2B316.gad1ba132d2.dirty-cp38-cp38-linux_x86_64.whl cugraph @ file:///rapids/cugraph-22.10.0a0%2B113.g6bbdadf8.dirty-cp38-cp38-linux_x86_64.whl cuml @ file:///rapids/cuml-22.10.0a0%2B56.g3a8dea659.dirty-cp38-cp38-linux_x86_64.whl cupy-cuda118 @ file:///rapids/cupy_cuda118-11.0.0-cp38-cp38-linux_x86_64.whl cycler==0.11.0 cymem==2.0.7 Cython==0.29.32 dask @ file:///rapids/dask-2022.9.2-py3-none-any.whl dask-cuda @ file:///rapids/dask_cuda-22.10.0a0%2B23.g62a1ee8-py3-none-any.whl dask-cudf @ file:///rapids/dask_cudf-22.10.0a0%2B316.gad1ba132d2.dirty-py3-none-any.whl debugpy==1.6.4 decorator==5.1.1 defusedxml==0.7.1 distributed @ file:///rapids/distributed-2022.9.2-py3-none-any.whl entrypoints==0.4 exceptiongroup==1.0.4 execnet==1.9.0 executing==1.2.0 expecttest==0.1.3 fastapi==0.110.1 fastjsonschema==2.16.2 fastrlock==0.8.1 Flask==3.0.3 Flask-RESTful==0.3.10 fonttools==4.38.0 frozenlist==1.4.1 fsspec==2022.11.0 ftfy==6.1.1 google-auth==2.15.0 google-auth-oauthlib==0.4.6 graphsurgeon @ file:///workspace/TensorRT-8.5.1.7/graphsurgeon/graphsurgeon-0.4.6-py2.py3-none-any.whl grpcio==1.51.1 gunicorn==20.1.0 h11==0.14.0 HeapDict==1.0.1 httptools==0.6.1 hypothesis==5.35.1 idna==3.4 importlib-metadata==5.1.0 importlib-resources==5.10.1 iniconfig==1.1.1 intel-openmp==2021.4.0 ipykernel==6.19.2 ipython==8.7.0 ipython-genutils==0.2.0 itsdangerous==2.2.0 jedi==0.18.2 Jinja2==3.1.2 joblib==1.2.0 json5==0.9.10 jsonschema==4.17.3 jupyter-tensorboard @ git+https://github.com/cliffwoolley/jupyter_tensorboard.git@ffa7e26138b82549453306e06b535a9ac36db17a jupyter_client==7.4.8 jupyter_core==5.1.0 jupyterlab==2.3.2 jupyterlab-pygments==0.2.2 jupyterlab-server==1.2.0 jupytext==1.14.4 kiwisolver==1.4.4 kornia==0.7.2 kornia_rs==0.1.3 langcodes==3.3.0 librosa==0.9.2 llvmlite==0.39.1 locket==1.0.0 Markdown==3.4.1 markdown-it-py==2.1.0 MarkupSafe==2.1.1 matplotlib==3.6.2 matplotlib-inline==0.1.6 mdit-py-plugins==0.3.3 mdurl==0.1.2 mistune==2.0.4 mkl==2021.1.1 mkl-devel==2021.1.1 mkl-include==2021.1.1 mock==4.0.3 mpmath==1.2.1 msgpack==1.0.4 multidict==6.0.5 murmurhash==1.0.9 nbclient==0.7.2 nbconvert==7.2.6 nbformat==5.7.0 nest-asyncio==1.5.6 networkx==2.6.3 notebook==6.4.10 numba==0.56.4 numpy==1.22.2 nvgpu==0.9.0 nvidia-dali-cuda110==1.20.0 nvidia-pyindex==1.0.9 nvtx==0.2.5 oauthlib==3.2.2 onnx @ file:///opt/pytorch/pytorch/third_party/onnx opencv @ file:///opencv-4.6.0/modules/python/package packaging==22.0 pandas==1.5.3 pandocfilters==1.5.0 parso==0.8.3 partd==1.3.0 pathy==0.10.1 pexpect==4.8.0 pickleshare==0.7.5 pillow==10.2.0 pillow-avif-plugin==1.4.2 pillow-heif==0.14.0 pkgutil_resolve_name==1.3.10 platformdirs==2.6.0 pluggy==1.0.0 polygraphy==0.43.1 pooch==1.6.0 preshed==3.0.8 prettytable==3.5.0 prometheus-client==0.15.0 prompt-toolkit==3.0.36 protobuf==3.20.1 psutil==5.9.4 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow @ 
file:///rapids/pyarrow-9.0.0-cp38-cp38-linux_x86_64.whl pyasn1==0.4.8 pyasn1-modules==0.2.8 pybind11==2.10.1 pycocotools @ git+https://github.com/nvidia/cocoapi.git@8b8fd68576675c3ee77402e61672d65a7d826ddf#subdirectory=PythonAPI pycparser==2.21 pydantic==1.9.2 Pygments==2.13.0 pylibcugraph @ file:///rapids/pylibcugraph-22.10.0a0%2B113.g6bbdadf8.dirty-cp38-cp38-linux_x86_64.whl pylibraft @ file:///rapids/pylibraft-22.10.0a0%2B81.g08abc72.dirty-cp38-cp38-linux_x86_64.whl pynvml==11.4.1 pyparsing==3.0.9 pyrsistent==0.19.2 pytest==7.2.0 pytest-rerunfailures==10.3 pytest-shard==0.1.2 pytest-xdist==3.1.0 python-dateutil==2.8.2 python-dotenv==1.0.1 python-hostlist==1.22 python-multipart==0.0.5 pytorch-quantization==2.1.2 pytz==2022.6 PyYAML==6.0 pyzmq==24.0.1 raft-dask @ file:///rapids/raft_dask-22.10.0a0%2B81.g08abc72.dirty-cp38-cp38-linux_x86_64.whl regex==2022.10.31 requests==2.28.2 requests-oauthlib==1.3.1 resampy==0.4.2 rmm @ file:///rapids/rmm-22.10.0a0%2B38.ge043158.dirty-cp38-cp38-linux_x86_64.whl rsa==4.9 scikit-learn @ file:///rapids/scikit_learn-0.24.2-cp38-cp38-manylinux2010_x86_64.whl scipy==1.6.3 Send2Trash==1.8.0 six==1.16.0 smart-open==6.3.0 sniffio==1.3.1 sortedcontainers==2.4.0 soundfile==0.11.0 soupsieve==2.3.2.post1 spacy==3.4.4 spacy-legacy==3.0.10 spacy-loggers==1.0.4 sphinx-glpi-theme==0.3 srsly==2.4.5 stack-data==0.6.2 starlette==0.37.2 sympy==1.11.1 tabulate==0.9.0 tbb==2021.7.1 tblib==1.7.0 tensorboard-data-server==0.6.1 tensorboard-plugin-wit==1.8.1 tensorrt @ file:///workspace/TensorRT-8.5.1.7/python/tensorrt-8.5.1.7-cp38-none-linux_x86_64.whl termcolor==2.4.0 terminado==0.17.1 thinc==8.1.5 threadpoolctl==3.1.0 tinycss2==1.2.1 tinydb==4.7.0 toml==0.10.2 tomli==2.0.1 toolz==0.12.0 torch==1.14.0a0+410ce96 torch-tensorrt @ file:///opt/pytorch/torch_tensorrt/py/dist/torch_tensorrt-1.3.0a0-cp38-cp38-linux_x86_64.whl torchserve==0.10.0 torchtext @ git+https://github.com/pytorch/text@fae8e8cabf7adcbbc2f09c0520216288fd53f33b torchvision @ file:///opt/pytorch/vision tornado==6.1 tqdm==4.64.1 traitlets==5.7.1 transformer-engine @ git+https://github.com/NVIDIA/TransformerEngine.git@73166c4e3f6cf0e754045ba22ff461ef96453aeb treelite @ file:///rapids/treelite-2.4.0-py3-none-manylinux2014_x86_64.whl treelite-runtime @ file:///rapids/treelite_runtime-2.4.0-py3-none-manylinux2014_x86_64.whl typer==0.7.0 types-python-dateutil==2.9.0.20240316 typing_extensions==4.11.0 ucx-py @ file:///rapids/ucx_py-0.27.0a0%2B29.ge9e81f8-cp38-cp38-linux_x86_64.whl uff @ file:///workspace/TensorRT-8.5.1.7/uff/uff-0.6.9-py2.py3-none-any.whl urllib3==1.26.13 uvicorn==0.20.0 uvloop==0.19.0 wasabi==0.10.1 watchfiles==0.21.0 wcwidth==0.2.5 webencodings==0.5.1 websockets==12.0 Werkzeug==3.0.2 xdoctest==1.0.2 xgboost @ file:///rapids/xgboost-1.6.2-cp38-cp38-linux_x86_64.whl yarl==1.9.4 zict==2.2.0 zipp==3.11.0
Repro instructions
Unfortunately, I can't find a way to reproduce the error; it randomly appears every 1-5 days.
Possible Solution
There are a few things that are a bit odd about this issue:
I've run out of ideas; any thoughts or feedback would be much appreciated.