pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/
Apache License 2.0

CUDA out of Memory with low Memory Utilization (CUDA error: device-side assert triggered) #3114

Open emilwallner opened 7 months ago

emilwallner commented 7 months ago

šŸ› Describe the bug

Hey,

First of all, thanks for creating such a fantastic open-source production server.

I'm reaching out due to an unexpected issue I can't solve. I've been running a TorchServe server in production for over a year (several million requests per week) and it's been working great; however, a few weeks ago it started crashing every 1-5 days.

I enabled export CUDA_LAUNCH_BLOCKING=1, and it gives me a CUDA error: device-side assert triggered and a CUDA out of memory error when I move my data to the GPU. I also log torch.cuda.max_memory_allocated() and torch.cuda.memory_allocated().

I suspected some unique edge case: a memory leak, mismatched shapes or NaN values when moving data to the GPU, or allocating too much memory. However, the models use 6180 MiB / 23028 MiB, and torch.cuda.max_memory_allocated() reports around 366 MB.
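
For reference, roughly how I log these stats (a simplified sketch, not the exact production code):

    import logging

    import torch

    logger = logging.getLogger(__name__)

    def log_cuda_memory(tag: str, device: str = "cuda") -> None:
        # Log currently allocated and peak allocated GPU memory in MB.
        allocated = torch.cuda.memory_allocated(device) / 1024**2
        peak = torch.cuda.max_memory_allocated(device) / 1024**2
        logger.info(f"{tag}: allocated={allocated:.1f} MB, peak={peak:.1f} MB")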

When I SSH into an instance that has crashed it looks like this:

https://github.com/pytorch/serve/assets/12543699/a6291f5e-ff26-448e-a216-cdc90029b6ed

The memory usage sits at 6180 MiB, GPU utilization flickers between 0-16%, and it still gives me the CUDA error: device-side assert triggered and CUDA out of memory.

Unfortunately, I can't find a way to reproduce the error; it happens at random every 1-5 days, and I have to reset the server and allocate a new instance. I've done everything I can think of to validate the data before moving it to the GPU and to reduce any memory overload or potential memory leak.

Error logs

(Two screenshots of the worker error logs, captured 2024-04-25 at 22:28:59 and 22:29:11, were attached here.)

Installation instructions

torchserve==0.10.0

Docker image: nvcr.io/nvidia/pytorch:22.12-py3

Ubuntu 20.04 including Python 3.8
NVIDIA CUDA® 11.8.0
NVIDIA cuBLAS 11.11.3.6
NVIDIA cuDNN 8.7.0.84
NVIDIA NCCL 2.15.5 (optimized for NVIDIA NVLink®)
NVIDIA RAPIDS™ 22.10.01 (for x86, only these libraries are included: cudf, xgboost, rmm, cuml, and cugraph)
Apex
rdma-core 36.0
NVIDIA HPC-X 2.13
OpenMPI 4.1.4+
GDRCopy 2.3
TensorBoard 2.9.0
Nsight Compute 2022.3.0.0
Nsight Systems 2022.4.2.1
NVIDIA TensorRT™ 8.5.1
Torch-TensorRT 1.1.0a0
NVIDIA DALI® 1.20.0
MAGMA 2.6.2
JupyterLab 2.3.2 including Jupyter-TensorBoard
TransformerEngine 0.3.0

Model Packaging


    def create_pil_image(self, image_data):
        try:
            image = Image.open(io.BytesIO(image_data)).convert("RGB")
            return image
        except IOError as e:
            # If the image data is not valid or not provided, create a blank image.
            width, height = 776, 776  # Set desired width and height for the blank image
            color = (255, 255, 255)  # Set desired color for the blank image (white in this case)
            image = Image.new("RGB", (width, height), color)
            return image

    def preprocess_and_stack_images(self, images):
        preprocessed_images = []
        for i, img in enumerate(images):
            try:
                preprocessed_img = self.resize_tensor(img)
                if preprocessed_img.shape != (3, 768, 768) or preprocessed_img.min() < 0 or preprocessed_img.max() > 1:
                    # Log information about the image that doesn't meet the requirements
                    logger.info(f"Image {i} does not meet the requirements. Replacing with a blank image.")
                    preprocessed_img = torch.zeros((3, 768, 768))
            except Exception as e:
                # Log the error message and load a blank image
                logger.error(f"Error occurred while processing Image {i}: {str(e)}. Loading a blank image.")
                preprocessed_img = torch.zeros((3, 768, 768))
            preprocessed_images.append(preprocessed_img)

        images_batch = torch.stack(preprocessed_images, dim=0)
        if len(images_batch.shape) == 3:
            images_batch = images_batch.unsqueeze(0)
        return images_batch

    def preprocess(self, data):

        images = []
        fns = []
        texts = []
        size = []
        merges = []
        org_images = []
        watermarks = []
        white_balance_list = []
        auto_color_list = []
        temperature_list = []
        saturation_list = []

        for row in data:

            image = row["image"]
            fn = self.decode_field(row["fn"])
            text = self.decode_field(row["text"])
            merged = self.decode_field(row["merged"])
            merged = True if merged.lower() == 'true' else False
            resolution = self.decode_field(row["resolution"])

            white_balance = self.decode_field(row["white_balance"])
            auto_color = self.decode_field(row["auto_color"])
            temperature = float(self.decode_field(row["temperature"]))
            saturation = float(self.decode_field(row["saturation"]))

            auto_color =  True if auto_color == 'true' else False
            white_balance = True if white_balance == 'true' else False
            watermark = True if 'watermarked' in resolution else False

            if isinstance(image, str):
                logger.info(f"Image data should not be a string. Please provide the image data as bytes.")
                width, height = 224, 224  # Set desired width and height for the blank image
                color = (255, 255, 255)  # Set desired color for the blank image (white in this case)
                image = Image.new("RGB", (width, height), color)
            if isinstance(image, (bytearray, bytes)):
                image = self.create_pil_image(image)
                image = self.resize_image(image, resolution)

            org_images.append(image)
            texts.append(text)
            images.append(image)
            fns.append(fn)
            merges.append(merged)
            watermarks.append(watermark)
            white_balance_list.append(white_balance)
            temperature_list.append(temperature)
            saturation_list.append(saturation)
            auto_color_list.append(auto_color)

        texts_raw = self.tokenizer(texts) #type(torch.int32)
        texts = self.token_embedding(texts_raw).type(torch.float16) 
        texts = texts + self.positional_embedding.type(torch.float16)

        images_batch = self.preprocess_and_stack_images(images)

The error occurs when I move images_batch to the GPU; a rough sketch of that step is below.
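
Roughly, the failing step looks like this (illustrative sketch; the helper name and the checks here are just for this post, and the device comes from the handler's initialize):

    import torch

    def move_batch_to_device(images_batch: torch.Tensor, device) -> torch.Tensor:
        # Illustrative sketch of the failing step: sanity-check the batch on the
        # CPU, then transfer it to the GPU, which is where the CUDA error appears.
        assert images_batch.dtype == torch.float32, images_batch.dtype
        assert images_batch.shape[1:] == (3, 768, 768), images_batch.shape
        assert torch.isfinite(images_batch).all(), "non-finite values in batch"
        return images_batch.to(device).detach()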

config.properties

    inference_address=http://0.0.0.0:8510
    management_address=http://0.0.0.0:8511
    metrics_address=https://0.0.0.0:8512
    number_of_netty_threads=8
    netty_client_threads=8
    async_logging=true
    enable_metrics_api=false
    default_workers_per_model=1
    max_request_size=20000000
    max_response_size=20000000
    job_queue_size=100
    model_store=./model_store
    load_models=all
    models={\
      "palette_caption": {\
        "1.0": {\
          "defaultVersion": true,\
          "marName": "palette_caption.mar",\
          "minWorkers": 1,\
          "maxWorkers": 3,\
          "batchSize": 4,\
          "maxBatchDelay": 20,\
          "responseTimeout": 180\
        }\
      },\
      "palette_colorizer": {\
        "1.0": {\
          "defaultVersion": true,\
          "marName": "palette_colorizer.mar",\
          "minWorkers": 2,\
          "maxWorkers": 4,\
          "batchSize": 4,\
          "maxBatchDelay": 20,\
          "responseTimeout": 120\
        }\
      },\
      "palette_ref_colorizer": {\
        "1.0": {\
          "defaultVersion": true,\
          "marName": "palette_ref_colorizer.mar",\
          "minWorkers": 1,\
          "maxWorkers": 2,\
          "batchSize": 4,\
          "maxBatchDelay": 20,\
          "responseTimeout": 120\
        }\
      }\
    }

Versions

Pip freeze:

absl-py==1.3.0 aiohttp==3.8.4 aiosignal==1.3.1 aniso8601==9.0.1 annoy==1.17.1 ansi2html==1.9.1 anyio==4.3.0 apex==0.1 appdirs==1.4.4 argon2-cffi==21.3.0 argon2-cffi-bindings==21.2.0 arrow==1.3.0 asttokens==2.2.1 astunparse==1.6.3 async-timeout==4.0.3 attrs==22.1.0 audioread==3.0.0 backcall==0.2.0 beautifulsoup4==4.11.1 bleach==5.0.1 blinker==1.7.0 blis==0.7.9 cachetools==5.2.0 catalogue==2.0.8 certifi==2022.12.7 cffi==1.15.1 charset-normalizer==2.1.1 click==8.1.3 cloudpickle==2.2.0 cmake==3.24.1.1 comm==0.1.2 confection==0.0.3 contourpy==1.0.6 cuda-python @ file:///rapids/cuda_python-11.7.0%2B0.g95a2041.dirty-cp38-cp38-linux_x86_64.whl cudf @ file:///rapids/cudf-22.10.0a0%2B316.gad1ba132d2.dirty-cp38-cp38-linux_x86_64.whl cugraph @ file:///rapids/cugraph-22.10.0a0%2B113.g6bbdadf8.dirty-cp38-cp38-linux_x86_64.whl cuml @ file:///rapids/cuml-22.10.0a0%2B56.g3a8dea659.dirty-cp38-cp38-linux_x86_64.whl cupy-cuda118 @ file:///rapids/cupy_cuda118-11.0.0-cp38-cp38-linux_x86_64.whl cycler==0.11.0 cymem==2.0.7 Cython==0.29.32 dask @ file:///rapids/dask-2022.9.2-py3-none-any.whl dask-cuda @ file:///rapids/dask_cuda-22.10.0a0%2B23.g62a1ee8-py3-none-any.whl dask-cudf @ file:///rapids/dask_cudf-22.10.0a0%2B316.gad1ba132d2.dirty-py3-none-any.whl debugpy==1.6.4 decorator==5.1.1 defusedxml==0.7.1 distributed @ file:///rapids/distributed-2022.9.2-py3-none-any.whl entrypoints==0.4 exceptiongroup==1.0.4 execnet==1.9.0 executing==1.2.0 expecttest==0.1.3 fastapi==0.110.1 fastjsonschema==2.16.2 fastrlock==0.8.1 Flask==3.0.3 Flask-RESTful==0.3.10 fonttools==4.38.0 frozenlist==1.4.1 fsspec==2022.11.0 ftfy==6.1.1 google-auth==2.15.0 google-auth-oauthlib==0.4.6 graphsurgeon @ file:///workspace/TensorRT-8.5.1.7/graphsurgeon/graphsurgeon-0.4.6-py2.py3-none-any.whl grpcio==1.51.1 gunicorn==20.1.0 h11==0.14.0 HeapDict==1.0.1 httptools==0.6.1 hypothesis==5.35.1 idna==3.4 importlib-metadata==5.1.0 importlib-resources==5.10.1 iniconfig==1.1.1 intel-openmp==2021.4.0 ipykernel==6.19.2 ipython==8.7.0 ipython-genutils==0.2.0 itsdangerous==2.2.0 jedi==0.18.2 Jinja2==3.1.2 joblib==1.2.0 json5==0.9.10 jsonschema==4.17.3 jupyter-tensorboard @ git+https://github.com/cliffwoolley/jupyter_tensorboard.git@ffa7e26138b82549453306e06b535a9ac36db17a jupyter_client==7.4.8 jupyter_core==5.1.0 jupyterlab==2.3.2 jupyterlab-pygments==0.2.2 jupyterlab-server==1.2.0 jupytext==1.14.4 kiwisolver==1.4.4 kornia==0.7.2 kornia_rs==0.1.3 langcodes==3.3.0 librosa==0.9.2 llvmlite==0.39.1 locket==1.0.0 Markdown==3.4.1 markdown-it-py==2.1.0 MarkupSafe==2.1.1 matplotlib==3.6.2 matplotlib-inline==0.1.6 mdit-py-plugins==0.3.3 mdurl==0.1.2 mistune==2.0.4 mkl==2021.1.1 mkl-devel==2021.1.1 mkl-include==2021.1.1 mock==4.0.3 mpmath==1.2.1 msgpack==1.0.4 multidict==6.0.5 murmurhash==1.0.9 nbclient==0.7.2 nbconvert==7.2.6 nbformat==5.7.0 nest-asyncio==1.5.6 networkx==2.6.3 notebook==6.4.10 numba==0.56.4 numpy==1.22.2 nvgpu==0.9.0 nvidia-dali-cuda110==1.20.0 nvidia-pyindex==1.0.9 nvtx==0.2.5 oauthlib==3.2.2 onnx @ file:///opt/pytorch/pytorch/third_party/onnx opencv @ file:///opencv-4.6.0/modules/python/package packaging==22.0 pandas==1.5.3 pandocfilters==1.5.0 parso==0.8.3 partd==1.3.0 pathy==0.10.1 pexpect==4.8.0 pickleshare==0.7.5 pillow==10.2.0 pillow-avif-plugin==1.4.2 pillow-heif==0.14.0 pkgutil_resolve_name==1.3.10 platformdirs==2.6.0 pluggy==1.0.0 polygraphy==0.43.1 pooch==1.6.0 preshed==3.0.8 prettytable==3.5.0 prometheus-client==0.15.0 prompt-toolkit==3.0.36 protobuf==3.20.1 psutil==5.9.4 ptyprocess==0.7.0 pure-eval==0.2.2 pyarrow @ 
file:///rapids/pyarrow-9.0.0-cp38-cp38-linux_x86_64.whl pyasn1==0.4.8 pyasn1-modules==0.2.8 pybind11==2.10.1 pycocotools @ git+https://github.com/nvidia/cocoapi.git@8b8fd68576675c3ee77402e61672d65a7d826ddf#subdirectory=PythonAPI pycparser==2.21 pydantic==1.9.2 Pygments==2.13.0 pylibcugraph @ file:///rapids/pylibcugraph-22.10.0a0%2B113.g6bbdadf8.dirty-cp38-cp38-linux_x86_64.whl pylibraft @ file:///rapids/pylibraft-22.10.0a0%2B81.g08abc72.dirty-cp38-cp38-linux_x86_64.whl pynvml==11.4.1 pyparsing==3.0.9 pyrsistent==0.19.2 pytest==7.2.0 pytest-rerunfailures==10.3 pytest-shard==0.1.2 pytest-xdist==3.1.0 python-dateutil==2.8.2 python-dotenv==1.0.1 python-hostlist==1.22 python-multipart==0.0.5 pytorch-quantization==2.1.2 pytz==2022.6 PyYAML==6.0 pyzmq==24.0.1 raft-dask @ file:///rapids/raft_dask-22.10.0a0%2B81.g08abc72.dirty-cp38-cp38-linux_x86_64.whl regex==2022.10.31 requests==2.28.2 requests-oauthlib==1.3.1 resampy==0.4.2 rmm @ file:///rapids/rmm-22.10.0a0%2B38.ge043158.dirty-cp38-cp38-linux_x86_64.whl rsa==4.9 scikit-learn @ file:///rapids/scikit_learn-0.24.2-cp38-cp38-manylinux2010_x86_64.whl scipy==1.6.3 Send2Trash==1.8.0 six==1.16.0 smart-open==6.3.0 sniffio==1.3.1 sortedcontainers==2.4.0 soundfile==0.11.0 soupsieve==2.3.2.post1 spacy==3.4.4 spacy-legacy==3.0.10 spacy-loggers==1.0.4 sphinx-glpi-theme==0.3 srsly==2.4.5 stack-data==0.6.2 starlette==0.37.2 sympy==1.11.1 tabulate==0.9.0 tbb==2021.7.1 tblib==1.7.0 tensorboard-data-server==0.6.1 tensorboard-plugin-wit==1.8.1 tensorrt @ file:///workspace/TensorRT-8.5.1.7/python/tensorrt-8.5.1.7-cp38-none-linux_x86_64.whl termcolor==2.4.0 terminado==0.17.1 thinc==8.1.5 threadpoolctl==3.1.0 tinycss2==1.2.1 tinydb==4.7.0 toml==0.10.2 tomli==2.0.1 toolz==0.12.0 torch==1.14.0a0+410ce96 torch-tensorrt @ file:///opt/pytorch/torch_tensorrt/py/dist/torch_tensorrt-1.3.0a0-cp38-cp38-linux_x86_64.whl torchserve==0.10.0 torchtext @ git+https://github.com/pytorch/text@fae8e8cabf7adcbbc2f09c0520216288fd53f33b torchvision @ file:///opt/pytorch/vision tornado==6.1 tqdm==4.64.1 traitlets==5.7.1 transformer-engine @ git+https://github.com/NVIDIA/TransformerEngine.git@73166c4e3f6cf0e754045ba22ff461ef96453aeb treelite @ file:///rapids/treelite-2.4.0-py3-none-manylinux2014_x86_64.whl treelite-runtime @ file:///rapids/treelite_runtime-2.4.0-py3-none-manylinux2014_x86_64.whl typer==0.7.0 types-python-dateutil==2.9.0.20240316 typing_extensions==4.11.0 ucx-py @ file:///rapids/ucx_py-0.27.0a0%2B29.ge9e81f8-cp38-cp38-linux_x86_64.whl uff @ file:///workspace/TensorRT-8.5.1.7/uff/uff-0.6.9-py2.py3-none-any.whl urllib3==1.26.13 uvicorn==0.20.0 uvloop==0.19.0 wasabi==0.10.1 watchfiles==0.21.0 wcwidth==0.2.5 webencodings==0.5.1 websockets==12.0 Werkzeug==3.0.2 xdoctest==1.0.2 xgboost @ file:///rapids/xgboost-1.6.2-cp38-cp38-linux_x86_64.whl yarl==1.9.4 zict==2.2.0 zipp==3.11.0

Repro instructions

Unfortunately, I can't find a way to reproduce the error, it randomly appears every 1-5 days.

Possible Solution

There are a few things about this issue that are a bit odd.

I've run out of ideas; any thoughts or feedback would be much appreciated.

mreso commented 7 months ago

Hi @emilwallner, thanks for the extensive issue report.

My thoughts on this are:

  1. You're looking at the server after the crash, right? Meaning that the worker process has died and been restarted, and thus memory is back to normal.
  2. I can't find the line from your stack trace in your code, but I assume it's basically the next line after your snippet. detach() does not create a copy of the data, so you should still only have a single batch on the device.
  3. You're resizing the images with a resolution coming from the requests and then re-resizing the tensor in preprocess_and_stack_images to (3,768,768). Then you're stacking them along the channel dimension, creating e.g. (6,768,768), before you add a batch dimension with unsqueeze. Not sure about your model, but maybe it does something funky when it gets (1,6,768,768) instead of (2,3,768,768) (see the shape sketch below this list).
  4. What is your batch size? Did you try using batch_size=1 for some time?
  5. In the video there are multiple processes on the GPU; do you use multiple workers for the same model?
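
For reference, a quick shape check (purely illustrative, run in isolation) of the two layouts mentioned in point 3:

    import torch

    imgs = [torch.zeros(3, 768, 768) for _ in range(2)]
    print(torch.stack(imgs, dim=0).shape)  # torch.Size([2, 3, 768, 768])
    print(torch.cat(imgs, dim=0).shape)    # torch.Size([6, 768, 768])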

That's all I have for now, but happy to continue spitballing and iterating over this until you find a solution!

Best Matthias

emilwallner commented 7 months ago

Really, really appreciate your input, @mreso!

  1. The worker crashes and returns 507 and doesn't recover.
  2. Yeah, I added detach() to make sure requires_grad is set to False.
  3. Yeah, that could be it.
  4. I switched the batch size to 1 following your suggestion. I also check that the batch has the correct type and final shape.
  5. Yes, multiple workers per model.

I also realized CUDA_LAUNCH_BLOCKING=1 reduces performance by about 70%, so I'll turn it off for now.

Here's my updated check:

    def preprocess_and_stack_images(self, images):
        preprocessed_images = []

        for i, img in enumerate(images):
            try:
                preprocessed_img = self.resize_tensor(img)

                if preprocessed_img.shape != (3, 768, 768) or preprocessed_img.min() < 0 or preprocessed_img.max() > 1 or preprocessed_img.dtype != torch.float32:
                    # Log information about the image that doesn't meet the requirements
                    logger.info(f"Image {i} does not meet the requirements. Replacing with a blank image.")
                    preprocessed_img = torch.zeros((3, 768, 768))
            except Exception as e:
                # Log the error message and load a blank image
                logger.error(f"Error occurred while processing Image {i}: {str(e)}. Loading a blank image.")
                preprocessed_img = torch.zeros((3, 768, 768))

            preprocessed_images.append(preprocessed_img)

        images_batch = torch.stack(preprocessed_images, dim=0)

        if len(images_batch.shape) == 3:
            images_batch = images_batch.unsqueeze(0)

        # Second test: Check if the size is (1, 3, 768, 768)
        if images_batch.shape != (1, 3, 768, 768):
            # Log information about the batch that doesn't meet the requirements
            logger.info(f"Batch size {images_batch.shape} does not match the required shape (1, 3, 768, 768). Replacing with a blank batch.")
            images_batch = torch.zeros((1, 3, 768, 768))

        return images_batch

Again, really appreciate the brainstorming; let's keep at it until we crack this!

mreso commented 6 months ago

Yeah, performance will suffer significantly from CUDA_LAUNCH_BLOCKING as kernels will not run asynchronously, so only activate it when it's really necessary for debugging.

You could try to run the model in a notebook with a (1,6,768,768) input and observe the memory usage compared to (2,3,768,768). I'm wondering why this actually seems to work in the first place.
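
Something along these lines (a rough sketch; load_model() here is just a stand-in for however you build the model):

    import torch

    def peak_memory_mb(model, shape, device="cuda"):
        # Reset the peak-memory counter, run one forward pass, and report the
        # peak allocation in MB for the given input shape.
        torch.cuda.reset_peak_memory_stats(device)
        x = torch.zeros(shape, device=device)
        with torch.no_grad():
            model(x)
        torch.cuda.synchronize(device)
        return torch.cuda.max_memory_allocated(device) / 1024**2

    # model = load_model().to("cuda").eval()  # load_model() is a placeholder
    # print(peak_memory_mb(model, (1, 6, 768, 768)))
    # print(peak_memory_mb(model, (2, 3, 768, 768)))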

emilwallner commented 6 months ago

I haven't tried the (1,6,768,768) input yet, but since our model is based on three channels, it should throw an error during execution.

Now I double-check the size (1,3,768,768) and dtype, and ensure the values are in the correct range. Despite that, I'm still hitting a CUDA error: device-side assert triggered when moving the batch with images_batch = images_batch.to(self.device).detach().

Got any more suggestions on what might be causing this?

ptrblck commented 6 months ago

Cross-post from here with a stacktrace pointing to a real indexing error.
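
For context, a minimal standalone illustration (not taken from the handler above) of how an out-of-range embedding index triggers this kind of asynchronous device-side assert:

    import torch

    # An index >= num_embeddings fails a device-side assert; because CUDA kernels
    # run asynchronously, the error often only surfaces at a later synchronizing
    # call (e.g. a .to(device) transfer), far from the real culprit.
    emb = torch.nn.Embedding(10, 4).cuda()
    bad_idx = torch.tensor([[3, 12]], device="cuda")  # 12 is out of range
    out = emb(bad_idx)
    torch.cuda.synchronize()  # "CUDA error: device-side assert triggered"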

vhiwase commented 2 months ago

Please check the {{management_address}}/models/ endpoint and monitor the following:

    "jobQueueStatus": {
        "remainingCapacity": 100,
        "pendingRequests": 0
    }

I found this issue appears randomly when pendingRequests does not increase.
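
For example, something like this rough polling sketch (the address, port, and model name are taken from the config earlier in this thread; the exact JSON layout may differ between TorchServe versions):

    import time

    import requests

    # Poll the management API and print the job queue status for one model.
    while True:
        resp = requests.get("http://0.0.0.0:8511/models/palette_caption").json()
        queue = resp[0].get("jobQueueStatus", {})
        print(queue.get("pendingRequests"), queue.get("remainingCapacity"))
        time.sleep(30)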