pytorch / serve

Serve, optimize and scale PyTorch models in production
https://pytorch.org/serve/

Memory usage only on 1 GPU out of 4 when loading torch-tensorrt model #2572

Open · sachanub opened this issue 1 year ago

sachanub commented 1 year ago

🐛 Describe the bug

I am following this example to perform inference on TorchServe with a torch-tensorrt model: https://github.com/pytorch/serve/tree/master/examples/torch_tensorrt
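
For reference, the model in that example is a torchvision ResNet-50 compiled into a TorchScript Torch-TensorRT module, roughly along these lines (a minimal sketch of the compilation step; the exact script and file names in the example may differ):

import torch
import torch_tensorrt
import torchvision.models as models

# Load a pretrained ResNet-50 and compile it to a Torch-TensorRT TorchScript module
model = models.resnet50(pretrained=True).eval().to("cuda")
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.float32)],
    enabled_precisions={torch.half},  # build the engine with FP16 kernels
)
# The saved module is what gets packaged into the .mar archive
torch.jit.save(trt_model, "res50_trt_fp16.ts")  # placeholder file name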

I am using a custom container (adapted from an existing TorchServe container) which has the following:

  1. torchserve (version 0.8.2)
  2. torch-model-archiver (version 0.8.2)
  3. tensorrt (version 8.5.3.1)
  4. torch_tensorrt (version 1.4.0)
  5. cuDNN (version 8.9.3.28)
  6. CUDA 11.7
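
Inside the container, the stack can be sanity-checked with a small snippet like the one below (my own helper, not part of the example):

import torch
import torchvision
import torch_tensorrt
import tensorrt

# Quick sanity check of the GPU stack inside the custom container
print("torch:", torch.__version__)                    # expected 2.0.1
print("torchvision:", torchvision.__version__)        # expected 0.15.2
print("torch_tensorrt:", torch_tensorrt.__version__)  # expected 1.4.0
print("tensorrt:", tensorrt.__version__)              # expected 8.5.3.1
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())        # expected 4 on this instance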

I am running this example on a g5.24xlarge EC2 instance (4 NVIDIA A10G GPUs). The expectation is that the model is loaded on all 4 GPUs, with one worker per GPU. After starting TorchServe, the model loads successfully and I get the following inference output:

curl: /opt/conda/lib/libcurl.so.4: no version information available (required by curl)
{
  "tabby": 0.2723647356033325,
  "tiger_cat": 0.13748960196971893,
  "Egyptian_cat": 0.04659610986709595,
  "lynx": 0.00318642589263618,
  "lens_cap": 0.00224193069152534
}
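
The output above is from the classification request in the example, i.e. something along these lines (sample image path as in the serve repo):

curl http://127.0.0.1:8080/predictions/res50-trt-fp16 -T examples/image_classifier/kitten.jpg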

When I run curl -X GET http://localhost:8081/models/res50-trt-fp16, I get the following output:

curl: /opt/conda/lib/libcurl.so.4: no version information available (required by curl)
[
  {
    "modelName": "res50-trt-fp16",
    "modelVersion": "1.0",
    "modelUrl": "res50-trt-fp16.mar",
    "runtime": "python",
    "minWorkers": 4,
    "maxWorkers": 4,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "loadedAtStartup": true,
    "workers": [
      {
        "id": "9000",
        "startTime": "2023-09-06T06:01:00.920Z",
        "status": "READY",
        "memoryUsage": 0,
        "pid": 178,
        "gpu": true,
        "gpuUsage": "gpuId::1 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::5 MiB"
      },
      {
        "id": "9001",
        "startTime": "2023-09-06T06:01:00.921Z",
        "status": "READY",
        "memoryUsage": 0,
        "pid": 175,
        "gpu": true,
        "gpuUsage": "gpuId::2 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::5 MiB"
      },
      {
        "id": "9002",
        "startTime": "2023-09-06T06:01:00.921Z",
        "status": "READY",
        "memoryUsage": 0,
        "pid": 177,
        "gpu": true,
        "gpuUsage": "gpuId::3 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::5 MiB"
      },
      {
        "id": "9003",
        "startTime": "2023-09-06T06:01:00.921Z",
        "status": "READY",
        "memoryUsage": 0,
        "pid": 176,
        "gpu": true,
        "gpuUsage": "gpuId::0 utilization.gpu [%]::0 % utilization.memory [%]::0 % memory.used [MiB]::2152 MiB"
      }
    ],
    "jobQueueStatus": {
      "remainingCapacity": 100,
      "pendingRequests": 0
    }
  }
]

From the above output, it appears that a worker has been created on each GPU; however, the memory.used field is only 5 MiB for every GPU except the one used by worker 9003 (GPU 0, which reports memory.used = 2152 MiB).

Running nvidia-smi leads to the following output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1B.0 Off |                    0 |
|  0%   28C    P0              60W / 300W |   2152MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A10G                    On  | 00000000:00:1C.0 Off |                    0 |
|  0%   23C    P8              16W / 300W |      5MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A10G                    On  | 00000000:00:1D.0 Off |                    0 |
|  0%   23C    P8              16W / 300W |      5MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   22C    P8              16W / 300W |      5MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

As can be seen above, memory usage is 5 MiB on every GPU except GPU 0, which is using 2152 MiB.
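
For a more compact view of the same numbers while the workers are up, per-GPU memory can also be polled with nvidia-smi's query mode:

nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv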

Also, when I send an inference request, I see the following in the model server logs:

2023-09-06T06:10:25,671 [WARN ] W-9000-res50-trt-fp16_1.0-stderr MODEL_LOG - WARNING: [Torch-TensorRT] - Input 0 of engine __torch___torchvision_models_resnet_ResNet_trt_engine_ was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.

The warning suggests that although the inference request is dispatched to the worker assigned to GPU 1, the input is moved to GPU 0 at runtime, which further suggests that the model is actually loaded on only 1 GPU rather than all 4. Could you please look into this issue? Thanks!
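
One way to narrow this down outside of TorchServe (a diagnostic sketch only, assuming the serialized module is the res50_trt_fp16.ts placeholder from the sketch above) is to load the TorchScript module onto a non-zero GPU and run an input there; if device-wide memory still grows on GPU 0, the engine itself is pinned to the device it was compiled on:

import torch
import torch_tensorrt  # registers the TensorRT engine ops required by torch.jit.load

# Try to place the compiled module on GPU 1 and run it there
device = torch.device("cuda:1")
model = torch.jit.load("res50_trt_fp16.ts", map_location=device)
x = torch.randn(1, 3, 224, 224, dtype=torch.float32, device=device)

with torch.no_grad():
    model(x)

# mem_get_info reports device-wide usage, so TensorRT allocations are included
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i} used: {(total - free) / 1024**2:.0f} MiB")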

Error logs

Pasted relevant logs above.

Installation instructions

Provided relevant information above.

Model Packaging

Followed this example: https://github.com/pytorch/serve/tree/master/examples/torch_tensorrt

config.properties

No response

Versions

------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch: 

torchserve==0.8.2
torch-model-archiver==0.8.2

Python version: 3.9 (64-bit runtime)
Python executable: /opt/conda/bin/python

Versions of relevant python libraries:
captum==0.6.0
numpy==1.22.4
nvgpu==0.10.0
psutil==5.9.5
requests==2.31.0
torch==2.0.1
torch-model-archiver==0.8.2
torch-tensorrt==1.4.0
torchaudio==2.0.2
torchdata==0.5.1
torchserve==0.8.2
torchvision==0.15.2
wheel==0.40.0
torch==2.0.1
**Warning: torchtext not present ..
torchvision==0.15.2
torchaudio==2.0.2

Java Version:

OS: Ubuntu 20.04.6 LTS
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: N/A
CMake version: version 3.27.0

Is CUDA available: Yes
CUDA runtime version: 11.7.99
GPU models and configuration: 
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G
GPU 2: NVIDIA A10G
GPU 3: NVIDIA A10G
Nvidia driver version: 535.54.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.3
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.3

Repro instructions

To reproduce, please follow this example: https://github.com/pytorch/serve/tree/master/examples/torch_tensorrt
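
In condensed form, the repro is roughly the following (script name and paths are placeholders; the exact steps are in the example README):

# 1. Compile ResNet-50 to a Torch-TensorRT TorchScript module (see the sketch above)
python compile_res50_trt.py   # placeholder name; produces res50_trt_fp16.ts

# 2. Package it with the built-in image_classifier handler
#    (index_to_name.json is the class-name mapping shipped with the image_classifier examples)
torch-model-archiver --model-name res50-trt-fp16 --version 1.0 \
  --serialized-file res50_trt_fp16.ts --handler image_classifier \
  --extra-files index_to_name.json --export-path model_store

# 3. Start TorchServe on a 4-GPU host and inspect worker placement
torchserve --start --ncs --model-store model_store --models res50-trt-fp16.mar
curl -X GET http://localhost:8081/models/res50-trt-fp16
nvidia-smi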

Possible Solution

No response

agunapal commented 1 year ago

Thanks for reporting. This is interesting. I haven't tried this before. I will try it out and get back to you.

sachanub commented 1 year ago

Thanks @agunapal. Please let me know if you require any more information from my end.

agunapal commented 1 year ago

Hi @sachanub, this is an issue in Torch-TensorRT: https://github.com/pytorch/TensorRT/issues/2319 (cc: @lxning)
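
For anyone else hitting this while the upstream issue is open, one thing worth experimenting with (unverified; a sketch of a hypothetical custom handler, not a confirmed workaround) is making the worker's assigned GPU the current CUDA device before the serialized module is deserialized, then comparing where the memory actually lands:

import torch
import torch_tensorrt  # noqa: F401  (registers the TensorRT engine ops)
from ts.torch_handler.image_classifier import ImageClassifier


class TrtImageClassifier(ImageClassifier):
    """Hypothetical handler: pin the current CUDA device to the worker's GPU before loading."""

    def initialize(self, context):
        gpu_id = context.system_properties.get("gpu_id")
        if gpu_id is not None and torch.cuda.is_available():
            # Make the worker's GPU current before torch.jit.load runs in the base handler
            torch.cuda.set_device(int(gpu_id))
        super().initialize(context)

Whether this changes where the engine ends up depends on the Torch-TensorRT runtime behaviour tracked in the linked issue.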