microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

CUDAExecutionProvider doesn't seem to be used during inference of a transformers model exported to ONNX Runtime GPU #22325

Open cooper-a opened 2 weeks ago

cooper-a commented 2 weeks ago

Describe the issue

We are seeing an issue with a Transformer model that was exported with torch.onnx.export and then optimized with optimum's ORTOptimizer: inference appears to use only the CPU and not the GPU.

The model was exported on a CPU-only machine using ONNX 1.16.0. We see the following logs when starting the inference session:


```
[transformer_memcpy.cc:74 ApplyImpl] 36 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-10-04 20:35:10.514629537 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
```

### To reproduce

The model was deployed in an Ubuntu 20.04 Docker container on the Azure [NCasT4_v3](https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/ncast4v3-series?tabs=sizebasic) SKU with the following versions:

onnxruntime-gpu 1.18.0
CUDA Version: 11.8
cuDNN Version 8.9.6.50
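
The export and optimization followed the standard torch.onnx.export + optimum ORTOptimizer flow. A simplified sketch of that pipeline (the model class, paths, shapes, and opset below are illustrative placeholders, not our exact script):

```python
import torch
from transformers import DebertaV2ForSequenceClassification
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Export the PyTorch model to ONNX (this was done on a CPU-only machine).
model = DebertaV2ForSequenceClassification.from_pretrained("model_root")  # illustrative path
model.eval()
dummy_input_ids = torch.ones(1, 128, dtype=torch.long)
dummy_attention_mask = torch.ones(1, 128, dtype=torch.long)
torch.onnx.export(
    model,
    (dummy_input_ids, dummy_attention_mask),
    "model_root/model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
    },
    opset_version=17,
)

# Graph-level optimization with optimum's ORTOptimizer.
ort_model = ORTModelForSequenceClassification.from_pretrained("model_root", file_name="model.onnx")
optimizer = ORTOptimizer.from_pretrained(ort_model)
optimizer.optimize(
    save_dir="model_root_optimized",
    optimization_config=OptimizationConfig(optimization_level=2),
)
```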

**Inferencing Code**

```python
import os

import torch
import onnxruntime as ort
from transformers import DebertaV2Tokenizer


class UnifiedModelOnnx(BaseModel):
    def __init__(self, model_root, gpu_mem_limit=12, device=None):
        self.model_path = os.path.join(model_root, "model.onnx")

        self.tokenizer = DebertaV2Tokenizer.from_pretrained(model_root)
        self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") if device is None else device
        print(f"Using device {self.device}")
        providers = [
            (
                "CUDAExecutionProvider",
                {
                    "device_id": 0,
                    "arena_extend_strategy": "kNextPowerOfTwo",
                    "gpu_mem_limit": gpu_mem_limit * 1024 * 1024 * 1024,
                    "cudnn_conv_algo_search": "EXHAUSTIVE",
                    "do_copy_in_default_stream": True,
                },
            )
        ]
        ort_session_options = ort.SessionOptions()
        ort_session_options.enable_cpu_mem_arena = False
        self.ort_session = ort.InferenceSession(
            self.model_path, sess_options=ort_session_options, providers=providers
        )
        # add_run_config_entry() mutates the RunOptions in place, so keep the object itself.
        self.run_config = ort.RunOptions()
        self.run_config.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")
```
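
At inference time, the run options would then be passed along with the input feed. A hypothetical `predict` method, assuming the exported graph's input names match the tokenizer's output keys:

```python
# Hypothetical method on UnifiedModelOnnx (not part of the snippet above); assumes the
# graph's input names match the tokenizer's output keys (input_ids, attention_mask, ...).
def predict(self, text):
    feed = dict(self.tokenizer(text, return_tensors="np"))
    return self.ort_session.run(None, feed, run_options=self.run_config)
```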

We also ran `ort.get_available_providers()`, which shows: `['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'AzureExecutionProvider', 'CPUExecutionProvider']`

And `self.ort_session.get_providers()` returns: `['CUDAExecutionProvider', 'CPUExecutionProvider']`
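
For completeness, the per-provider options the session actually registered can be dumped with `get_provider_options()`:

```python
# Optional sanity check: prints the options the session registered for each provider,
# e.g. {'CUDAExecutionProvider': {'device_id': '0', ...}, 'CPUExecutionProvider': {}}.
print(self.ort_session.get_provider_options())
```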

### Urgency

Internal Microsoft team; contact aliases: cooperang@microsoft.com or svillaveces@microsoft.com

### Platform

Linux

### OS Version

Ubuntu 20.04

### ONNX Runtime Installation

Released Package

### ONNX Runtime Version or Commit ID

1.18.0

### ONNX Runtime API

Python

### Architecture

X86

### Execution Provider

CUDA

### Execution Provider Library Version

CUDA Version: 11.8, cuDNN Version: 8.9.6.50

tianleiwu commented 2 weeks ago

Memcpy nodes copy data between devices (like GPU and CPU), so the session is using both GPU and CPU.

Could you share some information to help reproduce this (e.g. the transformers/optimum/pytorch versions, and the Python script or command line used for the ONNX export and optimization)?
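
For example, the exact package versions could be collected with something like:

```python
# One way to collect the exact package versions (distribution names as published on PyPI).
from importlib.metadata import version

for pkg in ("torch", "transformers", "optimum", "onnx", "onnxruntime-gpu"):
    print(pkg, version(pkg))
```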