tobernat opened this issue 9 months ago
Running into the exact same issue
In our case it was solved by setting vGPU plugin parameters in VMware: https://docs.nvidia.com/grid/13.0/grid-vgpu-user-guide/index.html#setting-vgpu-plugin-parameters-on-vmware-vsphere (see also https://kb.vmware.com/s/article/2142307).
We're seeing this on Azure with A10 GPUs. Does MIG prevent using this CUDA function?
Facing similar issues as well, using an L40 vGPU (l40_48c profile) running on a VMware Ubuntu 22.04 VM.
In our case, it was solved as well by setting the right VM params: https://docs.nvidia.com/grid/13.0/grid-vgpu-user-guide/index.html#setting-vgpu-plugin-parameters-on-vmware-vsphere
May I know which VM params you changed to fix this? I've set pciPassthru.use64bitMMIO to TRUE and pciPassthru.64bitMMIOSizeGB to 128, and it still didn't work.
Same issue on A40 using vGPU on ESXi 7. Can someone let us know which parameter fixes it? The two mentioned by @yxchia98 are not enough. @vhojan @tobernat
I guess you need to set enable_uvm to 1? (edit: worked for me)
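For reference, these are advanced VM configuration entries set while the VM is powered off. Here is a sketch of the .vmx entries mentioned in this thread, assuming the vGPU is passthrough device 0 (adjust the pciPassthru0 index and confirm the parameter names against the NVIDIA and VMware docs linked above):

```
# 64-bit MMIO settings from VMware KB 2142307
pciPassthru.use64bitMMIO = "TRUE"
pciPassthru.64bitMMIOSizeGB = "128"

# vGPU plugin parameter (NVIDIA vGPU user guide): enable unified memory for this vGPU
pciPassthru0.cfg.enable_uvm = "1"
```

The pciPassthruN.cfg. prefix is how vGPU plugin parameters are passed on vSphere; the two MMIO entries come from the VMware KB article.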
Setting enable_uvm to 1 wasn't sufficient in our case. Does anyone have another solution?
Setting enable_uvm to 1 wasn't sufficient in my case either.
Description
When trying to deploy a Hugging Face model in Triton Server with tensorrtllm_backend, after successfully converting it with TensorRT-LLM (i.e. inference with the model engines works in the TRT-LLM container), I always get a CUDA runtime error in cudaDeviceGetDefaultMemPool.
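The failing call can be checked in isolation with a minimal standalone program along these lines (a sketch, assuming device 0 and a CUDA toolkit with nvcc available):

```cpp
// check_mempool.cu -- standalone sketch: queries whether device 0 reports
// memory-pool support and then calls cudaDeviceGetDefaultMemPool, the call
// that fails inside the backend.
// Build: nvcc check_mempool.cu -o check_mempool
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0;
    int poolsSupported = 0;
    cudaError_t err = cudaDeviceGetAttribute(&poolsSupported,
                                             cudaDevAttrMemoryPoolsSupported, dev);
    if (err != cudaSuccess) {
        std::printf("cudaDeviceGetAttribute failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("cudaDevAttrMemoryPoolsSupported = %d\n", poolsSupported);

    cudaMemPool_t pool;
    err = cudaDeviceGetDefaultMemPool(&pool, dev);
    std::printf("cudaDeviceGetDefaultMemPool -> %s\n", cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

If this reports that memory pools are not supported, the problem is at the vGPU/driver layer rather than in Triton or the backend.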
System
Ubuntu 22.04.4 LTS, Driver Version: 535.104.05, CUDA Version: 12.2
Triton Information
nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3. Tried both the container from NGC and building it from the tensorrtllm_backend repo; same behavior. I also tried newer versions and different driver/CUDA versions, always with the same behavior.
To Reproduce
I followed this exactly: https://developer.nvidia.com/blog/optimizing-inference-on-llms-with-tensorrt-llm-now-publicly-available/ with Llama-2-13b-chat-hf, using the configuration etc. from the tutorial. Running the model engines in the TensorRT-LLM container works fine (I can see activity on all 4 GPUs when I call nvidia-smi during inference). When I try to run Triton Server I get the following error:
Expected behavior
Something like this:
Actual behavior