jeremyfowers opened this issue 1 year ago
The decision to use containers for all benchmarking was made because containers promise to include all required dependencies, so the user does not have to worry about version management. For TensorRT to function, you need compatible versions of CUDA, cuDNN, and the driver, as stated in Nvidia's support matrix. The official Nvidia TensorRT container we use comes packaged with the right versions of CUDA and cuDNN, which is great. However, the driver is a kernel-mode component, so it cannot ship inside the container; it must be installed on the host system.

So far, all of the systems we had tested this feature on happened to have the correct drivers, except for the T4 system Jeremy used and the T4 system I found on GCP. Once I updated the driver version, everything worked as expected.

Ideally, the TensorRT container should report this error instead of just crashing. The fix on our end should be to read the driver version and report a proper error telling the user to update the driver. I will add this to the issue.
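A minimal sketch of what that pre-flight check could look like, assuming we shell out to `nvidia-smi` on the host before launching the container. The function names and the `MIN_DRIVER_VERSION` value are illustrative placeholders, not part of the codebase; the real minimum should come from Nvidia's TensorRT/CUDA support matrix.

```python
import subprocess

# Placeholder minimum; the real value should be taken from Nvidia's support matrix
MIN_DRIVER_VERSION = (525, 60)


def get_host_driver_version():
    """Query the installed NVIDIA driver version via nvidia-smi."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    # e.g. "535.104.05" -> (535, 104)
    major, minor = output.strip().splitlines()[0].split(".")[:2]
    return int(major), int(minor)


def check_driver_or_raise():
    """Fail with an actionable message instead of letting the TRT container crash."""
    try:
        version = get_host_driver_version()
    except (OSError, subprocess.CalledProcessError) as e:
        raise RuntimeError(
            "Could not query the NVIDIA driver with nvidia-smi. "
            "Is a driver installed on the host?"
        ) from e
    if version < MIN_DRIVER_VERSION:
        raise RuntimeError(
            f"Host NVIDIA driver {version} is older than the minimum "
            f"{MIN_DRIVER_VERSION} required by the TensorRT container. "
            "Please update the host driver."
        )
```

The check would run on the host, before the container is started, since the driver is the one dependency the container cannot provide.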
Follow the steps here to update the drivers.
Trying to run any GPU benchmark on the head of main on GCP or Azure yields an error like this:
GPU benchmarking is known to work correctly on commit e250ac7502c43d24045688b3393874ea9ee4c364