groq / mlagility

Machine Learning Agility (MLAgility) benchmark and benchmarking tools
MIT License

Check NVIDIA driver version to avoid compatibility issues #288

Open jeremyfowers opened 1 year ago

jeremyfowers commented 1 year ago

Running any GPU benchmark at the head of main on GCP or Azure yields an error like this:

azureuser@mla-gpu-test:~$ miniconda3/bin/conda run -n mla benchit mlagility/models/selftest/linear.py --device nvidia

Models discovered during profiling:

linear.py:
    model (executed 1x)
        Model Type: Pytorch (torch.nn.Module)
        Class:      LinearTestModel (<class 'linear.LinearTestModel'>)
        Location:   /home/azureuser/mlagility/models/selftest/linear.py, line 21
        Parameters: 110 (<0.1 MB)
        Hash:       d5b1df11
        Status:     Unknown benchit error: 'Total Latency'
        Traceback (most recent call last):
          File "/home/azureuser/mlagility/src/mlagility/analysis/analysis.py", line 133, in call_benchit
            perf = benchmark_model(
          File "/home/azureuser/mlagility/src/mlagility/api/model_api.py", line 145, in benchmark_model
            perf = gpu_model.benchmark(backend=backend)
          File "/home/azureuser/mlagility/src/mlagility/api/trtmodel.py", line 21, in benchmark
            benchmark_results = self._execute(repetitions=repetitions, backend=backend)
          File "/home/azureuser/mlagility/src/mlagility/api/trtmodel.py", line 84, in _execute
            mean_latency=self.mean_latency,
          File "/home/azureuser/mlagility/src/mlagility/api/trtmodel.py", line 43, in mean_latency
            return float(self._get_stat("Total Latency")["mean "].split(" ")[1])
          File "/home/azureuser/mlagility/src/mlagility/api/trtmodel.py", line 34, in _get_stat
            return performance[stat]
        KeyError: 'Total Latency'

GPU benchmarking is known to work correctly on commit e250ac7502c43d24045688b3393874ea9ee4c364

ramkrishna2910 commented 1 year ago

The decision to use containers for all benchmarking was driven by their promise of bundling all required dependencies, so the user would not have to worry about version management. For TensorRT to function, you need compatible versions of CUDA, cuDNN, and the NVIDIA driver, as stated in this compatibility matrix. The official NVIDIA TensorRT container we use ships with the right versions of CUDA and cuDNN, great! But since the driver is a kernel-mode component, it cannot come with the container; it must be installed on the host system. So far, every system we had tested this feature on happened to have the correct driver, except for the T4 system Jeremy used and the T4 system I found on GCP. Once I updated the driver version, everything worked as expected.

Ideally, the TRT container should report this error instead of just crashing. The fix on our end should be to read the driver version and report a proper error telling the user to update the driver. I will add this to the issue.

Follow the steps here to update the drivers.
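A minimal sketch of what such a check could look like, assuming we shell out to nvidia-smi on the host before launching the TRT container. The function names and the minimum version constant are placeholders, not part of the mlagility codebase; the real minimum would come from the NVIDIA driver/CUDA compatibility matrix for the specific TensorRT container we pull.

```python
# Hypothetical driver-version check: fail early with an actionable message
# instead of letting the TensorRT container crash with an opaque KeyError.
import subprocess

# Placeholder minimum; the real value depends on the TensorRT container's
# CUDA version per NVIDIA's compatibility matrix.
MIN_DRIVER_VERSION = (525, 60)


def host_driver_version():
    """Return the host NVIDIA driver version reported by nvidia-smi as a tuple of ints."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    )
    # e.g. "470.82.01" -> (470, 82, 1)
    return tuple(int(part) for part in output.strip().splitlines()[0].split("."))


def check_driver_or_raise():
    """Raise a descriptive error if the host driver is missing or too old."""
    try:
        version = host_driver_version()
    except (OSError, subprocess.CalledProcessError) as exc:
        raise RuntimeError(
            "nvidia-smi is not available; an NVIDIA driver must be installed "
            "on the host to run GPU benchmarks."
        ) from exc
    if version < MIN_DRIVER_VERSION:
        raise RuntimeError(
            f"NVIDIA driver {'.'.join(map(str, version))} is older than the minimum "
            f"required by the TensorRT container ({'.'.join(map(str, MIN_DRIVER_VERSION))}). "
            "Please update the host driver."
        )
```

Calling something like check_driver_or_raise() before benchmark_model() dispatches to the GPU path would turn the current 'Total Latency' KeyError into a clear "update your driver" message.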