gpuopenanalytics / pynvml

Provide Python access to the NVML library for GPU diagnostics
BSD 3-Clause "New" or "Revised" License

[pytest] nvlink related tests fail on a machine without nvlink #24

Open ksangeek opened 4 years ago

ksangeek commented 4 years ago

Describe the bug

The tests for the NVLink-related APIs, e.g. test_nvml_nvlink_properties(), fail on a machine without NVLink. The error raised, pynvml.nvml.NVMLError_NotSupported, makes it clear that the failure is caused by the absence of NVLink. I am opening this issue to ask whether there is a better way to handle such machines in the tests, or whether that is too much work to be worthwhile.

Steps/Code to reproduce bug

pytest reports failures like the following for the NVLink-related test cases:

__________________________________ test_nvml_nvlink_properties ___________________________________

ngpus = 2
handles = [<pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7f2f299bfc80>, <pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7f2f299bfb70>]

    def test_nvml_nvlink_properties(ngpus, handles):
        for i in range(ngpus):
            for j in range(pynvml.NVML_NVLINK_MAX_LINKS):
>               version = pynvml.nvmlDeviceGetNvLinkVersion(handles[i], j)

test_nvml.py:238:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../../../anaconda3/envs/pynvml_py36/lib/python3.6/site-packages/pynvml/nvml.py:2021: in nvmlDeviceGetNvLinkVersion
    check_return(ret)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

ret = 3

    def check_return(ret):
        if (ret != NVML_SUCCESS):
>           raise NVMLError(ret)
E           pynvml.nvml.NVMLError_NotSupported: Not Supported

../../../../anaconda3/envs/pynvml_py36/lib/python3.6/site-packages/pynvml/nvml.py:366: NVMLError_NotSupported
------------------------------------- Captured stdout setup --------------------------------------
[2 GPUs]

rjzamora commented 4 years ago

Thanks for raising this, @ksangeek. You are correct that the test suite assumes NVLink is supported on the machine being queried. It certainly makes sense to skip tests that do not apply to the target machine (especially if/when we start introducing CI).

For NVLink, we can probably call nvmlDeviceGetNvLinkVersion on the 0th device within a module-level fixture, and catch NVMLError_NotSupported to detect that NVLink is not supported.
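
A minimal sketch of that idea, not the project's actual fix: the helper name nvlink_supported, its parameters, and the require_nvlink fixture in the trailing comment are all hypothetical. It assumes that probing link 0 of one device is enough to tell whether NVLink is available, and that NVML has already been initialized when it is called. The pynvml module is passed in as an argument so the helper can be exercised without a GPU.

```python
def nvlink_supported(nvml, handle, link=0):
    """Return True if an NVLink query succeeds for `handle`, else False.

    `nvml` is the pynvml module (or any object exposing the same names).
    Only NVMLError_NotSupported is treated as "no NVLink"; any other
    NVML error propagates to the caller.
    """
    try:
        nvml.nvmlDeviceGetNvLinkVersion(handle, link)
    except nvml.NVMLError_NotSupported:
        # Same error as in the traceback above: the query is not supported.
        return False
    return True


# Hypothetical use in the test module (needs pytest, pynvml, and an
# initialized NVML; `require_nvlink` is an illustrative name):
#
# @pytest.fixture(scope="module")
# def require_nvlink():
#     handle = pynvml.nvmlDeviceGetHandleByIndex(0)
#     if not nvlink_supported(pynvml, handle):
#         pytest.skip("NVLink is not supported on this machine")
```

NVLink tests such as test_nvml_nvlink_properties could then request the fixture and would be reported as skipped rather than failed on machines without NVLink.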