mlcommons / ck

Collective Mind (CM) is a small, modular, cross-platform and decentralized workflow automation framework with a human-friendly interface and reusable automation recipes to make it easier to build, run, benchmark and optimize applications and systems across diverse models, data sets, software and hardware
https://cKnowledge.org/install-cm-mlops
Apache License 2.0

Issue with get-cuda-devices on 4x RTX 6000 Ada #916

Closed WarrenSchultz closed 8 months ago

WarrenSchultz commented 1 year ago

Now that I've gotten the prereqs working, I've run into another issue. Trying to run a simple BERT-99 test, the get-cuda-devices section failed, so I ran it independently and got the same result. Thoughts? What other debug logs can I dig up, if that would help?

This is an Intel-based workstation with 4x RTX 6000 Ada Generation cards. Drivers are up to date and match the versions on the other systems I've tested this on, and I followed the same procedure. nvidia-smi output:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.103                Driver Version: 537.13       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    On  | 00000000:47:00.0 Off |                    0 |
| 30%   44C    P8              13W / 300W |    870MiB / 46068MiB |     10%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 6000 Ada Gene...    On  | 00000000:5E:00.0 Off |                    0 |
| 30%   40C    P8               2W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX 6000 Ada Gene...    On  | 00000000:75:00.0 Off |                    0 |
| 30%   40C    P8              12W / 300W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX 6000 Ada Gene...    On  | 00000000:A3:00.0  On |                    0 |
| 30%   49C    P8              13W / 300W |    156MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Output

* cm run script get-cuda-devices
  * cm run script "get cuda _toolkit"
rm: cannot remove 'a.out': No such file or directory

Checking compiler version ...

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

Compiling program ...
Running program ...
========================================================
Print file tmp-run.out:

Error: problem obtaining number of CUDA devices: 2

CM error: Portable CM script failed (name = get-cuda-devices, return code = 256)
arjunsuresh commented 1 year ago

Hi @WarrenSchultz , are both nvidia-smi and cm run script get-cuda-devices being run outside of Docker? Sometimes the NVIDIA driver has issues inside a Docker container, and in that case even nvidia-smi fails. The fix is usually to just exit the container, then docker start CONTAINERID and docker attach CONTAINERID.

WarrenSchultz commented 1 year ago

Neither are running inside a container, just straight from the Ubuntu shell.

arjunsuresh commented 1 year ago

Thanks for your reply. That's a bit strange, because we have never encountered a scenario where nvidia-smi succeeds but get-cuda-devices fails. I'm not able to think of a quick solution, but is a system reboot an option? @gfursin

WarrenSchultz commented 1 year ago

@arjunsuresh I'd rebooted earlier, but checked again now after getting tensorrt installed. Still no luck.

I also tried completely wiping the CM folder and rebuilding it from scratch with the same procedure I used on the other machines.

WarrenSchultz commented 1 year ago

@arjunsuresh Ok, more data. I went into the BIOS and disabled two of the GPUs, and it ran fine. Enabled the third card. Also fine. Enabled the fourth, and it failed again. With the error "Error: problem obtaining number of CUDA devices: 2", could this be a simple counting error in a script somewhere?

When I ran it with 3 GPUs, I saved the output for reference.

Checking compiler version ...

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0

Compiling program ...

Running program ...

* cm run script get-cuda-devices
  * cm run script "get cuda _toolkit"
GPU Device ID: 0
GPU Name: NVIDIA RTX 6000 Ada Generation
GPU compute capability: 8.9
CUDA driver version: 12.2
CUDA runtime version: 12.2
Global memory: 48305274880
Max clock rate: 2505.000000 MHz
Total amount of shared memory per block: 49152
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor:  1536
Maximum number of threads per block: 1024
Max dimension size of a thread block X: 1024
Max dimension size of a thread block Y: 1024
Max dimension size of a thread block Z: 64
Max dimension size of a grid size X: 2147483647
Max dimension size of a grid size Y: 65535
Max dimension size of a grid size Z: 65535

GPU Device ID: 1
GPU Name: NVIDIA RTX 6000 Ada Generation
GPU compute capability: 8.9
CUDA driver version: 12.2
CUDA runtime version: 12.2
Global memory: 48305274880
Max clock rate: 2505.000000 MHz
Total amount of shared memory per block: 49152
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor:  1536
Maximum number of threads per block: 1024
Max dimension size of a thread block X: 1024
Max dimension size of a thread block Y: 1024
Max dimension size of a thread block Z: 64
Max dimension size of a grid size X: 2147483647
Max dimension size of a grid size Y: 65535
Max dimension size of a grid size Z: 65535

GPU Device ID: 2
GPU Name: NVIDIA RTX 6000 Ada Generation
GPU compute capability: 8.9
CUDA driver version: 12.2
CUDA runtime version: 12.2
Global memory: 48305274880
Max clock rate: 2505.000000 MHz
Total amount of shared memory per block: 49152
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor:  1536
Maximum number of threads per block: 1024
Max dimension size of a thread block X: 1024
Max dimension size of a thread block Y: 1024
Max dimension size of a thread block Z: 64
Max dimension size of a grid size X: 2147483647
Max dimension size of a grid size Y: 65535
Max dimension size of a grid size Z: 65535
arjunsuresh commented 1 year ago

We'll check that @WarrenSchultz. But it's unlikely because we have tested MLPerf on a 4 GPU system.

WarrenSchultz commented 1 year ago

> We'll check that @WarrenSchultz. But it's unlikely because we have tested MLPerf on a 4 GPU system.

Thanks. And just to be thorough, I tried switching which was the disabled card, and it didn't affect the outcome, so it's not a card-specific issue.

gfursin commented 1 year ago

Hi @WarrenSchultz. We didn't try the script on machines with 2+ GPUs. It may indeed be a counting issue. Let me check it...

gfursin commented 1 year ago

By the way, this error happens here: https://github.com/mlcommons/ck/blob/master/cm-mlops/script/get-cuda-devices/print_cuda_devices.cu#L19

cudaGetDeviceCount returns error code 2.

I see some related discussion at https://github.com/pytorch/pytorch/issues/40671. However, I don't think we have a driver/CUDA mismatch, since it works when one card is disabled (if I understood correctly).

@WarrenSchultz - maybe you can debug this code on your system to check what happens? Is there a way to print the full error in this C++ code? Thanks a lot for your feedback - much appreciated!

WarrenSchultz commented 1 year ago

> By the way, this error happens here: https://github.com/mlcommons/ck/blob/master/cm-mlops/script/get-cuda-devices/print_cuda_devices.cu#L19
>
> cudaGetDeviceCount returns error code 2.

@gfursin If I'm reading this right, that's a memory issue, which doesn't seem right? https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038

> cudaErrorMemoryAllocation = 2
> The API call failed because it was unable to allocate enough memory or other resources to perform the requested operation.

> I see some related discussion at pytorch/pytorch#40671. However, I don't think we have a driver/CUDA mismatch, since it works when one card is disabled (if I understood correctly).

Yeah, I don't think that's the problem. You are correct: it works with three cards or fewer, just not four (or maybe more, but there's no way to check that :)

> @WarrenSchultz - maybe you can debug this code on your system to check what happens? Is there a way to print the full error in this C++ code? Thanks a lot for your feedback - much appreciated!

Unfortunately, my C coding experience is about 20 years out of use at this point. I'm muddling through a bit, but not to the level I should be for proper debugging.
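
Note: the check in print_cuda_devices.cu can be reproduced without recompiling anything by calling the CUDA runtime from Python via ctypes, which also prints the full error string that cudaGetDeviceCount returns. A minimal sketch (not part of the CM scripts; the library name libcudart.so is an assumption and may need a version suffix such as libcudart.so.12 on some installs):

import ctypes

# Load the CUDA runtime directly (assumed name; adjust to your install).
cudart = ctypes.CDLL("libcudart.so")
# cudaGetErrorString returns a const char*, so tell ctypes to give us bytes.
cudart.cudaGetErrorString.restype = ctypes.c_char_p

count = ctypes.c_int()
err = cudart.cudaGetDeviceCount(ctypes.byref(count))  # same call as the .cu file
if err != 0:  # 0 == cudaSuccess
    print("cudaGetDeviceCount failed: error", err,
          "(" + cudart.cudaGetErrorString(err).decode() + ")")
else:
    print("cudaGetDeviceCount reports", count.value, "device(s)")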

arjunsuresh commented 1 year ago

@WarrenSchultz can you please share how much host memory the system has?

WarrenSchultz commented 1 year ago

@arjunsuresh Sorry, I meant to include that. Originally it had 64 GB available; I've since bumped the amount allocated to WSL2 to 112 GB, but that had no effect.

arjunsuresh commented 1 year ago

Thank you @WarrenSchultz for explaining. Sorry, I'm not able to think of a solution here.

This is not a proper solution, but one option is to hardcode ndev as 1 so that the script doesn't fail here. We can then see whether the benchmark run goes well.

WarrenSchultz commented 1 year ago

@arjunsuresh Thanks, I'll give that a shot.

As I was digging around for solutions, I looked at the TensorRT system JSON file and was wondering about "accelerators_per_node" being listed as 1 for the system (with 3 GPUs enabled). (I've trimmed a bit of unnecessary detail.)

"accelerator_frequency": "2505.000000 MHz",
"accelerator_host_interconnect": "N/A",
"accelerator_interconnect": "N/A",
"accelerator_interconnect_topology": "",
"accelerator_memory_capacity": "44.98779296875 GB",
"accelerator_memory_configuration": "N/A",
"accelerator_model_name": "NVIDIA RTX 6000 Ada Generation",
"accelerator_on-chip_memories": "",
"accelerators_per_node": "1",
"framework": "tensorrt",
"host_memory_capacity": "112G",
"host_memory_configuration": "undefined",
"host_networking": "Gig Ethernet",
"host_processor_caches": "L1d cache: 1.1 MiB (24 instances), L1i cache: 768 KiB (24 instances), L2 cache: 48 MiB (24 instances), L3 cache: 105 MiB (1 instance)",
"host_processor_core_count": "24",
"host_processor_model_name": "Intel(R) Xeon(R) w9-3495X",
"host_processors_per_node": "1",
"host_storage_capacity": "6.2T",
"host_storage_type": "SSD",
"number_of_nodes": "1",
"operating_system": "Ubuntu 22.04 (linux-5.15.90.4-microsoft-standard-WSL2-glibc2.35)",
WarrenSchultz commented 1 year ago

@arjunsuresh Hm. So, bearing in mind this is above my experience level, I tried debugging it myself and setting the value to 1, and got the same result. I then fed it through ChatGPT, which came up with code that returned the string value of the error: "Error: problem obtaining number of CUDA devices: out of memory"

Doing some more looking online, it seems this may have to do with the model not fitting in GPU memory rather than system memory (which seems odd, given that it works with fewer GPUs), unless it's trying to load the model for all the GPUs into the memory of a single GPU? (Which doesn't particularly make sense, but this is far outside my experience at this point. :)

I saw some guidance about changing the batch size, but passing those arguments to CM didn't have any effect.

WarrenSchultz commented 1 year ago

@arjunsuresh Well, it appears the root cause is WSL2 not handling that many GPUs properly. https://github.com/microsoft/WSL/issues/10269

From the link: "Observed Behavior: An “Out of Memory” error is triggered internally at ../c10/cuda/CUDAFunctions.cpp:109. The torch.cuda.is_available() function returns False.

Workaround: I found that calling torch.cuda.device_count() before torch.cuda.is_available() circumvents the error. However, this workaround requires modifying each script to include this extra call."
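
For reference, the quoted workaround comes down to the order of two PyTorch calls; a minimal sketch (assuming a PyTorch build with CUDA support) is:

import torch

# Per microsoft/WSL#10269: query the device count *before* the availability
# check; otherwise CUDA initialization under WSL2 can fail with a spurious
# "out of memory" and is_available() returns False.
print("device_count:", torch.cuda.device_count())
print("is_available:", torch.cuda.is_available())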

arjunsuresh commented 1 year ago

Thank you for reporting the issue with the system_json here -- we'll fix that, but it is only used for reporting purposes and should not affect the runs.

"I tried debugging myself and setting the value to 1, and got the same result" oh. so the problem happens when 4 GPUs are installed and even when just one of them is accessed. In that case I believe you can try setting this to 0 just to bypass this script and see what error TensorRT reports. Or just use 3 GPUs for now until we have a solution for WSL?

WarrenSchultz commented 1 year ago

I've been digging into this, and posted an update to a thread on the NVIDIA developer forums: https://forums.developer.nvidia.com/t/quad-4x-a6000-wsl2-cuda-init-errors/238106/9

Short version: there's an initialization issue happening at the driver level somewhere. Using export CUDA_VISIBLE_DEVICES=0,1,2,3 will fail, as will any combination of 1 and 3, for me. If I toggle the performance counter permissions in the Windows driver, the get-cuda-devices check passes, but it seems to be problematic on the actual test run.

Unfortunately, I'm doing performance benchmarking, and I need to actually see all four GPUs' performance together.
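
One detail worth noting when scripting these experiments: CUDA_VISIBLE_DEVICES is read when CUDA is first initialized in a process, so when testing GPU subsets from Python it has to be set before the first CUDA call, e.g. (a sketch, assuming PyTorch is installed):

import os

# Set this before anything initializes CUDA in the process
# (i.e. before importing/using torch.cuda).
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2"   # e.g. leave out one GPU

import torch
print("visible devices:", torch.cuda.device_count())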

arjunsuresh commented 1 year ago

Oh, okay. Do you think Docker can help?

WarrenSchultz commented 1 year ago

I think it's on the Windows driver level, but it couldn't hurt to try. What do you suggest?

arjunsuresh commented 1 year ago

Sure. Are you following these instructions for running bert? In that case you can just switch to the "using docker" section.

WarrenSchultz commented 1 year ago

Yup, that's what I've been using. I'll give it a shot, thanks. Looking at some of the other posts people made about the issue, it seems Docker is affected as well, but I'm fine with using Docker if I'm lucky enough that it works. :)

WarrenSchultz commented 1 year ago

Same issue with Docker, unfortunately. Was worth trying, thanks.

arjunsuresh commented 1 year ago

Thank you @WarrenSchultz for trying. If it is a driver issue, could an older version help? Is dual-booting to Linux an option?

WarrenSchultz commented 1 year ago

I'm currently discussing with my team whether dual-booting is an option. The driver version doesn't seem to matter. I saw a post where Ubuntu 20.04 worked for someone, but I had no luck.

WarrenSchultz commented 1 year ago

@arjunsuresh Still discussing whether switching to Ubuntu is an option. In the meantime, I was checking things with lower GPU counts. When I rebuilt the environment with a single GPU and then increased it to multiple GPUs after the initial test, it didn't give a mismatched-configuration error, but instead threw an error about the QPS value being the wrong number type.

The trace is below. What is the correct way to generate a new custom spec with CM? I've tried finding it in the docs, but have had no luck.

  * cm run script "get generic-python-lib _transformers"
  * cm run script "get generic-python-lib _safetensors"
  * cm run script "get generic-python-lib _onnx"
  * cm run script "reproduce mlperf inference nvidia harness _build_engine _cuda _tensorrt _bert-99 _offline _bert_"
  * cm run script "reproduce mlperf inference nvidia harness _preprocess_data _cuda _tensorrt _bert-99 _bert_"
Traceback (most recent call last):
  File "/home/ptuser/.local/bin/cm", line 8, in <module>
    sys.exit(run())
  File "/home/ptuser/.local/lib/python3.10/site-packages/cmind/cli.py", line 35, in run
    r = cm.access(argv, out='con')
  File "/home/ptuser/.local/lib/python3.10/site-packages/cmind/core.py", line 546, in access
    r = action_addr(i)
  File "/home/ptuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 1378, in run
    r = customize_code.preprocess(ii)
  File "/home/ptuser/CM/repos/mlcommons@ck/cm-mlops/script/run-mlperf-inference-app/customize.py", line 148, in preprocess
    r = cm.access(ii)
  File "/home/ptuser/.local/lib/python3.10/site-packages/cmind/core.py", line 667, in access
    return cm.access(i)
  File "/home/ptuser/.local/lib/python3.10/site-packages/cmind/core.py", line 546, in access
    r = action_addr(i)
  File "/home/ptuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 1455, in run
    r = prepare_and_run_script_with_postprocessing(run_script_input)
  File "/home/ptuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 3842, in prepare_and_run_script_with_postprocessing
    r = script_automation._call_run_deps(posthook_deps, local_env_keys, local_env_keys_from_meta, env, state, const, const_state,
  File "/home/ptuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 2371, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "/home/ptuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 2518, in _run_deps
    r = self.cmind.access(ii)
  File "/home/ptuser/.local/lib/python3.10/site-packages/cmind/core.py", line 546, in access
    r = action_addr(i)
  File "/home/ptuser/CM/repos/mlcommons@ck/cm-mlops/automation/script/module.py", line 1378, in run
    r = customize_code.preprocess(ii)
  File "/home/ptuser/CM/repos/mlcommons@ck/cm-mlops/script/reproduce-mlperf-inference-nvidia/customize.py", line 172, in preprocess
    target_qps = int(target_qps)
ValueError: invalid literal for int() with base 10: '584.632'
arjunsuresh commented 1 year ago

@WarrenSchultz Sorry for the late reply. This is due to a bug in the code that we hadn't noticed because we were always giving target_qps as an input. This PR should fix it:

https://github.com/mlcommons/ck/pull/923
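
For context, the failing line in customize.py converts a float-formatted string such as '584.632' straight to int, which raises the ValueError shown in the trace. Illustratively (not necessarily the exact change in the PR), the fix amounts to going through float() first:

# Illustrative sketch only: int("584.632") raises ValueError,
# so convert via float() before truncating to an integer.
target_qps = "584.632"               # value taken from the traceback above
target_qps = int(float(target_qps))  # -> 584
print(target_qps)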