microsoft / microxcaling

PyTorch emulation library for Microscaling (MX)-compatible data formats
MIT License
Activations unit tests (gelu, tanh) hang on both A10 and A100 gpus... #8

lessw2020 commented 7 months ago

I ran the unit test suites but this resulted in hangs. Debugging a bit, I isolated it to the activations unit tests, specifically gelu and tanh.




I thought perhaps it was due to running on A10 gpus, so I changed to A100 but hit the exact same hangs.

I'm running on the latest PT nightlies; not sure whether that has an impact here. I manually ran the other tests relevant for LLMs, and these all passed with no issues.

rizhao-msft commented 7 months ago

It looks like the first GPU unit test that uses the custom CUDA code is hanging. Usually the first such test takes a minute because it builds the CUDA code as a PyTorch extension. If it hangs for more than a minute, this could be an error building that code.

Try doing Ctrl+C when it hangs, then scroll all the way up (there's going to be a bunch of pytest messages, but the first message will have the CUDA build error if it is a build issue). There's likely a bunch of CUDA warnings too.

Can you make sure you have the ninja package installed?

lessw2020 commented 7 months ago

Hi @rizhao-msft - thanks for the info above.

I found that when running with a single GPU (A100), the tests all pass with just some warnings about pkg namespace:
  /data/home/less/miniconda3/lib/python3.9/site-packages/torch/utils/ DeprecationWarning: pkg_resources is deprecated as an API. See
    from pkg_resources import packaging  # type: ignore[attr-defined]
  /data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/ DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See
  /data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/ DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See
  /data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/ DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See
  /data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/ DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google.logging')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See
  /data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/ DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('mpl_toolkits')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See
  /data/home/less/miniconda3/lib/python3.9/site-packages/pkg_resources/ DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('sphinxcontrib')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See

-- Docs:
============================================================= 186 passed, 16 warnings in 33.01s

lessw2020 commented 7 months ago

I do have ninja installed on both servers. Running with 4 or 8 GPUs, I get the hang. Using the Ctrl+C break you mentioned results in this trace:

============================================================ 40 passed, 3 warnings in 13.45s

lessw2020 commented 7 months ago

just in case - ninja version on both servers is: Requirement already satisfied: ninja in /opt/conda/envs/pytorch/lib/python3.9/site-packages (1.11.1)

rizhao-msft commented 7 months ago

The multi-GPU failure is inside PyTorch when it tries to build the custom extensions. Are multiple processes trying to run the test simultaneously? It looks like multiple processes are trying to build the CUDA code at once and there's a race condition. I don't think PyTorch supports multiple processes calling torch.utils.cpp_extension.load() to build the CUDA code simultaneously. For thread safety in our code, we've always had one process execute the following first to build the CUDA code:

import torch.distributed
# Only rank 0 imports (and therefore builds) the CUDA extension first.
if torch.distributed.get_rank() == 0:
    import mx.custom_extensions


If that's the problem I don't think there's a solution to the tests hanging in a multi-GPU environment. I will document the thread safety issue above.
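The rank-0-first pattern above can be sketched with an explicit barrier, so the remaining ranks only import once the build has finished. This is a minimal sketch, not code from the library: the helper name `import_rank_zero_first` is hypothetical, and `rank` and `barrier` are passed in as parameters so the snippet runs without torch. It only applies when torch.distributed is actually in use and already initialized.

```python
import importlib


def import_rank_zero_first(module_name, rank, barrier):
    """Import module_name on rank 0 before any other rank imports it.

    rank and barrier are parameters so this helper has no hard dependency on
    torch. With torch.distributed initialized they would typically be
    torch.distributed.get_rank() and torch.distributed.barrier.
    """
    if rank == 0:
        # Rank 0 triggers the one-time JIT build by importing first.
        module = importlib.import_module(module_name)
    # Every rank calls the barrier; the others block here until rank 0 arrives.
    barrier()
    if rank != 0:
        # The build has finished, so this import should only load the result.
        module = importlib.import_module(module_name)
    return module
```

With an initialized process group, a call might look like `import_rank_zero_first("mx.custom_extensions", torch.distributed.get_rank(), torch.distributed.barrier)`.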

lessw2020 commented 7 months ago

I'm not running anything related to torch.distributed here (so there is no concept of local_rank for the check you mentioned above), and only GPU 0 should be active. I'm just firing the test, which should only activate gpu:0 regardless of how many GPUs are present. Wondering if there's a need to spec how items are moving to cuda.

rizhao-msft commented 7 months ago

Wondering if there's a need to spec how items are moving to cuda.

Do you mean how tensors are moving to the GPU?

In the multi-GPU setup, if you do: import mx.custom_extensions

Does it hang?

lessw2020 commented 7 months ago

Hi @rizhao-msft - Yes, simply importing will generate the hang:


Gave it about 5 mins to make sure... it's the exact same deadlock:

^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/microxcaling/examples/../mx/", line 19, in <module>
    funcs = load(name="custom_extensions", sources=sources)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/", line 1308, in load
    return _jit_compile(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/", line 1724, in _jit_compile
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/", line 42, in wait

rizhao-msft commented 7 months ago

Interesting, we've tested on multi-GPU machines before and never had this issue. The hang can happen when using torch.distributed to do multi-GPU training, but never when just importing it in a single process.

I'll take a look to see if I can reproduce.
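One possible cause the thread does not confirm: the traces above stop in a wait call inside _jit_compile. As far as is known, PyTorch's JIT extension build guards each build directory with a file named lock, and an interrupted build can leave that file behind, which makes later load() calls wait. Below is a minimal sketch for spotting such files. It assumes the cache root is either the TORCH_EXTENSIONS_DIR environment variable or ~/.cache/torch_extensions (the usual Linux default), and both helper names are hypothetical.

```python
import os
from pathlib import Path


def default_build_root():
    """Best-guess root of the JIT extension cache (an assumption, see above)."""
    override = os.environ.get("TORCH_EXTENSIONS_DIR")
    if override:
        return Path(override)
    return Path.home() / ".cache" / "torch_extensions"


def find_lock_files(build_root):
    """Return every regular file named 'lock' below build_root, sorted."""
    root = Path(build_root)
    if not root.is_dir():
        # No cache directory at all means there is nothing to report.
        return []
    return sorted(path for path in root.rglob("lock") if path.is_file())
```

A lock file is also present while a build is legitimately running, so a hit from `find_lock_files(default_build_root())` is only suspicious when no other process is building at the time.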

lessw2020 commented 7 months ago

cool - for reference, verified that the same simple import above has no issue on single gpu machine (maybe 30 second delay and then ready).

mgolub2-ms commented 7 months ago

Hi @lessw2020

A couple of questions - what pytorch version are you using? Can you try 2.1.0?

Does it still happen when you set CUDA_VISIBLE_DEVICES=0 before the command? Or if you use the PYTHONPATH=PATH_TO_MX_FOLDER?

If you paste the output of pip list --format=freeze that might help us track down what is happening too.
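The checks suggested above (restricting CUDA_VISIBLE_DEVICES, setting PYTHONPATH) can be scripted so that a hang becomes a timeout rather than a stuck shell. This is a minimal sketch under stated assumptions: the helper name `python_code_hangs` and the 300-second default are hypothetical choices, and a failed import (for example ModuleNotFoundError) counts as "did not hang".

```python
import os
import subprocess
import sys


def python_code_hangs(code, timeout_s=300.0, visible_devices="0", extra_path=None):
    """Run code in a fresh interpreter; return True if it exceeds timeout_s."""
    # Copy the current environment and expose only the requested GPU(s).
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=visible_devices)
    if extra_path is not None:
        # Prepend extra_path to any PYTHONPATH that is already set.
        old = env.get("PYTHONPATH")
        env["PYTHONPATH"] = extra_path if not old else extra_path + os.pathsep + old
    try:
        subprocess.run([sys.executable, "-c", code], env=env,
                       timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return True
    return False
```

For this issue the call would be something like `python_code_hangs("import mx.custom_extensions", extra_path=PATH_TO_MX_FOLDER)`, where PATH_TO_MX_FOLDER is the placeholder from the comment above.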

mgolub2-ms commented 7 months ago

Wondering if there's a need to spec how items are moving to cuda.

The underlying tensors are still normal pytorch tensors, so nothing special should be needed there.

lessw2020 commented 7 months ago

Hi @mgolub2-ms: I'm running with latest pytorch nightlies (1106). Installing 2.1 now to see if that fixes.

lessw2020 commented 7 months ago

deadlock is the same with 2.1.0 (bit relieved that is the case...going back to nightlies).

>>> import mx.custom_extensions
^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/microxcaling/examples/../mx/", line 19, in <module>
    funcs = load(name="custom_extensions", sources=sources)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/", line 1308, in load
    return _jit_compile(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/", line 1724, in _jit_compile
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/", line 42, in wait

lessw2020 commented 7 months ago

Does it still happen when you set CUDA_VISIBLE_DEVICES=0 before the command?

Yes, no change, same deadlock.

lessw2020 commented 7 months ago

Here's pip freeze for reference on this server. But I can repro this on multiple servers, so I don't think there's anything specific here (A10 on EC2, A100 on premise). You can spin up a $5/hr G5 A10 (4 GPU) instance on EC2 and repro this, which might be helpful at this point as you could then investigate directly. One common aspect is that I'm running CUDA 12.1 on all servers. I also have the Triton nightly build (OpenAI Triton) installed on both servers. I'm not sure if either could be an issue/delta, but if you are running only CUDA 11.8 then perhaps that might explain the difference?

mgolub2-ms commented 7 months ago

@rizhao-msft spun up a multi-GPU (2) Azure VM and was unable to reproduce the hang - we are continuing to investigate.

mgolub2-ms commented 7 months ago

Hmm, this is odd - I've been unable to reproduce the issue either, using cuda 12.1, torch 2.1.0, on a 4 GPU A100 azure VM.

I tried configuring accelerate and installing triton/pytorch triton, with no issues running the tests or the examples. I've also tried setting the path using both sys.path.append and PYTHONPATH, to test if there was an issue there; both work for me.

One thing I noticed that is a bit odd - you have some cuda 11 packages and some cuda 12 packages:

nvidia-cuda-cupti-cu11             11.7.101
nvidia-cuda-cupti-cu12             12.1.105
nvidia-cuda-nvrtc-cu11             11.7.99
nvidia-cuda-nvrtc-cu12             12.1.105
nvidia-cuda-runtime-cu11           11.7.99
nvidia-cuda-runtime-cu12           12.1.105
nvidia-nccl-cu11                   2.14.3
nvidia-nccl-cu12                   2.18.1
nvidia-nvjitlink-cu12              12.3.52
nvidia-nvtx-cu11                   11.7.91
nvidia-nvtx-cu12                   12.1.105

For comparison, my environment only has cuda 12 packages:

nvidia-cuda-cupti-cu12                  12.1.105
nvidia-cuda-nvrtc-cu12                  12.1.105
nvidia-cuda-runtime-cu12                12.1.105
nvidia-ml-py                            11.525.112
nvidia-ml-py3                           7.352.0
nvidia-nccl-cu12                        2.18.1
nvidia-nvjitlink-cu12                   12.3.52
nvidia-nvtx-cu12                        12.1.105

Is it possible to try an environment with a single set of cuda packages? Or even a brand new virtualenv and then install the requirements using pip install -r requirements.txt ?
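The mixed cu11/cu12 observation above can be checked automatically. This is a minimal sketch, assuming the NVIDIA wheels follow the `nvidia-<component>-cuNN` naming seen in the lists above; the helper names are hypothetical and only the standard library is used.

```python
import re
from collections import defaultdict
from importlib import metadata

# Matches names such as "nvidia-cuda-runtime-cu12" and captures the "12".
_CUDA_SUFFIX = re.compile(r"^nvidia-.+-cu(\d+)$")


def group_by_cuda_major(package_names):
    """Group NVIDIA wheel names by the CUDA major version in their suffix."""
    groups = defaultdict(list)
    for name in package_names:
        # Normalize case and underscores so both spellings of a name match.
        normalized = name.lower().replace("_", "-")
        match = _CUDA_SUFFIX.match(normalized)
        if match:
            groups[match.group(1)].append(name)
    return dict(groups)


def installed_package_names():
    """Names of all distributions visible to the current interpreter."""
    names = (dist.metadata["Name"] for dist in metadata.distributions())
    return [name for name in names if name]
```

If `group_by_cuda_major(installed_package_names())` returns more than one key, the environment mixes CUDA major versions, as in the first list above.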

Not sure why multi-gpu would be affected by that vs. a single gpu. It's really interesting that CUDA_VISIBLE_DEVICES does not help - to my knowledge, that basically emulates a single GPU system for any command.

lessw2020 commented 7 months ago

Thanks for the updates @rizhao-msft and @mgolub2-ms. I'll pull a brand new EC2 instance this afternoon to help isolate this.
I believe the multi-cuda (11 and 12) is from the starting AWS AMI (i.e. they have a bundle in there), not something I overtly installed. But, at least starting with a brand new image will help further reduce the search space.

lessw2020 commented 7 months ago

Hi @rizhao-msft and @mgolub2-ms
Some good news here. A clean new machine works great (with PyTorch 2.1). I'll upgrade to nightlies next, but at least for now this hang issue is localized to something in my configs.

Here's the pip freeze from new AWS instance + pip install -r requirements.txt so there's a snapshot to diff against:

import sys
sys.path.append('..')
import mx.custom_extensions
exit()
(pytorch) ubuntu@ip-172-31-27-89:~/microxcaling/mx$ pip list format=freeze
Package Version

lessw2020 commented 7 months ago

ok working great with latest PT nightly (1108) as well. I also don't find any of the cuda libs running around on this new machine, so it seems that having those may have been the issue? (They may have come from building PyTorch from source.) Anyway, at least for EC2, this issue is resolved. I will set up a new env for on prem and see if that also resolves it.

rizhao-msft commented 7 months ago

Good! We also have an alternative way to build the CUDA extensions, on the branch dev/rizhao/prebuilt_extensions. You can switch to that branch, then go to mx/cpp, and run python install. This will install the CUDA extension as a Python package, and it will avoid running the load() function that was hanging.

This could help if you run into the problem again.

lessw2020 commented 7 months ago

That sounds great re: prebuilt - I'll switch to that branch. I'll start converting a toy model tomorrow to start actively running with MX FP6 now that I'm past the hang. I'll go ahead and close this one since it's clear it's not a generic issue like I initially thought based on it showing on 2 diff servers. Thanks for the help on getting past the hanging issue!