This appears to share its root cause with the hanging unit tests, but for simplicity I am opening a separate issue.
Repro steps:
1. Edit examples/ffn_mx.py to add sys.path.append('..') so the mx module can be imported (snippet below).
2. Run bash run_mx6.sh.
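For reference, this is the path tweak at the top of examples/ffn_mx.py (assuming the script is launched from the examples/ directory, so '..' is the repo root containing the mx package):

```python
# top of examples/ffn_mx.py
import sys
sys.path.append('..')  # repo root, so `import mx` resolves when run from examples/
```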
On a single GPU the run finishes with "Done!". On multiple GPUs it hangs.
Stack trace from hang:
```
(pytorch) ubuntu@ip-172-31-66-198:~/microxcaling/examples$ bash run_mx6.sh --full-trace
^CTraceback (most recent call last):
  File "/home/ubuntu/microxcaling/examples/ffn_mx.py", line 74, in <module>
    y = mlp(x)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/microxcaling/examples/ffn_mx.py", line 39, in forward
    norm_outputs = self.layernorm(inputs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/microxcaling/examples/../mx/layernorm.py", line 91, in forward
    return LayerNormFunction.apply(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/function.py", line 551, in apply
    return super().apply(*args, **kwargs) # type: ignore[misc]
  File "/home/ubuntu/microxcaling/examples/../mx/layernorm.py", line 22, in forward
    x = vec_quantize(x, mx_specs=mx_specs)
  File "/home/ubuntu/microxcaling/examples/../mx/vector_ops.py", line 35, in vec_quantize
    return quantize_elemwise_op(input, mx_specs=mx_specs,
  File "/home/ubuntu/microxcaling/examples/../mx/elemwise_ops.py", line 253, in quantize_elemwise_op
    A = _quantize_bfloat(A, bfloat=mx_specs['bfloat'], round=round,
  File "/home/ubuntu/microxcaling/examples/../mx/elemwise_ops.py", line 206, in _quantize_bfloat
    return _quantize_elemwise_core(
  File "/home/ubuntu/microxcaling/examples/../mx/elemwise_ops.py", line 118, in _quantize_elemwise_core
    from . import custom_extensions
  File "/home/ubuntu/microxcaling/examples/../mx/custom_extensions.py", line 19, in <module>
    funcs = load(name="custom_extensions", sources=sources)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1308, in load
    return _jit_compile(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1724, in _jit_compile
    baton.wait()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/file_baton.py", line 42, in wait
    time.sleep(self.wait_seconds)
KeyboardInterrupt
```
Note that I waited over 5 minutes in case this was just a slow first-time compile, and I reproduced the hang multiple times.
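For what it's worth, the hang is inside torch.utils.cpp_extension's file-baton wait, which looks like the multi-GPU ranks racing to JIT-compile mx's custom_extensions on first use. A minimal workaround sketch, assuming the example is launched with torch.distributed already initialized before the first mx op (the rank check and barrier are my additions, not part of ffn_mx.py):

```python
import torch.distributed as dist

# Sketch: let rank 0 trigger the JIT build of mx's custom extension first,
# then let the remaining ranks import the already-built artifact instead of
# contending for torch's file baton.
if dist.is_available() and dist.is_initialized():
    if dist.get_rank() == 0:
        from mx import custom_extensions  # triggers torch.utils.cpp_extension.load()
    dist.barrier()  # other ranks wait until rank 0 has finished the build
from mx import custom_extensions  # non-zero ranks now load the cached build
```

Alternatively, pointing TORCH_EXTENSIONS_DIR at a per-rank directory should avoid the shared build lock entirely, at the cost of compiling the extension once per process.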