microsoft / microxcaling

PyTorch emulation library for Microscaling (MX)-compatible data formats
MIT License

Custom CUDA code vs. Pytorch CPU/GPU #11

Closed rjfnobre closed 6 months ago

rjfnobre commented 7 months ago

Thanks for making such a nice library openly available.

I have a couple of questions regarding the following claim in the README of the project: "The custom CUDA code is faster, and in the case of MX more numerically accurate than pytorch GPU."

I understand that the custom CUDA code is faster, but why is it more numerically accurate than PyTorch GPU? And what about PyTorch CPU: is it also less accurate than the custom CUDA code?

rizhao-msft commented 6 months ago

The PyTorch code performs bit shifts by multiplying or dividing by powers of two. On some GPUs this can lead to small inaccuracies when the shift amount is large. We've observed the following on V100 GPUs:

# Left-shift x by 16 bits (multiply by 2**16)
>>> x = torch.tensor([1.], dtype=torch.float32, device='cuda')
>>> e = torch.tensor([16.], dtype=torch.float32, device='cuda')
>>> x * 2**e
tensor([65535.9961], device='cuda:0')   # should be 65536

The inaccuracy isn't a big deal for actual deep-learning workloads, but it can trip some unit tests. So we always treat the CPU or custom CUDA path as the golden reference, not PyTorch GPU.
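
For reference, here is a minimal sketch (not part of the library) showing that the same scaling is exact on the CPU, and that torch.ldexp, which multiplies by an exact power of two, also gives the exact result:

# Sketch only: same power-of-two scaling on CPU tensors
>>> x = torch.tensor([1.], dtype=torch.float32)
>>> e = torch.tensor([16.], dtype=torch.float32)
>>> x * 2**e
tensor([65536.])                         # exact on CPU
>>> torch.ldexp(x, torch.tensor([16]))   # multiply by 2**16 exactly
tensor([65536.])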