microsoft / microxcaling

PyTorch emulation library for Microscaling (MX)-compatible data formats
MIT License

Custom CUDA code vs. Pytorch CPU/GPU #11

Closed rjfnobre closed 6 months ago

rjfnobre commented 7 months ago

Thanks for making such a nice library openly available.

I have a couple of questions regarding the following claim in the README of the project: "The custom CUDA code is faster, and in the case of MX more numerically accurate than pytorch GPU."

I understand that the custom CUDA code is faster, but why is it more numerically accurate than PyTorch GPU? And what about PyTorch CPU: is it also less accurate than the custom CUDA code?

rizhao-msft commented 6 months ago

The PyTorch code performs bit shifts by multiplying or dividing by powers of two. On some GPUs this can lead to small inaccuracies when the shift amount is large. We've observed the following on V100 GPUs:

# Left-shift x by 16 bits (multiply by 2**16)
>>> x = torch.tensor([1.], dtype=torch.float32, device='cuda')
>>> e = torch.tensor([16.], dtype=torch.float32, device='cuda')
>>> x * 2**e
tensor([65535.9961], device='cuda:0')   # should be 65536

The inaccuracy isn't a big deal for actual deep-learning workloads, but it can trip some unit tests. So we always treat the CPU or custom CUDA path as the golden reference, not PyTorch GPU.
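
For reference, here is a minimal sketch (not part of the library) showing that the same scaling is exact on the CPU, and that torch.ldexp, which multiplies by an exact power of two, also gives the exact result:

# Sketch only: same power-of-two scaling on CPU tensors
>>> x = torch.tensor([1.], dtype=torch.float32)
>>> e = torch.tensor([16.], dtype=torch.float32)
>>> x * 2**e
tensor([65536.])                         # exact on CPU
>>> torch.ldexp(x, torch.tensor([16]))   # multiply by 2**16 exactly
tensor([65536.])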