Would you be open to contributions that improve support for BFloat16?
Examples:
- If `bfloat: 16` and the device supports `torch.bfloat16`, cast instead of emulating.
- Allow the custom CUDA kernels to work with `torch.bfloat16` tensors:
  - At first, by casting them to `float`, performing the operation, then casting them back to `bfloat16` (see the sketch after this list).
  - Then, where applicable, by adding native BFloat16 operations to speed up emulation. (I believe this should be possible for MX types with `scale_bits <= 8` and element formats that use <= 8 bits.)
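
To make the examples above concrete, here is a minimal Python sketch of what I have in mind, assuming the custom CUDA kernels currently accept only `float32` inputs; the names `bf16_compat` and `maybe_cast_bf16` are hypothetical, not existing library APIs:

```python
import torch


def bf16_compat(op):
    """Wrap a float32-only custom CUDA op so it also accepts
    torch.bfloat16 tensors: upcast to float32, run the op, and cast
    the result back to bfloat16."""
    def wrapped(x: torch.Tensor, *args, **kwargs):
        if x.dtype == torch.bfloat16:
            return op(x.float(), *args, **kwargs).to(torch.bfloat16)
        return op(x, *args, **kwargs)
    return wrapped


def maybe_cast_bf16(x: torch.Tensor) -> torch.Tensor:
    """If the device handles bfloat16 natively, a plain cast round-trip
    already reproduces the precision loss, so emulation can be skipped."""
    native = x.device.type == "cpu" or (
        x.device.type == "cuda" and torch.cuda.is_bf16_supported()
    )
    if native:
        return x.to(torch.bfloat16).to(x.dtype)
    # Otherwise fall back to the existing emulation path.
    raise NotImplementedError
```

The wrapper costs two extra dtype conversions per call, so I see it only as a stop-gap until native BFloat16 kernels are added where they pay off.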