It looks like the Armv8 ISA has support for bf16, but my M2 Max does not have it, so I'm resorting to bf16 -> f32 conversion and doing the computations in f32. This is 2x slower than f16, but 8x faster than what I get if I try to run a bf16 model directly on the M2 (NEON and Metal).