modularml / mojo

The Mojo Programming Language
https://docs.modular.com/mojo/manual/

[BUG] FMA instruction lowering issue, slow execution for architectures with no hardware IC #3185

Open andresnowak opened 4 months ago

andresnowak commented 4 months ago

Bug description

There seems to be a bug in the FMA op when using float16 and bfloat16 values.

Steps to reproduce

Code: https://github.com/Benny-Nottonson/Mojo-Marathons/tree/cd3c6f2ba39500ba94736312b55fef36a4c4c4d6. When I run `main.mojo` and then run the test on Linux, I get this error for the (1, 47, 97) matrix shape when using float16:

`assertionError: 24.4375 is not close to 24.453125 with a diff of 0.015625`
`mojo: error: execution exited with a non-zero result: 1`

But when using float32 or float64, changing `basic_matmul` in `test.mojo` to not use the fma operator, or running it on macOS, the test passes. I also get very slow speeds when using the fma op with float16.

This happens on Linux on a Ryzen 3600X; on an M1 chip the problem doesn't happen.
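A minimal standalone probe along these lines might help isolate the fma path (a sketch I put together, not the Mojo-Marathons code; the operand values are arbitrary but exactly representable in float16):

```mojo
from math import fma

# One accumulation step in float16, fused vs. split, next to the same step
# done in float32. On the affected machine this shows whether the fp16 fma
# path alone produces a divergent result.
fn main():
    var a = Float16(1.5009765625)  # exactly representable in float16
    var b = Float16(2.998046875)   # exactly representable in float16
    var c = Float16(19.953125)     # exactly representable in float16
    print("fp16 fused:", fma(a, b, c))
    print("fp16 split:", a * b + c)
    print(
        "fp32 reference:",
        fma(
            a.cast[DType.float32](),
            b.cast[DType.float32](),
            c.cast[DType.float32](),
        ),
    )
```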

System information

- What OS did you install Mojo on? Pop!_OS and macOS
- Provide version information for Mojo by pasting the output of `mojo -v` 24.4.0
- Provide Modular CLI version by pasting the output of `modular -v` 0.8.0
martinvuyk commented 4 months ago

I think this is because your CPU has no hardware FMA, so whatever math library runs on your system emulates it, with precision differences. See: this stackoverflow answer

andresnowak commented 4 months ago

I have a Ryzen 3600X and it seems it does have the FMA instruction, so I don't think that is the problem.

martinvuyk commented 4 months ago

Then the tolerance is too tight: `24.4375 is not close to 24.453125 with a diff of 0.015625` is close, depending on the application. And there will always be differences between fma and a normal multiply-and-add, since fma rounds once while the separate multiply and add round once each.

Excerpt from this stackoverflow answer:

Additionally, there's a whole other can of worms about the slight differences in the results from std::fma and (a*b)+c due to the way rounding is handled with floating point numbers. std::fma only rounds once during the two floating point operations, while (a*b)+c might[1] do a*b, store the result in 64 bits, add c to this value and then store the result in 64 bits.

If you want to minimize floating point arithmetic error in your calculations, std::fma is probably a better choice because it guarantees you will only have precious bits stripped away from your precious floating point numbers once.

[1] Whether or not this extra rounding happens depends on your compiler, your optimization settings and your architecture settings: Compiler Explorer examples for msvc, gcc, icc, clang
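The single-rounding effect is easy to see directly (a sketch with hand-picked float32 values, unrelated to the matmul test): the exact product a*a = 1 + 2**-11 + 2**-24 does not fit in a float32, so rounding the product before the add cancels the small term, while fma keeps it.

```mojo
from math import fma

fn main():
    var a = Float32(1.000244140625)  # 1 + 2**-12, exact in float32
    var c = Float32(-1.00048828125)  # -(1 + 2**-11), exact in float32
    # Two roundings: the product rounds to 1 + 2**-11, then the add cancels it.
    print(a * a + c)     # 0.0 (assuming the compiler doesn't contract this into an fma)
    # One rounding: the exact product is kept until the final result.
    print(fma(a, a, c))  # 2**-24 ~= 5.96e-08
```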

andresnowak commented 4 months ago

Hmm okay, but the thing is this problem only happens with float16 and bfloat16, and I think a 1e-2 error is already big, no? And also, when using fma with float16 the operations are a lot slower, which is what was also strange. So that's why I was thinking it is something else. Maybe there is an error where the llvm fma intrinsic is not using the fma instruction in that case?
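For scale (a quick check, not from the original test): the two values in the failing assertion are adjacent float16 values, so the reported diff is exactly one float16 ulp at that magnitude.

```mojo
fn main():
    # For magnitudes in [16, 32), adjacent float16 values are spaced
    # 2**-6 = 0.015625 apart, which is exactly the reported diff.
    print(Float16(24.453125) - Float16(24.4375))  # 0.015625
```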

martinvuyk commented 4 months ago

It could also be that your CPU only has FP32 and/or FP64 units and is converting the FP16 values before calculating.

Found another stackoverflow answer:

I think this is getting into the realm of stuff the compiler really should resolve according to the target microarchitecture
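If that is what is happening, one explicit workaround at the call site would be something like this hypothetical helper (`fma_f16_via_f32` is not an existing API, just a sketch): widen once, fuse in float32, and narrow the result back.

```mojo
from math import fma

# Hypothetical helper, not part of the stdlib: do the fused multiply-add in
# float32 and narrow back to float16, instead of relying on an fp16 fma.
fn fma_f16_via_f32[width: Int](
    a: SIMD[DType.float16, width],
    b: SIMD[DType.float16, width],
    c: SIMD[DType.float16, width],
) -> SIMD[DType.float16, width]:
    return fma(
        a.cast[DType.float32](),
        b.cast[DType.float32](),
        c.cast[DType.float32](),
    ).cast[DType.float16]()
```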

andresnowak commented 4 months ago

Yeah, that's why I made the issue. Maybe this is what is happening (though from what I understand I have support for float and double, but maybe I'm wrong), but I think the compiler should solve the problem, as you say, depending on the architecture it is compiling for.

martinvuyk commented 4 months ago

float and double

float = FP32 and double = FP64. FP16 is pretty new and niche to neural networks, so this is likely a problem of them assuming users have a newer chip. I also have a pretty old CPU and fma is slow as a snail, but in my case I have no FMA at all.

compiler should solve the problem

Totally. Maybe change the title of the issue to something like "FMA instruction lowering issue, slow execution for architectures with no hardware IC".

andresnowak commented 4 months ago

Hmmm okay, true, I understand. Looking into it, it seems that I do have fp16 support, but fp16 fma seems to be something newer (it seems Zen 4 has it, and the M1 also seems to have it). I'll change the title, thank you.

JoeLoser commented 1 month ago

Is this still an issue?

andresnowak commented 1 month ago

For now I don't have access to that computer, so I won't be able to confirm whether the bug still exists, but I think it still does. The bug was on a Ryzen 3600X running Pop!_OS Linux (if that helps to replicate the problem).

And it seems the problem is more that the compiler doesn't try to generate appropriate code when the CPU doesn't support that type of instruction. I think it should be possible for the compiler to instead emit a non-fma operation for the fp16 type if the CPU doesn't support it, as sketched below. That's what I understand is happening, because the 3600X supposedly has fp16 support, just not fp16 fma support.
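Something like the following is the shape of the fallback being suggested (a hypothetical sketch; the `use_hw_fma` parameter stands in for whatever target-feature check the compiler or stdlib would actually perform):

```mojo
from math import fma

# Hypothetical sketch: use a true fused fp16 fma only when the target supports
# it, otherwise fall back to a plain multiply and add (two roundings, but no
# slow fp16-FMA emulation).
fn fma_or_split[width: Int, use_hw_fma: Bool](
    a: SIMD[DType.float16, width],
    b: SIMD[DType.float16, width],
    acc: SIMD[DType.float16, width],
) -> SIMD[DType.float16, width]:
    @parameter
    if use_hw_fma:
        return fma(a, b, acc)
    else:
        return a * b + acc
```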