andresnowak opened this issue 4 months ago
I think this is because your CPU has no hardware FMA and whatever math library runs on your system emulates it and has precision differences. see: this stackoverflow answer
I have a Ryzen 3600X and it seems it does have the instruction, so I don't think that is the problem.
Then the tolerance is too tight. Whether "24.4375 is not close to 24.453125 with a diff of 0.015625" counts as close depends on the application. And there will always be differences when doing fma versus a separate multiply and add, since fma rounds once while the multiply and the add each round once.
Excerpt from this stackoverflow answer:
Additionally, there's a whole other can of worms about the slight differences in the results from std::fma and (a*b)+c due to the way rounding is handled with floating point numbers. std::fma only rounds once during the two floating point operations, while (a*b)+c might[1] do a*b, store the result in 64 bits, add c to this value and then store the result in 64 bits.
If you want to minimize floating point arithmetic error in your calculations, std::fma is probably a better choice because it guarantees you will only have precious bits stripped away from your precious floating point numbers once.
[1] Whether or not this extra rounding happens depends on your compiler, your optimization settings and your architecture settings: Compiler Explorer examples for msvc, gcc, icc, clang
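To make the rounding point concrete, here is a minimal sketch in Mojo; the values are illustrative assumptions (not taken from the failing test), and it assumes the stdlib math.fma for the fused operation:

```mojo
from math import fma


fn main():
    # Illustrative float16 inputs (assumptions, not from the failing test).
    var a = Float16(3.3)
    var b = Float16(7.4)
    var c = Float16(0.1)

    # Fused multiply-add: a * b + c with a single rounding at the end.
    var fused = fma(a, b, c)

    # Separate multiply and add: one rounding for a * b, another for + c.
    var separate = a * b + c

    # In float16 the two results can differ in the last bit(s).
    print("fma:     ", fused)
    print("mul+add: ", separate)
```

Neither result is "wrong"; they are two differently rounded answers to the same expression, which is why a test comparing them needs a tolerance that accounts for the precision of the element type.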
Hmm okay, but the thing is this problem only happens with float16 and bfloat16, and I think a 1e-2 error is already big, no? Also, when using fma with float16 the operations are a lot slower, which is what was also strange. So that's why I was thinking it is something else. Maybe there is an error where the llvm fma intrinsic is not using the fma op in that case?
It could also be that your CPU only has FP32 and/or FP64 units and is converting the FP16 before calculating
Found another stackoverflow answer:
I think this is getting into the realm of stuff the compiler really should resolve according to the target microarchitecture
Yeah, that's why I made the issue. Maybe this is what is happening (though from what I understand I do have support for float and double, but maybe I'm wrong), but I think the compiler should solve the problem depending, as you say, on the architecture it is compiling for.
float and double
float = FP32 and double = FP64. FP16 is pretty new and niche to neural networks, so this is likely a problem with them assuming users have a newer chip. I also have a pretty old CPU and fma is slow as a snail, but in my case I have no FMA at all.
compiler should solve the problem
totally, maybe change the title of the issue to something like "FMA instruction lowering issue, slow execution for architectures with no hardware IC"
Hmmm okay, true, I understand. Yeah, looking into it, it seems that I do have fp16 support, but fp16 fma seems to be something newer (it seems Zen 4 has it, and the M1 also seems to have it). I'll change the title, thank you.
Is this still an issue?
For now I don't have access to this computer, so I won't be able to confirm whether the bug still exists, but I think it still does. The bug was on a Ryzen 3600X running Pop!_OS Linux (if it helps to replicate the problem).
And it seems the problem is more that the compiler doesn't try to emit correct code if the CPU doesn't support that kind of instruction. I think it should be possible for the compiler to instead emit a non-fma operation for the fp16 type if the CPU doesn't support it. Well, that's what I understand is happening, because the 3600X CPU supposedly has fp16 support, just not fp16 fma support.
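As a user-side workaround (separate from the compiler fix being asked for here), one could imagine a small helper that skips the fused op for the 16-bit float types; this is just a hypothetical sketch, and mul_add is not part of any stdlib:

```mojo
# Hypothetical helper: fall back to a plain multiply-add for the 16-bit float
# types, where hardware FMA support may be missing, and keep the fused op
# for the types where it is expected to be fast.
fn mul_add[dtype: DType, width: Int](
    a: SIMD[dtype, width], b: SIMD[dtype, width], c: SIMD[dtype, width]
) -> SIMD[dtype, width]:
    @parameter
    if dtype == DType.float16 or dtype == DType.bfloat16:
        # Two roundings, but no emulated fp16 FMA on CPUs that lack it.
        return a * b + c
    else:
        # Fused multiply-add where hardware support is the common case.
        return a.fma(b, c)
```

The inner loop of basic_matmul could then call mul_add instead of the SIMD fma method; whether the compiler should do this lowering automatically per target is exactly the question raised in this issue.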
Bug description
There seems to be a bug in the FMA op when using float16 and bfloat16 values.
Steps to reproduce
Code: https://github.com/Benny-Nottonson/Mojo-Marathons/tree/cd3c6f2ba39500ba94736312b55fef36a4c4c4d6. When running main.mojo and running the test on Linux, I get this error:
AssertionError: 24.4375 is not close to 24.453125 with a diff of 0.015625
mojo: error: execution exited with a non-zero result: 1
for the (1, 47, 97) matrix shape when using float16. But when using float32 or float64, changing basic_matmul in test.mojo to not use the fma operator, or running it on macOS, the test runs correctly. I also get very slow speeds when using the fma op for float16. This happens on Linux on a Ryzen 3600X; on an M1 chip this problem doesn't happen.
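A minimal, self-contained sketch of the kind of check that trips here; the loop length, values, and tolerance are assumptions for illustration and are not the Mojo-Marathons test code:

```mojo
from math import fma
from testing import assert_almost_equal


fn main() raises:
    # Accumulate a small dot product in float16 two ways: with fma and with
    # a plain multiply-add. The inputs below are arbitrary illustrative values.
    var n = 97
    var acc_fma = Float16(0)
    var acc_mul = Float16(0)
    for i in range(n):
        var a = Float16(0.1) * Float16(i % 7 + 1)
        var b = Float16(0.3) + Float16(0.07) * Float16(i % 5)
        acc_fma = fma(a, b, acc_fma)
        acc_mul = acc_mul + a * b

    # On a CPU without native fp16 FMA, the fused accumulation can drift
    # away from the plain multiply-add one by more than a tight tolerance,
    # producing a failure like the one quoted above.
    assert_almost_equal(acc_fma, acc_mul, atol=1e-2)
```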
System information