Max191 opened this issue 1 year ago
Please also add the `iree-cpuinfo` of both machines. And results with / without the host flag.
> Please also add the `iree-cpuinfo` of both machines. And results with / without the host flag.

Updated
Could you try with `--iree-llvmcpu-reassociate-fp-reductions=false` and see if the differences go away?
It's not clear that anything here is not "working as intended". When codegen is specialized for different ISA variants (as is achieved by the `--iree-llvmcpu-target-cpu-features` flag), the resulting floating-point code is allowed to produce slightly different results. When quantizing to low bit depths, these tiny variations may get amplified, as the rounding error on some values causes them to move to a different step on the quantized scale; with int4 quantization, there are only 16 steps, so that makes a particularly big jump.
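To make that amplification concrete, here is a toy sketch (illustrative only; the scale and values are made up, not taken from this model) of how a float difference of a couple of 1e-5 can become a full quantization step:

```python
# Toy illustration (made-up numbers): a tiny float discrepancy can push a
# value across an int4 quantization step, amplifying the error to one full
# step on the quantized scale.
SCALE = 0.1  # hypothetical size of one quantized step

def quantize_int4(x, scale):
    # Round to the nearest representable signed int4 step (16 steps total).
    q = round(x / scale)
    return max(-8, min(7, q))

a = 0.24999  # result under one reduction order
b = 0.25001  # same value under a different reassociation
print(quantize_int4(a, SCALE), quantize_int4(b, SCALE))  # -> 2 3
```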
As to

> but even without the `--iree-llvmcpu-target-cpu-features=host` flag, the outputs are different

IIUC the above is only saying that the output without `--iree-llvmcpu-target-cpu-features=host` on one machine is different from the output with this flag on either machine. That's still OK. What would be a bit more surprising would be if the output were different with the same exact compilation command line, without the flag, on both machines.
Even so, as soon as any floating point is involved, it is hard to make clear statements that the output on different machines should be exactly the same. What Mahesh mentions above is one way to resolve some floating-point discrepancies, but it doesn't take care of all of them.

With pure float32 workloads, these discrepancies are small enough to ignore; with low-bit-depth quantized workloads, they do get large, as observed here, even though the user-facing results may still be good, also as observed here. This has historically been a major pain point with NN quantization, until a consensus emerged, embodied in the TOSA spec, to specify NN quantization in a way that enables bit-for-bit exact results across implementations. That, however, is essentially only possible if floating-point arithmetic is avoided throughout. That's why we generally have no floating point at all in 8-bit quantized workloads (at least, having any would violate the TOSA spec).

That's why I have been slightly concerned, watching the current int4 quantization work from afar (and being away on vacation for much of that time), to see some int4 quantized workloads dequantizing to f32. I get that int4 quantization is done for memory compression, and that dequantizing to f32 is an easy shortcut around having to think more about being 100% correct. But the fact that we're having this conversation now suggests that now would be a good time to invest in avoiding dequantization to float. Instead, int4 could be expanded to int8 to go through the existing support for 8-bit-quantized models. This would avoid float arithmetic and produce bit-for-bit identical results across compilation settings and across machines.
EDIT - I've looked at the IR now, confirming it does dequantize i4 to float --- and actually it's to f16, not f32, which helps explain why the floating-point rounding discrepancies have such a big impact. A typical generic op performing an i4->f16 dequantization:
```mlir
%800 = linalg.generic {indexing_maps = [#map2, #map12, #map12, #map2],
                       iterator_types = ["parallel", "parallel", "parallel"]}
    ins(%expanded_3, %735, %60 : tensor<4096x32x128xi4>, tensor<4096x32x1xf16>, tensor<4096x32x1xf16>)
    outs(%797 : tensor<4096x32x128xf16>) {
^bb0(%in: i4, %in_1222: f16, %in_1223: f16, %out: f16):
  %3161 = arith.extui %in : i4 to i32       // widen the i4 weight
  %3162 = arith.uitofp %3161 : i32 to f16   // convert to floating point
  %3163 = arith.subf %3162, %in_1223 : f16  // subtract the zero point, in f16
  %3164 = arith.mulf %3163, %in_1222 : f16  // apply the scale, in f16
  linalg.yield %3164 : f16
} -> tensor<4096x32x128xf16>
```
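To contrast the two approaches numerically, here is a small NumPy sketch (my own illustration with made-up data, not code from the model or from IREE): the f16 dequantize-then-reduce path is sensitive to reduction order, while the suggested widen-i4-to-i8 path stays exact in any order.

```python
import numpy as np

# Stand-in for one block of i4 weights (values 0..15) with made-up
# f16 scale and zero point.
rng = np.random.default_rng(0)
w = rng.integers(0, 16, size=128).astype(np.uint8)
scale, zero = np.float16(0.031), np.float16(7.0)

# Current path, mirroring the generic op above: dequantize i4 -> f16.
deq = (w.astype(np.float16) - zero) * scale
# An f16 reduction over the dequantized values can depend on summation order:
print(deq.sum() == deq[::-1].sum())  # not guaranteed to be True

# Suggested path: extend i4 to i8 and keep the arithmetic in integers.
widened = w.astype(np.int8) - np.int8(zero)
print(widened.sum() == widened[::-1].sum())  # always True: integer sums are exact
```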
We have tried compiling with the `--iree-llvmcpu-reassociate-fp-reductions=false` flag and without the `--iree-llvmcpu-target-cpu-features=host` flag now, and we get matching results. However, with these flags, the benchmark is very slow now, about 35 seconds.
Here is the new full compile command:

```
iree-compile --iree-input-type=tm_tensor --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-triple=x86_64-linux-gnu --iree-llvmcpu-enable-microkernels --iree-llvmcpu-stack-allocation-limit=256000 --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-vm-bytecode-module-strip-source-map=true --iree-util-zero-fill-elided-attrs --iree-vm-target-truncate-unsupported-floats --iree-codegen-check-ir-before-llvm-conversion=false --iree-opt-const-expr-hoisting=False --iree-llvmcpu-reassociate-fp-reductions=false ../models/llama2_7b_int4.mlir -o llama2_7b_cpu.vmfb
```
@yzhang93 is checking if the model from this command generates reasonable tokens
Additionally, I am now on commit `2f9a42ca657d7bbd2a7976d18ba706f8abf6eb3f` and using this new MLIR file: https://storage.googleapis.com/shark_tank/llama2_7b/unsharded/mlir/llama2_7b_int4_with_broadcast_folding_fixed.mlirbc

(The new IREE version and new model are not affecting the results.)
> and we get matching results.

Thanks for checking --- that confirms that the discrepancy is safe to ignore (as we aren't observing anything here other than the non-associativity of floating-point arithmetic).

> the benchmark is very slow now, about 35 seconds.

Sure - not surprising, and not worth investigating. `--iree-llvmcpu-reassociate-fp-reductions=false` was just a debugging flag here.
I think at this point we can close this issue as working as intended, and have a follow-up conversation on the Nod side to change the workload's approach from "dequantize i4 to f16" to "extend i4 to i8", as suggested in my previous comment?
> @yzhang93 is checking if the model from this command generates reasonable tokens

After adding the `--iree-llvmcpu-reassociate-fp-reductions=false` flag, it runs very slowly in SHARK... Without this flag I was able to get the output tokens immediately after the input, but now it takes about 30 minutes or so to generate outputs.
In theory you should still be able to use `iree-llvmcpu-target-cpu-features=host` and just add `iree-llvmcpu-reassociate-fp-reductions=false` and still get correct results. The former flag is the one that is affecting your performance that much. There is still a hit from dropping the reassociation, but it shouldn't be that much.

Second what Benoit says, though. This is within the realm of floating-point reassociation that everyone plays fast and loose with. If the "correctness" depends on "one particular reassociation", then there is no hope here.
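For reference, the non-associativity being referred to shows up even with a three-term sum (illustrative one-liner, not from this thread):

```python
# Floating-point addition is not associative: reassociating changes the bits.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                 # 0.6000000000000001
print(a + (b + c))                 # 0.6
print((a + b) + c == a + (b + c))  # False
```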
> In theory you should still be able to use `iree-llvmcpu-target-cpu-features=host` and just add `iree-llvmcpu-reassociate-fp-reductions=false` and still get correct results. The former flag is the one that is affecting your performance that much.

Yes, when I added `--iree-llvmcpu-target-cpu-features=host`, I can again get responses in a short time. So I think we'll need to use this flag for performance.
@Max191 With this flag added, I'm getting different numerical results from the previous outputs. Please cross-check with both `iree-llvmcpu-target-cpu-features=host` and `iree-llvmcpu-reassociate-fp-reductions=false` added.
> @Max191 With this flag added, I'm getting different numerical results from the previous outputs. Please cross-check with both `iree-llvmcpu-target-cpu-features=host` and `iree-llvmcpu-reassociate-fp-reductions=false` added.

We have matching results with both of these flags added as well.
The `iree-llvmcpu-reassociate-fp-reductions=false` flag causes about a 2x slowdown, so it would be good if we could get matching results without it.
Benchmark with `iree-llvmcpu-reassociate-fp-reductions=false`:

```
2023-08-28T14:39:12-04:00
Running iree-benchmark-module
Run on (24 X 5732.71 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x12)
  L1 Instruction 32 KiB (x12)
  L2 Unified 1024 KiB (x12)
  L3 Unified 32768 KiB (x2)
Load Average: 0.07, 0.08, 0.17
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
----------------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------
BM_second_vicuna_forward/process_time/real_time  2223 ms        12262 ms            1 items_per_second=0.449882/s
```
Benchmark without `iree-llvmcpu-reassociate-fp-reductions=false`:

```
2023-08-28T14:45:45-04:00
Running iree-benchmark-module
Run on (24 X 5732.71 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x12)
  L1 Instruction 32 KiB (x12)
  L2 Unified 1024 KiB (x12)
  L3 Unified 32768 KiB (x2)
Load Average: 1.17, 0.54, 0.31
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
----------------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------
BM_second_vicuna_forward/process_time/real_time  1196 ms         6555 ms            1 items_per_second=0.836426/s
```
> The `iree-llvmcpu-reassociate-fp-reductions=false` flag causes about a 2x slowdown, so it would be good if we could get matching results without it.

I think even though the numerical results have a small difference without the `iree-llvmcpu-reassociate-fp-reductions=false` flag, the output tokens seem to be the same -- at least for the two simple questions I tested. We can probably add more test questions and check whether the tokens match.
> The `iree-llvmcpu-reassociate-fp-reductions=false` flag causes about a 2x slowdown, so it would be good if we could get matching results without it.

You can't :P. You are hitting floating-point reassociation errors; a bit-exact match is not going to happen. You need to check whether the variation degrades the final result (not the bit values, but the actual result).
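One hypothetical way to run that kind of check (the file names and tolerances below are placeholders, not anything from this thread): dump the output tensors from both configurations and compare them within a tolerance rather than bit for bit.

```python
import numpy as np

# Placeholder file names: outputs dumped from the two compilation configs.
out_a = np.load("outputs_with_reassoc.npy")
out_b = np.load("outputs_without_reassoc.npy")

print(np.max(np.abs(out_a - out_b)))                    # worst-case absolute difference
print(np.allclose(out_a, out_b, rtol=1e-2, atol=1e-3))  # tolerance-based comparison
```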
Compiling and running llama2 7b int4 on CPU causes different results in `iree-run-module`.

compile command:

run command:
In https://github.com/openxla/iree/issues/14772 the reported outputs for CPU are here: https://console.cloud.google.com/storage/browser/_details/shark-public/vivian/llama2_7b_results/llama2_7b_cpu_second_results.txt;tab=live_object?authuser=0&project=nod-cloud

The outputs from my machine are here: https://drive.google.com/file/d/14VpZ9HHQCK3fDxGmySI9hDyAZBIeXqMx/view?usp=sharing
These outputs are slightly different despite compiling from the same IREE build and using the same commands. However, both produce models that generate reasonable text. It is also worth noting that the `iree-cpuinfo` is different for the two machines in question, but even without the `--iree-llvmcpu-target-cpu-features=host` flag, the outputs are different.

Output from my machine without `--iree-llvmcpu-target-cpu-features=host`: https://drive.google.com/file/d/1XHWhROSdIdrFUrZGE94TpHq72J-g-tJ8/view?usp=sharing

Output of `iree-cpuinfo` on my machine:

Output from https://github.com/openxla/iree/issues/14772 without `--iree-llvmcpu-target-cpu-features=host` is the same as with `--iree-llvmcpu-target-cpu-features=host`.

Output of `iree-cpuinfo`: