
[CPU] Small differences in output of llama2 7b int4 model on different CPUs #14842

Open Max191 opened 1 year ago

Max191 commented 1 year ago

Compiling and running the llama2 7b int4 model on CPU produces slightly different iree-run-module outputs on different machines.

compile command:

iree-compile --iree-input-type=tm_tensor --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-llvmcpu-target-triple=x86_64-linux-gnu --iree-llvmcpu-enable-microkernels --iree-llvmcpu-stack-allocation-limit=256000 --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-vm-bytecode-module-strip-source-map=true --iree-util-zero-fill-elided-attrs --iree-vm-target-truncate-unsupported-floats --iree-codegen-check-ir-before-llvm-conversion=false --iree-opt-const-expr-hoisting=False ~/llama2_7b_int4.mlir -o llama2_7b_cpu.vmfb

run command:

iree-run-module --module=llama2_7b_cpu.vmfb --device=local-task --function=second_vicuna_forward --input=1x1xi64 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16  --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16 --input=1x32x1x128xf16

In https://github.com/openxla/iree/issues/14772 the reported outputs for CPU are here: https://console.cloud.google.com/storage/browser/_details/shark-public/vivian/llama2_7b_results/llama2_7b_cpu_second_results.txt;tab=live_object?authuser=0&project=nod-cloud

The outputs from my machine are here: https://drive.google.com/file/d/14VpZ9HHQCK3fDxGmySI9hDyAZBIeXqMx/view?usp=sharing

These outputs are slightly different despite compiling from the same IREE build and using the same commands. However, both produce models that generate reasonable text. It is also worth noting that the iree-cpuinfo output differs between the two machines in question, but the outputs differ even without the --iree-llvmcpu-target-cpu-features=host flag.

Output from my machine without --iree-llvmcpu-target-cpu-features=host: https://drive.google.com/file/d/1XHWhROSdIdrFUrZGE94TpHq72J-g-tJ8/view?usp=sharing

Output of iree-cpuinfo on my machine:

sse3                 1
ssse3                1
sse4.1               1
sse4.2               1
sse4a                1
avx                  1
fma                  1
fma4                 0
xop                  0
f16c                 1
avx2                 1
avx512f              1
avx512cd             1
avx512vl             1
avx512dq             1
avx512bw             1
avx512ifma           1
avx512vbmi           1
avx512vpopcntdq      1
avx512vnni           1
avx512vbmi2          1
avx512bitalg         1
avx512bf16           1
avx512fp16           1
amx-tile             1
amx-int8             1
amx-bf16             1

Output from https://github.com/openxla/iree/issues/14772 without --iree-llvmcpu-target-cpu-features=host is the same as with --iree-llvmcpu-target-cpu-features=host.

Output of iree-cpuinfo:

sse3                 1
ssse3                1
sse4.1               1
sse4.2               1
sse4a                0
avx                  1
fma                  1
fma4                 0
xop                  0
f16c                 1
avx2                 1
avx512f              0
avx512cd             0
avx512vl             0
avx512dq             0
avx512bw             0
avx512ifma           0
avx512vbmi           0
avx512vpopcntdq      0
avx512vnni           0
avx512vbmi2          0
avx512bitalg         0
avx512bf16           0
avx512fp16           0
amx-tile             0
amx-int8             0
amx-bf16             0
powderluv commented 1 year ago

Please also add the iree-cpuinfo of both machines. And results with / without the host flag.

Max191 commented 1 year ago

Please also add the iree-cpuinfo of both machines. And results with / without the host flag.

Updated

MaheshRavishankar commented 1 year ago

Could you try with --iree-llvmcpu-reassociate-fp-reductions=false and see if the differences go away?

bjacob commented 1 year ago

It's not clear that anything here is not "working as intended". When codegen is specialized for different ISA variants (as is achieved by the --iree-llvmcpu-target-cpu-features flag), the resulting floating-point code is allowed to produce slightly different results. When quantizing to low bit depths, these tiny variations may get amplified, as the rounding error on some values causes them to move to a different step on the quantized scale; with int4 quantization there are only 16 steps, so each step is a particularly big jump.
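
A minimal numpy sketch of this amplification (the scale and accumulator values are made up for illustration, not taken from this model): two f16 results that differ by well under 0.1% land on different steps of a 16-level int4 scale.

    import numpy as np

    # Hypothetical quantization scale and two near-identical f16 results,
    # as might come out of two different instruction orderings:
    scale = np.float16(0.5)
    x_a = np.float16(3.750)
    x_b = np.float16(3.749)   # differs from x_a by about 0.05%

    # Quantize to unsigned int4 (16 steps):
    q_a = int(np.clip(np.round(x_a / scale), 0, 15))  # -> 8
    q_b = int(np.clip(np.round(x_b / scale), 0, 15))  # -> 7

    print(q_a, q_b)  # the tiny float difference became a whole quantization step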

As to

but even without the --iree-llvmcpu-target-cpu-features=host flag, the outputs are different

IIUC the above is only saying that the output without --iree-llvmcpu-target-cpu-features=host on one machine is different from the output with this flag on either machine. That's still OK. What would be a bit more surprising would be if the output were different with the same exact compilation command line without the flag on both machines.

Even so, as soon as any floating point is involved, it is hard to make clear statements that output on different machines should be exactly the same. What Mahesh mentions above is one way to resolve some floating point discrepancies, but it doesn't take care of all of them.
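
For concreteness, a tiny numpy sketch (values invented for illustration) of the non-associativity at play, in the same f16 type this model computes in:

    import numpy as np

    a = np.float16(2048.0)
    b = np.float16(-2048.0)
    c = np.float16(0.25)

    left  = (a + b) + c   # -> 0.25
    right = a + (b + c)   # -> 0.0, because b + c rounds back to -2048.0 in f16

    print(left, right)    # same inputs, different association, different bits

A vectorized reduction sums in a different order than a scalar one, so switching ISA features effectively switches between the two associations.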

With pure float32 workloads, these discrepancies are small enough to ignore; with low-bit-depth quantized workloads, they do get large, as observed here, even though the user-facing results may still be good, also as observed here. This has, historically, been a major pain point with NN quantization, until a consensus emerged, embodied in the TOSA spec, to specify NN quantization in a way that enables bit-for-bit exact results across implementations. That, however, is essentially only possible if floating-point arithmetic is avoided throughout. That's why we generally have no floating-point at all in 8-bit quantized workloads (at least, having any would violate the TOSA spec).
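
As a sketch of how that works (a simplified take on the TOSA/gemmlowp-style integer rescale, with a made-up scale value; not IREE code): the float scale is replaced by a fixed-point multiplier and shift, so every implementation computes the identical integer result.

    import math

    def quantize_multiplier(scale):
        # Express `scale` as m * 2**-shift, with m a 31-bit integer.
        frac, exp = math.frexp(scale)      # scale = frac * 2**exp, 0.5 <= frac < 1
        m = round(frac * (1 << 31))
        return m, 31 - exp

    def rescale(acc, m, shift):
        # Integer-only round-to-nearest of acc * scale: no float anywhere,
        # so the result is bit-identical on every machine and ISA.
        return (acc * m + (1 << (shift - 1))) >> shift

    m, shift = quantize_multiplier(0.0123)  # hypothetical requantization scale
    print(rescale(1000, m, shift))          # 12, i.e. round(1000 * 0.0123)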

That's why I have been slightly concerned, watching the current int4 quantization work from afar (and being away on vacation for much of that time), to see some int4 quantized workloads dequantizing to f32. I get that int4 quantization is done for memory compression, and that dequantizing to f32 is an easy shortcut around having to think more about being 100% correct. But the fact that we're having this conversation suggests that now would be a good time to invest in avoiding dequantization to float. Instead, int4 could be expanded to int8 to go through the existing support for 8-bit quantized models. That would avoid float arithmetic and produce bit-for-bit identical results across compilation settings and across machines.

EDIT - I've looked at the IR now, confirming it does dequantize i4 to float --- and actually it's to f16, not f32, which helps explain why the floating-point rounding discrepancies have such a big impact. A typical generic op performing an i4->f16 dequantization:

    %800 = linalg.generic {indexing_maps = [#map2, #map12, #map12, #map2], iterator_types = ["parallel", "parallel", "parallel"]} ins(%expanded_3, %735, %60 : tensor<4096x32x128xi4>, tensor<4096x32x1xf16>, tensor<4096x32x1xf16>) outs(%797 : tensor<4096x32x128xf16>) {
    ^bb0(%in: i4, %in_1222: f16, %in_1223: f16, %out: f16):
      %3161 = arith.extui %in : i4 to i32
      %3162 = arith.uitofp %3161 : i32 to f16
      %3163 = arith.subf %3162, %in_1223 : f16
      %3164 = arith.mulf %3163, %in_1222 : f16
      linalg.yield %3164 : f16
    } -> tensor<4096x32x128xf16>
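
To make the suggested alternative concrete, a small numpy sketch (hypothetical shapes, data, and int8 activations; the real model keeps activations in f16) contrasting the two paths:

    import numpy as np

    rng = np.random.default_rng(0)
    w_i4 = rng.integers(0, 16, size=256).astype(np.int32)      # unsigned i4 weights
    x_q  = rng.integers(-128, 128, size=256).astype(np.int32)  # int8-quantized activations
    zp, scale = 8, np.float16(0.01)                            # per-group zero point / scale

    # Current path: dequantize i4 -> f16 and accumulate in float.
    # The f16 sum depends on accumulation order, hence on vector width / ISA.
    deq = (w_i4.astype(np.float16) - np.float16(zp)) * scale
    acc_f16 = np.float16(0.0)
    for term in deq * x_q.astype(np.float16):
        acc_f16 = np.float16(acc_f16 + term)

    # Suggested path: widen i4 -> i8 and keep the dot product in integers.
    # Integer addition is associative, so any order yields the same bits;
    # the scale can be applied once at the end (or via an integer rescale).
    w_i8 = (w_i4 - zp).astype(np.int8)
    acc_i32 = int(np.dot(w_i8.astype(np.int32), x_q))          # exact, order-independent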
Max191 commented 1 year ago

We have tried compiling with the --iree-llvmcpu-reassociate-fp-reductions=false flag and without the --iree-llvmcpu-target-cpu-features=host flag now, and we get matching results. However, with these flags, the benchmark is very slow now, about 35 seconds.

Here is the new full compile command:

iree-compile --iree-input-type=tm_tensor --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-triple=x86_64-linux-gnu --iree-llvmcpu-enable-microkernels --iree-llvmcpu-stack-allocation-limit=256000 --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-vm-bytecode-module-strip-source-map=true --iree-util-zero-fill-elided-attrs --iree-vm-target-truncate-unsupported-floats --iree-codegen-check-ir-before-llvm-conversion=false --iree-opt-const-expr-hoisting=False --iree-llvmcpu-reassociate-fp-reductions=false ../models/llama2_7b_int4.mlir -o llama2_7b_cpu.vmfb

@yzhang93 is checking if the model from this command generates reasonable tokens

Additionally, I am now on 2f9a42ca657d7bbd2a7976d18ba706f8abf6eb3f and using this new mlir: https://storage.googleapis.com/shark_tank/llama2_7b/unsharded/mlir/llama2_7b_int4_with_broadcast_folding_fixed.mlirbc

(The new IREE version and new model are not affecting the results)

bjacob commented 1 year ago

and we get matching results.

Thanks for checking --- that confirms that the discrepancy is safe to ignore (we aren't observing anything here other than the non-associativity of floating-point arithmetic).

the benchmark is very slow now, about 35 seconds.

Sure - not surprising, not worth investigating. --iree-llvmcpu-reassociate-fp-reductions=false was just a debugging flag here.

I think at this point we can close this issue as working as intended, and have a follow-up conversation on the Nod side to change the workload's approach from "dequantize i4 to f16" to "extend i4 to i8", as suggested in my previous comment?

yzhang93 commented 1 year ago

@yzhang93 is checking if the model from this command generates reasonable tokens

After adding the --iree-llvmcpu-reassociate-fp-reductions=false flag, it runs very slowly in SHARK... Without this flag I was able to get the output tokens immediately after the input, but now it takes about 30 minutes to generate outputs.

MaheshRavishankar commented 1 year ago

In theory you should still be able to use iree-llvmcpu-target-cpu-features=host and just add iree-llvmcpu-reassociate-fp-reductions=false and still get correct results. The former flag is the one that is affecting your performance that much. There is still a hit from dropping the reassociation, but it shouldn't be that much. Seconding what Benoit says, though: this is within the realm of floating-point reassociation that everyone plays fast and loose with. If the "correctness" depends on one particular reassociation, then there is no hope here.

yzhang93 commented 1 year ago

In theory you should still be able to use iree-llvmcpu-target-cpu-features=host and just add iree-llvmcpu-reassociate-fp-reductions=false and still get correct results. The former flag is the one that is affecting your performance that much.

Yes, when I added --iree-llvmcpu-target-cpu-features=host, I can again get responses in a short time. So I think we'll need to use this flag for performance.

@Max191 With this flag added, I'm getting different numerical results from the previous outputs. Please cross-check with both iree-llvmcpu-target-cpu-features=host and iree-llvmcpu-reassociate-fp-reductions=false added.

Max191 commented 1 year ago

@Max191 With this flag added, I'm getting different numerical results from the previous outputs. Please cross-check with both iree-llvmcpu-target-cpu-features=host and iree-llvmcpu-reassociate-fp-reductions=false added.

We have matching results with both of these flags added as well

Max191 commented 1 year ago

The iree-llvmcpu-reassociate-fp-reductions=false flag causes about a 2x slowdown, so it would be good if we could get matching results without it.

Benchmark with iree-llvmcpu-reassociate-fp-reductions=false:

2023-08-28T14:39:12-04:00
Running iree-benchmark-module
Run on (24 X 5732.71 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x12)
  L1 Instruction 32 KiB (x12)
  L2 Unified 1024 KiB (x12)
  L3 Unified 32768 KiB (x2)
Load Average: 0.07, 0.08, 0.17
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
----------------------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------
BM_second_vicuna_forward/process_time/real_time       2223 ms        12262 ms            1 items_per_second=0.449882/s

Benchmark without iree-llvmcpu-reassociate-fp-reductions=false:

2023-08-28T14:45:45-04:00
Running iree-benchmark-module
Run on (24 X 5732.71 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x12)
  L1 Instruction 32 KiB (x12)
  L2 Unified 1024 KiB (x12)
  L3 Unified 32768 KiB (x2)
Load Average: 1.17, 0.54, 0.31
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
----------------------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------------------
BM_second_vicuna_forward/process_time/real_time       1196 ms         6555 ms            1 items_per_second=0.836426/s
yzhang93 commented 1 year ago

The iree-llvmcpu-reassociate-fp-reductions=false flag causes about a 2x slowdown, so it would be good if we could get matching results without it.

I think even though the numerical results have a small difference without the iree-llvmcpu-reassociate-fp-reductions=false flag, the output tokens seem to be the same -- at least for the two simple questions I tested. We can probably add more test questions and check whether the tokens match.

MaheshRavishankar commented 1 year ago

The iree-llvmcpu-reassociate-fp-reductions=false flag causes about a 2x slowdown, so it would be good if we could get matching results without it.

You can't :P. You are hitting floating-point reassociation differences; a bit-exact match is not going to happen. You need to check whether the variation degrades the final result (not the bit values, but the actual result).