Open monorimet opened 1 year ago
I also noticed a significant performance delta for larger sequence lengths. Is this expected? e2e execution for the sequence length 128 version of OPT takes about 9 times longer than the tiny seqlen-8 version from above.
Here is a tracy profile for the model executed end-to-end.
To reproduce this trace:
Download opt-1_3b-causallm_128_torch.mlir
Run iree-compile (or download opt-1_3b_causallm_128_torch_cpu-task.vmfb):
iree-compile ./opt-1_3b-causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu -o opt-1_3b_causallm_128_torch_cpu-task.vmfb
Benchmark opt-1_3b_causallm_128_torch_cpu-task.vmfb:
TRACY_NO_EXIT=1 iree-benchmark-module --module=opt-1_3b_causallm_128_torch_cpu-task.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=50 --task_topology_max_group_count=16
capture and profile the trace.
Here is a screenshot of the dispatches from the profile statistics, ordered by total runtime:
Do you have tuned results for dispatch 17 and 18? Looks like we should first boil down these ones.
Tuning these dispatches for tile/workgroup sizes does not yield a significant performance delta, unfortunately.
cc @MaheshRavishankar @hanhanW
Current status:
The sequence length = 8 case should see significant improvements from @bjacob's work; see this discord thread.
We are discussing the matmul cases where M > 16, as the associated optimization space is drastically different from the narrow matmul cases. (Ideally the existing codegen will yield acceptable performance with the right flags -- I will update this issue to confirm whether the 'wider' matmuls are performant with the correct flags.)
thanks @monorimet for the summary; FTR, the above-linked discord thread also contains suggestions of flags to use for the M>=16 case. looking forward to hearing how that performs; we can discuss next steps from there.
Here is an updated set of artifacts and tracy profile for the case where sequence length (M dimension) = 128, at fp16 precision.
The .mlir can be found here: opt-1_3b_causallm_128_torch.mlir
The .vmfb is located here: opt-1_3b_causallm_128_torch_cpu-task.vmfb
The tracy profile is also provided here: opt128_noukernels.tracy
Case 2 - with microkernels enabled:
In this case, I was surprised to see a sharp decrease in performance. I assume that we are simply not in a case where this flag helps, but in case it seems wrong I wanted to share my results:
The .vmfb is located here: opt-1_3b_causallm_128_torch_cpu-task_ukernels.vmfb
The tracy profile is also provided here: opt128_ukernels.tracy
The likely reason why microkernels decrease performance here is the `f16` element type --- I have not yet added the optimized code paths for `f16` microkernels, so it's currently very slow generic code. I plan to do so this week though, so we can retry then.
Just checking - is it intentional that `f16` was the element type here? The benchmarks discussed above were `f32`, and the last comment even says:
Here is an updated set of artifacts and tracy profile for the case where sequence length (M dimension) = 128, at fp32 precision.
I am also confused to see the f16 precision. Let me make sure this .mlir is fp32 precision and I will update accordingly.
Edit: Yes, my mistake, this .mlir is in half-precision. I'll post again with the fp32 profiles in a few minutes.
Running the same flags as above with the fp32 OPT.mlir results in a segfault in iree-benchmark-module.
I will be removing flags to see if any specific ones are the culprit.
Here are the commands I'm using where the segfaults occur (these worked in fp16):
/home/ean/iree-build/tools/iree-compile ./opt-1_3b_causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-flow-enable-data-tiling --iree-llvmcpu-target-cpu=cascadelake --iree-llvmcpu-stack-allocation-limit=256000 --iree-llvmcpu-enable-microkernels -o opt-1_3b_causallm_128_torch_cpu-task_ukernels.vmfb
TRACY_NO_EXIT=1 /home/ean/iree-build/tools/iree-benchmark-module --module=opt-1_3b_causallm_128_torch_cpu-task.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=10 --task_topology_max_group_count=16 --device=local-task
2023-07-11T18:34:15+00:00
Running /home/ean/iree-build/tools/iree-benchmark-module
Run on (16 X 2800.27 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1024 KiB (x8)
L3 Unified 33792 KiB (x1)
Load Average: 21.13, 25.39, 20.56
Segmentation fault (core dumped)
The same occurs without the microkernels flag. I am noticing stack allocations requiring the `--iree-llvmcpu-stack-allocation-limit` flag to be set at 131072 (iree-compile fails if it is set lower).
It seems `--iree-flow-enable-data-tiling` was the flag causing the segfault in f32. I will upload results without this flag shortly.
That is the flag that triggers everything. Without that flag, the other flags don't do anything. I will take a look, but it will take me some time (a day or so) to get around to it. We need to track down the stack allocation issue as well.
OK. In the meantime I will find out if the sequence length (M dimension in matmuls) is relevant to the segfaults-- 128 is a somewhat arbitrary choice so if we can find some other M>16 that works with data tiling then we can be at least temporarily unblocked.
It seems the data tiling flag now causes this segfault for all sequence lengths. I will see what I can do to bisect the problem and, if possible, isolate the problematic dispatches.
@monorimet I can take a look at this this week. I have to finish a few things before I can get to it. So I'll post here when I get to this.
OK. I have tracked the segfault down to dispatch 25 (correct me if the numbering could be wrong, I'm not sure if the number in the function signature is the same as the index, but the segfault begins to occur with flow-level data tiling starting at index 25 as shown:)
I was able to compile and run dispatch 25 with the following input:
(shark.venv) ean@ean-highmem:~/SHARK/tank/examples/opt$ /home/ean/iree-build/tools/iree-compile ./opt-1_3b-causallm_cpu_dispatches/module_forward_dispatch_25_embedded_elf_x86_64_benchmark.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-llvmcpu-target-cpu=cascadelake --iree-llvmcpu-stack-allocation-limit=131072 --iree-flow-enable-data-tiling -o opt-1_3b_causallm_128_dispatch_25_tiled.vmfb
(shark.venv) ean@ean-highmem:~/SHARK/tank/examples/opt$ /home/ean/iree-build/tools/iree-benchmark-module --module=./opt-1_3b_causallm_128_dispatch_25_tiled.vmfb
Yielding the following results:
2023-07-12T23:56:22+00:00
Running /home/ean/iree-build/tools/iree-benchmark-module
Run on (16 X 2800.27 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1024 KiB (x8)
L3 Unified 33792 KiB (x1)
Load Average: 0.00, 0.00, 0.00
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
BM_forward_dispatch_25_embedded_elf_x86_64_forward_dispatch_25_matmul_32x8192x2048_f32/process_time/real_time 10150 us 76696 us 52 items_per_second=98.5238/s
This behavior is replicated on sequence lengths 8, 16, 32, and 128 -- It isn't very surprising as the data tiling is happening at the flow level.
I will back up to the flow level and do some more poking around, and try a few other cases tomorrow in f16 / compare with pytorch in the meantime -- please let me know if I can focus my efforts anywhere to help @MaheshRavishankar
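As a rough sanity check on that dispatch-25 number, the implied throughput can be recomputed from the matmul shape embedded in the benchmark name (a back-of-the-envelope sketch using the conventional 2*M*N*K flop count; this arithmetic is mine, not something iree-benchmark-module reports):

```python
# Shape parsed from the benchmark name matmul_32x8192x2048_f32.
M, N, K = 32, 8192, 2048
flops = 2 * M * N * K        # each multiply-accumulate counted as 2 flops
real_time_s = 10150e-6       # ~10150 us real time from the benchmark line
gflops_per_s = flops / real_time_s / 1e9
print(f"{gflops_per_s:.1f} GFLOP/s")  # prints 105.8 GFLOP/s
```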
Could you just create an issue with the dispatch itself on IREE. If you see the IR after `iree-flow-outline-dispatch-region` you will see a bunch of `flow.executable` ops. You can take the `func.func` that is within the offending dispatch and pretty easily recreate a small repro. If you post even just the `func.func` within the `flow.executable`, that should be a good enough starting point. I think the issue is with respect to `pack` and `unpack` fusion. We might be fusing too aggressively.
Also, `--mlir-elide-elementsattrs-if-larger=10` (or some such value) helps generate testcases that are not bloated by large constant data. And `--iree-util-zero-fill-elided-attrs` makes those elided constant buffers be treated as filled with zeros.
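For context on the `pack`/`unpack` fusion being discussed, here is a minimal numpy sketch of what a pack-style layout transform does (the tile sizes are illustrative, not the ones IREE's data tiling actually chooses):

```python
import numpy as np

def pack(a, tile_m, tile_k):
    # Rearrange an (M, K) row-major matrix into a
    # (M//tile_m, K//tile_k, tile_m, tile_k) array of contiguous tiles,
    # similar in spirit to what a tensor.pack op produces during data tiling.
    M, K = a.shape
    assert M % tile_m == 0 and K % tile_k == 0, "illustration assumes exact tiling"
    return (a.reshape(M // tile_m, tile_m, K // tile_k, tile_k)
             .transpose(0, 2, 1, 3)
             .copy())

def unpack(p):
    # Inverse transform: restore the row-major (M, K) layout.
    m_tiles, k_tiles, tile_m, tile_k = p.shape
    return p.transpose(0, 2, 1, 3).reshape(m_tiles * tile_m, k_tiles * tile_k)

a = np.arange(16.0).reshape(4, 4)
packed = pack(a, 2, 2)
assert packed[0, 0].tolist() == [[0.0, 1.0], [4.0, 5.0]]  # one contiguous tile
assert np.array_equal(unpack(packed), a)                  # round-trip
```

The point of the layout change is that each tile becomes contiguous in memory, which is what the inner matmul kernels want to consume.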
OK, I have filed the issue in IREE. For the segfaults we can move discussion to that issue until resolved.
I have adapted a script for running a perf comparison (SHARK/IREE vs. PyTorch) for opt1.3b causallm inference.
Without data-tiling (with SHARK's default iree cpu flags -- I can dig these out if anyone is interested) we achieve the following:
(shark.venv) ean@sharkbox:~/SHARK/tank/examples/opt$ python opt_perf_comparison.py
Loading flatbuffer at opt_1-3b_causallm_8_torch_cpu.vmfb as a mmapped file
[DEBUG] setting iree runtime flags for cpu:
--task_topology_max_group_count=14
--- Took 0.2379467487335205 seconds to load Shark.
prompt: What is the meaning of life?
prompt: Tell me something you don't know.
prompt: What does Xilinx do?
prompt: What is the mass of earth?
prompt: What is a poem?
prompt: What is recursion?
prompt: Tell me a one line joke.
prompt: Who is Gilgamesh?
prompt: Tell me something about cryptocurrency.
prompt: How did it all begin?
--- Took 4.178373336791992 seconds to run Shark.
--- Took 5.567362308502197 seconds to load Huggingface.
prompt: What is the meaning of life?
prompt: Tell me something you don't know.
prompt: What does Xilinx do?
prompt: What is the mass of earth?
prompt: What is a poem?
prompt: What is recursion?
prompt: Tell me a one line joke.
prompt: Who is Gilgamesh?
prompt: Tell me something about cryptocurrency.
prompt: How did it all begin?
--- Took 3.4269025325775146 seconds to run Huggingface.
@monorimet - With https://github.com/openxla/iree/issues/14398 now fixed, here are some benchmark results to give a flavor of performance to expect. Note - testing on an Intel Skylake Xeon CPU with AVX-512. Compiling with `--iree-llvmcpu-target-cpu=skylake-avx512`. Command lines as in the original PR description above.
So, data-tiling alone is a ~ 8x speedup. Ukernels alone are not yet good. But I'll get to that now, and it will be at least as fast as non-ukernels and in some cases faster. What's almost certainly happening here is that this particular model is f32, and f32 matmuls on ISAs like AVX-512 are what default codegen is good at. As soon as we depart from that, e.g. f16, things are more challenging for default codegen and the ukernels become more of a win.
Nice! Are these the narrow shapes case? E.g., mostly I've seen ukernels doing rather well :)
Oh I get it now - no, it's not narrow (it's `opt-1_3b_causallm_128_torch.mlir`, which has sequence length 128) but contrary to what I assumed in my previous comment, it is an `f16` model. So neither the default codegen nor the ukernels are good at the moment, but the ukernels are even worse, as they are running completely slow generic code. OK, didn't remember we were already looking at f16 --- I'll reorder my queue so f16 ukernels come sooner.
What's the target here? Without ukernels but with data tiling, this is at 72 us (maybe 515 us is just so bad that this is an unfair comparison).
I'll let Nod decide if there's a specific target; all I know is we still have plenty of room to run :-)
The first case we wanted to meet/beat pytorch performance on was OPT1.3b in fp32 precision. With data tiling, I do see a significant improvement in e2e execution time (tested on this iree SHA)
The following are e2e benchmarks on OPT-1.3b in fp32 precision, at sequence length 128, with avx512 instructions enabled. I will preface each result with the reproduction commands.
Link to opt_1-3b_causallm_128_torch.mlir
Case 1: No data tiling, no microkernels
iree-compile ./opt_1-3b_causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-llvmcpu-stack-allocation-limit=140000 -o opt_1-3b_causallm_128_torch_cpu_base.vmfb
iree-benchmark-module --module=./opt_1-3b_causallm_128_torch_cpu_base.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=10 --task_topology_max_group_count=16 --device=local-task
2023-07-18T11:06:19-07:00
Running /home/ean/SHARK/shark.venv/lib/python3.11/site-packages/iree/runtime/scripts/iree_benchmark_module/../../iree-benchmark-module
Run on (16 X 4000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1024 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.33, 0.38, 0.46
---------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time 2072 ms 15921 ms 1 items_per_second=0.482637/s
BM_forward/process_time/real_time 2060 ms 15901 ms 1 items_per_second=0.485382/s
BM_forward/process_time/real_time 2068 ms 16012 ms 1 items_per_second=0.483572/s
BM_forward/process_time/real_time 2052 ms 15923 ms 1 items_per_second=0.4873/s
BM_forward/process_time/real_time 2089 ms 16011 ms 1 items_per_second=0.478638/s
BM_forward/process_time/real_time 2177 ms 15987 ms 1 items_per_second=0.459379/s
BM_forward/process_time/real_time 2058 ms 15968 ms 1 items_per_second=0.485893/s
BM_forward/process_time/real_time 2042 ms 15825 ms 1 items_per_second=0.489682/s
BM_forward/process_time/real_time 2049 ms 15889 ms 1 items_per_second=0.488093/s
BM_forward/process_time/real_time 2053 ms 15936 ms 1 items_per_second=0.487093/s
BM_forward/process_time/real_time_mean 2072 ms 15937 ms 10 items_per_second=0.482767/s
BM_forward/process_time/real_time_median 2059 ms 15930 ms 10 items_per_second=0.485638/s
BM_forward/process_time/real_time_stddev 39.2 ms 58.9 ms 10 items_per_second=8.79873m/s
BM_forward/process_time/real_time_cv 1.89 % 0.37 % 10 items_per_second=1.82%
Case 2: Data tiling, no ukernels:
iree-compile ./opt_1-3b_causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-flow-enable-data-tiling --iree-llvmcpu-stack-allocation-limit=140000 -o opt_1-3b_causallm_128_torch_cpu_tiled.vmfb
iree-benchmark-module --module=./opt_1-3b_causallm_128_torch_cpu_tiled.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=10 --task_topology_max_group_count=14 --device=local-task
2023-07-18T12:29:20-07:00
Running /home/ean/SHARK/shark.venv/lib/python3.11/site-packages/iree/runtime/scripts/iree_benchmark_module/../../iree-benchmark-module
Run on (16 X 4000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1024 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 2.18, 1.57, 0.80
---------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time 762 ms 5892 ms 1 items_per_second=1.31257/s
BM_forward/process_time/real_time 764 ms 5918 ms 1 items_per_second=1.30817/s
BM_forward/process_time/real_time 764 ms 5911 ms 1 items_per_second=1.3083/s
BM_forward/process_time/real_time 778 ms 5893 ms 1 items_per_second=1.28507/s
BM_forward/process_time/real_time 766 ms 5936 ms 1 items_per_second=1.30487/s
BM_forward/process_time/real_time 764 ms 5934 ms 1 items_per_second=1.30868/s
BM_forward/process_time/real_time 763 ms 5926 ms 1 items_per_second=1.30978/s
BM_forward/process_time/real_time 765 ms 5929 ms 1 items_per_second=1.30763/s
BM_forward/process_time/real_time 766 ms 5924 ms 1 items_per_second=1.30607/s
BM_forward/process_time/real_time 762 ms 5895 ms 1 items_per_second=1.31302/s
BM_forward/process_time/real_time_mean 765 ms 5916 ms 10 items_per_second=1.30642/s
BM_forward/process_time/real_time_median 764 ms 5921 ms 10 items_per_second=1.30824/s
BM_forward/process_time/real_time_stddev 4.70 ms 17.1 ms 10 items_per_second=7.91746m/s
BM_forward/process_time/real_time_cv 0.61 % 0.29 % 10 items_per_second=0.61%
Case 3: Data tiling and ukernels:
iree-compile ./opt_1-3b_causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-flow-enable-data-tiling --iree-llvmcpu-stack-allocation-limit=140000 --iree-llvmcpu-enable-microkernels -o opt_1-3b_causallm_128_torch_cpu_tiled_ukernels.vmfb
iree-benchmark-module --module=./opt_1-3b_causallm_128_torch_cpu_tiled_ukernels.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=10 --task_topology_max_group_count=14 --device=local-task
2023-07-18T12:33:04-07:00
Running /home/ean/SHARK/shark.venv/lib/python3.11/site-packages/iree/runtime/scripts/iree_benchmark_module/../../iree-benchmark-module
Run on (16 X 2093.89 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1024 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.85, 1.23, 0.84
---------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time 751 ms 5786 ms 1 items_per_second=1.33228/s
BM_forward/process_time/real_time 750 ms 5808 ms 1 items_per_second=1.33304/s
BM_forward/process_time/real_time 750 ms 5814 ms 1 items_per_second=1.33299/s
BM_forward/process_time/real_time 748 ms 5775 ms 1 items_per_second=1.33752/s
BM_forward/process_time/real_time 753 ms 5824 ms 1 items_per_second=1.32875/s
BM_forward/process_time/real_time 753 ms 5773 ms 1 items_per_second=1.32838/s
BM_forward/process_time/real_time 747 ms 5768 ms 1 items_per_second=1.33878/s
BM_forward/process_time/real_time 750 ms 5791 ms 1 items_per_second=1.33385/s
BM_forward/process_time/real_time 752 ms 5801 ms 1 items_per_second=1.32931/s
BM_forward/process_time/real_time 750 ms 5803 ms 1 items_per_second=1.33333/s
BM_forward/process_time/real_time_mean 750 ms 5794 ms 10 items_per_second=1.33282/s
BM_forward/process_time/real_time_median 750 ms 5796 ms 10 items_per_second=1.33301/s
BM_forward/process_time/real_time_stddev 1.95 ms 18.8 ms 10 items_per_second=3.46296m/s
BM_forward/process_time/real_time_cv 0.26 % 0.32 % 10 items_per_second=0.26%
So in my case the tiled + ukernels mode seems to produce the best results.
Examining the performance delta vs. pytorch is a bit tricky -- we have to feed an input of 127 tokens for the pytorch model to be comparable to our sequence length 128 model. Padding with the tokenizer evidently doesn't stop PyTorch from using the smallest possible model for optimal performance. This is generally equivalent in behavior to the dynamic path in the torch-mlir/IREE stack.
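To make the fixed-shape constraint concrete, here is a small sketch (the helper and its pad-token id are hypothetical, and whether padding goes on the left or the right depends on the tokenizer configuration):

```python
# Hypothetical helper: the statically compiled module expects exactly
# 1x128 token inputs, while eager PyTorch can run directly on the
# unpadded prompt length. pad_id=1 is a placeholder value, not
# necessarily OPT's actual pad token.
def pad_to_seqlen(token_ids, seq_len=128, pad_id=1):
    if len(token_ids) > seq_len:
        raise ValueError("prompt longer than compiled sequence length")
    return [pad_id] * (seq_len - len(token_ids)) + list(token_ids)

padded = pad_to_seqlen([2, 100, 200], seq_len=8)
assert padded == [1, 1, 1, 1, 1, 2, 100, 200]
```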
Since we are looking at sequence length 128, I've just run the performance comparison with excerpts of the declaration of independence with 127 words each, for 5 iterations:
(shark.venv) ean@sharkbox:~/SHARK/tank/examples/opt$ python opt_perf_comparison.py
Loading flatbuffer at opt_1-3b_causallm_128_torch_cpu_tiled_ukernels.vmfb as a mmapped file
[DEBUG] setting iree runtime flags for cpu:
--task_topology_max_group_count=14
--- Took 0.23573708534240723 seconds to load Shark.
prompt: We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.--That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, --That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient causes
prompt: <truncated>
prompt: <truncated>
prompt: <truncated>
prompt: <truncated>
--- Took 4.245878458023071 seconds to run Shark.
--- Took 5.405649662017822 seconds to load Huggingface.
prompt: <truncated>
prompt: <truncated>
prompt: <truncated>
prompt: <truncated>
prompt: <truncated>
--- Took 3.9855310916900635 seconds to run Huggingface.
4.24 vs. 3.9 is really quite good! Is there anything I've missed in latest IREE, from these reproducers, that could push us past pytorch performance?
https://github.com/openxla/iree/pull/13822 might help a bit
So in my case the tiled + ukernels mode seems to produce the best results.
This is because this is an `f32` model, and the sequence length is high enough that this doesn't depend on microkernels' handling of narrow cases. So there, as long as you're data-tiling, both pure codegen and microkernels perform well, with microkernels being even a little faster --- that's consistent with what we've observed on other models.
When the data type is not `f32`, pure codegen tends to struggle a bit, and microkernels can shine if they have a dedicated code path for that case, or be extremely slow if they don't --- that must sound scary, but it's not, because it's a problem that we fix once and for all per element type, and then it applies to all models.
When the shapes are narrow (when the sequence length is small), codegen tends to adapt gracefully, but microkernels don't currently have fast code for narrow cases, so that's the other thing I want to fix very soon in microkernels.
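The trade-off described above can be summarized as a toy dispatch table (purely illustrative -- the names, shape classes, and coverage set here are not IREE's actual implementation):

```python
# Illustrative sketch: a microkernel library typically keys its fast
# paths on element type and shape class, and falls back to correct but
# very slow generic code when no specialized path exists -- which
# matches the f16 and narrow-shape behavior described above.
FAST_PATHS = {("f32", "wide")}  # hypothetical coverage at this point in time

def select_kernel(elem_type, m):
    shape_class = "narrow" if m < 16 else "wide"
    if (elem_type, shape_class) in FAST_PATHS:
        return "optimized"
    return "generic"  # correct but very slow

assert select_kernel("f32", m=128) == "optimized"
assert select_kernel("f16", m=128) == "generic"  # why ukernels were slow here
assert select_kernel("f32", m=8) == "generic"    # narrow case not yet covered
```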
Further performance gains beyond that point will come from: `bf16` arithmetic, so they would benefit from switching to that (even more cutting-edge is support for `f16` arithmetic, currently limited to Intel Sapphire Rapids).
Examining performance delta vs. pytorch is a bit tricky
So I understand correctly this log: `Huggingface` is the pytorch value?
openxla/iree#13822 might help a bit
Thanks. I am building with tracy on latest IREE to get a trace of the tiled ukernels case, so I will share results with `iree-llvmcpu-reassociate-fp-reductions` soon.
With latest ToT that flag is on by default.
So I understand correctly this log: `Huggingface` is the pytorch value?
Yes, that is the value we get from the pytorch runtime.
Sorry to say I seem to get better performance with `--iree-llvmcpu-reassociate-fp-reductions=False`:
`--iree-llvmcpu-reassociate-fp-reductions=True` (default):
(shark.venv) ean@sharkbox:~/SHARK/tank/examples/opt$ /home/ean/iree-build/tools/iree-compile ./opt_1-3b_causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-flow-enable-data-tiling --iree-llvmcpu-stack-allocation-limit=140000 --iree-llvmcpu-enable-microkernels -o opt_1-3b_causallm_128_torch_cpu_tiled_ukernels_fp_reassoc.vmfb
(shark.venv) ean@sharkbox:~/SHARK/tank/examples/opt$ iree-benchmark-module --module=./opt_1-3b_causallm_128_torch_cpu_tiled_ukernels_fp_reassoc.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=10 --task_topology_max_group_count=14 --device=local-task
2023-07-18T13:32:23-07:00
Running /home/ean/SHARK/shark.venv/lib/python3.11/site-packages/iree/runtime/scripts/iree_benchmark_module/../../iree-benchmark-module
Run on (16 X 4000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1024 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.67, 3.14, 7.56
---------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time 798 ms 5921 ms 1 items_per_second=1.25344/s
BM_forward/process_time/real_time 759 ms 5879 ms 1 items_per_second=1.31727/s
BM_forward/process_time/real_time 762 ms 5892 ms 1 items_per_second=1.31168/s
BM_forward/process_time/real_time 754 ms 5837 ms 1 items_per_second=1.32597/s
BM_forward/process_time/real_time 759 ms 5862 ms 1 items_per_second=1.31826/s
BM_forward/process_time/real_time 760 ms 5873 ms 1 items_per_second=1.31539/s
BM_forward/process_time/real_time 761 ms 5879 ms 1 items_per_second=1.31371/s
BM_forward/process_time/real_time 762 ms 5895 ms 1 items_per_second=1.3116/s
BM_forward/process_time/real_time 762 ms 5897 ms 1 items_per_second=1.3127/s
BM_forward/process_time/real_time 763 ms 5898 ms 1 items_per_second=1.3103/s
BM_forward/process_time/real_time_mean 764 ms 5883 ms 10 items_per_second=1.30903/s
BM_forward/process_time/real_time_median 761 ms 5886 ms 10 items_per_second=1.31321/s
BM_forward/process_time/real_time_stddev 12.1 ms 22.9 ms 10 items_per_second=0.0200596/s
BM_forward/process_time/real_time_cv 1.59 % 0.39 % 10 items_per_second=1.53%
`--iree-llvmcpu-reassociate-fp-reductions=False`:
(shark.venv) ean@sharkbox:~/SHARK/tank/examples/opt$ /home/ean/iree-build/tools/iree-compile ./opt_1-3b_causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-flow-enable-data-tiling --iree-llvmcpu-stack-allocation-limit=140000 --iree-llvmcpu-enable-microkernels --iree-llvmcpu-reassociate-fp-reductions=False -o opt_1-3b_causallm_128_torch_cpu_tiled_ukernels.vmfb
(shark.venv) ean@sharkbox:~/SHARK/tank/examples/opt$ iree-benchmark-module --module=./opt_1-3b_causallm_128_torch_cpu_tiled_ukernels.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=10 --task_topology_max_group_count=14 --device=local-task
2023-07-18T13:27:49-07:00
Running /home/ean/SHARK/shark.venv/lib/python3.11/site-packages/iree/runtime/scripts/iree_benchmark_module/../../iree-benchmark-module
Run on (16 X 4000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1024 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.92, 6.57, 9.84
---------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time 751 ms 5790 ms 1 items_per_second=1.33219/s
BM_forward/process_time/real_time 754 ms 5833 ms 1 items_per_second=1.32702/s
BM_forward/process_time/real_time 756 ms 5843 ms 1 items_per_second=1.32273/s
BM_forward/process_time/real_time 758 ms 5836 ms 1 items_per_second=1.31952/s
BM_forward/process_time/real_time 752 ms 5799 ms 1 items_per_second=1.32904/s
BM_forward/process_time/real_time 754 ms 5821 ms 1 items_per_second=1.32564/s
BM_forward/process_time/real_time 754 ms 5808 ms 1 items_per_second=1.32647/s
BM_forward/process_time/real_time 755 ms 5821 ms 1 items_per_second=1.32446/s
BM_forward/process_time/real_time 756 ms 5832 ms 1 items_per_second=1.32258/s
BM_forward/process_time/real_time 757 ms 5846 ms 1 items_per_second=1.32073/s
BM_forward/process_time/real_time_mean 755 ms 5823 ms 10 items_per_second=1.32504/s
BM_forward/process_time/real_time_median 755 ms 5826 ms 10 items_per_second=1.32505/s
BM_forward/process_time/real_time_stddev 2.20 ms 18.8 ms 10 items_per_second=3.86125m/s
BM_forward/process_time/real_time_cv 0.29 % 0.32 % 10 items_per_second=0.29%
With debug iree-runtime builds, the results are a bit more sporadic so the two cases seem quite similar (I can share those numbers if desired)
Tracy profile for the fastest configuration (latest IREE, seqlen 128, fp32, avx512) (link)
Dispatch list by total time:
Are the matrix multiplications in this model involving a matrix of constant weights (as is the case in many NN inference workloads) as opposed to runtime values being multiplied by runtime values (as is the case in some recent NN architectures like transformers)?
If some matmul operands are constant data, then the corresponding set_encoding dispatches are running on constant data and are prime candidates for being constant-evaluated (`--iree-opt-const-eval`).
Hmm yes, it really is lots of big constant matrices (just looking at the `f16` model here but that should be the same).
func.func @forward(%arg0: tensor<1x128xi64>, %arg1: tensor<1x128xi64>) -> tensor<1x128x50272xf16> {
%cst = arith.constant dense_resource<__elided__> : tensor<2048xf16>
%cst_0 = arith.constant dense_resource<__elided__> : tensor<2048xf16>
%cst_1 = arith.constant dense_resource<__elided__> : tensor<2048x8192xf16>
%cst_2 = arith.constant dense_resource<__elided__> : tensor<8192xf16>
%cst_3 = arith.constant dense_resource<__elided__> : tensor<8192x2048xf16>
%cst_4 = arith.constant dense_resource<__elided__> : tensor<2048xf16>
%cst_5 = arith.constant dense_resource<__elided__> : tensor<2048xf16>
For example, %cst_1 is used by a linalg.generic performing a transposition,
%2040 = linalg.generic {indexing_maps = [#map4, #map17], iterator_types = ["parallel", "parallel"]} ins(%cst_1 : tensor<2048x8192xf16>) outs(%153 : tensor<8192x2048xf16>) {
^bb0(%in: f16, %out: f16):
linalg.yield %in : f16
} -> tensor<8192x2048xf16>
and that in turn becomes an operand to a matmul:
%2041 = linalg.matmul ins(%2039, %2040 : tensor<128x8192xf16>, tensor<8192x2048xf16>) outs(%73 : tensor<128x2048xf16>) -> tensor<128x2048xf16>
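The transpose-then-matmul pattern above is why hoisting pays off: the transpose's only input is constant data, so it can be evaluated once ahead of time instead of on every forward call. A minimal numpy sketch of that idea (shapes scaled down from the 128x8192 @ 8192x2048 matmul in the IR; all names here are illustrative, not IREE APIs):

```python
import numpy as np

# Scaled-down stand-ins for the IR above: `weight` plays the role of the
# constant %cst_1 (stored N x K, transposed before the matmul), and
# `activations` plays the runtime operand %2039.
M, K, N = 8, 512, 128

rng = np.random.default_rng(0)
weight = rng.standard_normal((N, K)).astype(np.float32)       # constant data
activations = rng.standard_normal((M, K)).astype(np.float32)  # runtime data

def forward_naive(x):
    # The transpose is (logically) redone on every call.
    return x @ weight.T

# Const-eval'd version: the transpose runs once "at compile time"
# because its only input is constant (analogous to hoisting %2040).
weight_t = np.ascontiguousarray(weight.T)

def forward_hoisted(x):
    return x @ weight_t

# Same result either way; only when the relayout happens differs.
assert np.allclose(forward_naive(activations), forward_hoisted(activations))
```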
So when IREE data-tiles that matmul, it creates a set_encoding dispatch consuming %2040 and running set_encoding on it, which in codegen (MaterializeEncodingPass) becomes a tensor.pack op. That dispatch is running on constant data (and clearly there are many more in this model), so I really expect that you'll see a benefit from --iree-opt-const-eval.
> Do the matrix multiplications in this model involve a matrix of constant weights (as is the case in many NN inference workloads), as opposed to runtime values being multiplied by runtime values (as is the case in some recent NN architectures like transformers)?
> If some matmul operands are constant data, then the corresponding set_encoding dispatches are running on constant data and are prime candidates for being constant-evaluated (--iree-opt-const-eval).
Is the flag all that's necessary to constant-evaluate the set_encoding dispatches? I got slightly less performant results with --iree-opt-const-eval:
(shark.venv) ean@sharkbox:~/SHARK/tank/examples/opt$ iree-compile ./opt_1-3b_causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-flow-enable-data-tiling --iree-llvmcpu-stack-allocation-limit=140000 --iree-llvmcpu-enable-microkernels --iree-llvmcpu-reassociate-fp-reductions=False --iree-opt-const-eval -o opt_1-3b_causallm_128_torch_cpu_tiled_ukernels.vmfb
(shark.venv) ean@sharkbox:~/SHARK/tank/examples/opt$ TRACY_NO_EXIT=1 /home/ean/iree-build/tools/iree-benchmark-module --module=./opt_1-3b_causallm_128_torch_cpu_tiled_ukernels.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=10 --task_topology_max_group_count=14 --device=local-task
2023-07-18T14:13:25-07:00
Running /home/ean/iree-build/tools/iree-benchmark-module
Run on (16 X 4000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 1024 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.12, 0.16, 0.67
***WARNING*** Library was built as DEBUG. Timings may be affected.
---------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time 813 ms 6037 ms 1 items_per_second=1.23032/s
BM_forward/process_time/real_time 775 ms 6013 ms 1 items_per_second=1.28964/s
BM_forward/process_time/real_time 774 ms 5997 ms 1 items_per_second=1.2922/s
BM_forward/process_time/real_time 775 ms 5990 ms 1 items_per_second=1.29072/s
BM_forward/process_time/real_time 775 ms 5998 ms 1 items_per_second=1.29076/s
BM_forward/process_time/real_time 771 ms 5954 ms 1 items_per_second=1.29677/s
BM_forward/process_time/real_time 774 ms 5970 ms 1 items_per_second=1.29181/s
BM_forward/process_time/real_time 773 ms 5966 ms 1 items_per_second=1.29397/s
BM_forward/process_time/real_time 774 ms 5976 ms 1 items_per_second=1.29195/s
BM_forward/process_time/real_time 775 ms 5988 ms 1 items_per_second=1.28976/s
BM_forward/process_time/real_time_mean 778 ms 5989 ms 10 items_per_second=1.28579/s
BM_forward/process_time/real_time_median 774 ms 5989 ms 10 items_per_second=1.29129/s
BM_forward/process_time/real_time_stddev 12.3 ms 24.3 ms 10 items_per_second=0.0196042/s
BM_forward/process_time/real_time_cv 1.58 % 0.41 % 10 items_per_second=1.52%
Hmm, nothing off the top of my head. I need to look into this.
There are a couple of things we need to do to get the const eval to work here; it's not a simple flag flip. First, we need a way for the MaterializeEncodingPass to use target information other than what is specified on the dispatch (basically, have an "override" attribute somewhere and set it during const eval, so that const eval can run those dispatches). Once that is done, we just need to make const-eval hoisting hoist out the const -> set_encoding chain.
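The payoff of hoisting that constant -> set_encoding chain can be sketched in a few lines of Python: pull the constant-only relayout out of the per-call function into a one-time, load-time initializer. A call counter shows the relayout runs once rather than per invocation (all names here are illustrative, not IREE internals):

```python
import numpy as np

relayout_calls = 0

def relayout(w):
    # Stand-in for the constant -> set_encoding / tensor.pack chain.
    global relayout_calls
    relayout_calls += 1
    return np.ascontiguousarray(w)

WEIGHT = np.ones((4, 8), dtype=np.float32)  # constant data
PACKED = relayout(WEIGHT)  # "hoisted" initializer: runs once at load time

def forward(x):
    # Only the data-dependent work remains in the per-call path.
    return x @ PACKED

for _ in range(10):
    forward(np.ones((2, 4), dtype=np.float32))

# Ten forward calls, but the relayout ran exactly once.
assert relayout_calls == 1
```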
@MaheshRavishankar thanks for the explanation; if you file an issue with a ~4x expanded version of that to get me started, I might be able to try.
I realized meanwhile that we also needed --iree-opt-const-expr-hoisting for part of what you're saying here, but I was missing that there would be something specific to set_encoding here.
> @MaheshRavishankar thanks for the explanation; if you file an issue with a ~4x expanded version of that to get me started, I might be able to try.
I don't know what the full solution is, but it's definitely worth starting an issue, describing what I know, and getting Ben's/Stella's help on the rest. Stay tuned.
> I realized meanwhile that we also needed --iree-opt-const-expr-hoisting for part of what you're saying here, but I was missing that there would be something specific to set_encoding here.
There is already an issue for this: https://github.com/openxla/iree/issues/11360. I'll add some things there.
For OPT-1.3b (fp32) we would like to burn down performance at the dispatch level.
Here is a tracy profile for the model executed end-to-end.
To reproduce this trace:
Download opt-1_3b-causallm_cpu_torch.mlir
Run iree-compile (or download opt_untuned.vmfb):
Benchmark opt_untuned.vmfb:
Capture and profile the trace.
Here is a screenshot of the dispatches from the profile statistics, ordered by total runtime: