nod-ai / SHARK

SHARK - High Performance Machine Learning Distribution
Apache License 2.0

OPT-1.3b performance tracker #1589

Open monorimet opened 1 year ago

monorimet commented 1 year ago

For OPT-1.3b (fp32) we would like to burn down performance at the dispatch level.

Here is a tracy profile for the model executed end-to-end.

To reproduce this trace:

  1. Download opt-1_3b-causallm_cpu_torch.mlir

  2. Run iree-compile (or download opt_untuned.vmfb):

    iree-compile ./opt-1_3b-causallm_cpu_torch.mlir --iree-hal-target-backends=llvm-cpu -o opt_untuned.vmfb
  3. Benchmark opt_untuned.vmfb:

    TRACY_NO_EXIT=1 iree-benchmark-module --module=opt_untuned.vmfb --function="forward" --input=1x8xi64 --input=1x8xi64 --benchmark_repetitions=50 --task_topology_max_group_count=16 
  4. Capture and profile the trace (see the capture sketch below).
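
For step 4, one way to capture the trace is sketched below. This is only a sketch: it assumes your IREE build has Tracy instrumentation enabled and ships the Tracy capture CLI (often built as iree-tracy-capture), and the output filename is arbitrary.

    # Shell 1: run the benchmark from step 3; TRACY_NO_EXIT=1 keeps the process
    # alive until a profiler connects.
    # Shell 2: connect to the running benchmark and write the trace to a file.
    iree-tracy-capture -o opt_untuned.tracy
    # Open opt_untuned.tracy in the Tracy profiler UI to inspect per-dispatch statistics.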

Here is a screenshot of the dispatches from the profile statistics, ordered by total runtime.

monorimet commented 1 year ago

I also noticed a significant performance delta for larger sequence lengths. Is this expected? End-to-end execution of the sequence length 128 version of OPT takes about 9 times longer than the tiny seqlen-8 version above.

Here is a tracy profile for the model executed end-to-end.

To reproduce this trace:

  1. Download opt-1_3b-causallm_128_torch.mlir

  2. Run iree-compile (or download opt-1_3b_causallm_128_torch_cpu-task.vmfb):

    iree-compile ./opt-1_3b-causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu -o opt-1_3b_causallm_128_torch_cpu-task.vmfb
  3. Benchmark opt-1_3b_causallm_128_torch_cpu-task.vmfb:

    TRACY_NO_EXIT=1 iree-benchmark-module --module=opt-1_3b_causallm_128_torch_cpu-task.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=50 --task_topology_max_group_count=16 
  4. Capture and profile the trace.

Here is a screenshot of the dispatches from the profile statistics, ordered by total runtime.


yzhang93 commented 1 year ago

Do you have tuned results for dispatches 17 and 18? It looks like we should boil these down first.

monorimet commented 1 year ago

Do you have tuned results for dispatches 17 and 18? It looks like we should boil these down first.

Tuning these dispatches for tile/workgroup sizes does not yield a significant performance delta, unfortunately.

monorimet commented 1 year ago

cc @MaheshRavishankar @hanhanW

Current status:

The sequence length = 8 case should see significant improvements from @bjacob's work; see this Discord thread.

We are discussing the matmul cases where M > 16, as the associated optimization space is drastically different from the narrow matmul cases. (Ideally the existing codegen will yield acceptable performance with the right flags -- I will update this issue with confirmation on whether the 'wider' matmuls are performant with the correct flags.)

bjacob commented 1 year ago

Thanks @monorimet for the summary. FTR, the above-linked Discord thread also contains suggestions of flags to use for the M>=16 case. Looking forward to hearing how that performs; we can discuss next steps from there.

monorimet commented 1 year ago

Here is an updated set of artifacts and tracy profile for the case where sequence length (M dimension) = 128, at fp16 precision.

The .mlir can be found here: opt-1_3b_causallm_128_torch.mlir


The .vmfb is located here: opt-1_3b_causallm_128_torch_cpu-task.vmfb

The tracy profile is also provided here: opt128_noukernels.tracy


Case 2 - with microkernels enabled:

In this case, I was surprised to see a sharp decrease in performance. I assume we are simply not in a case where this flag helps, but in case something seems wrong, I wanted to share my results:


The .vmfb is located here: opt-1_3b_causallm_128_torch_cpu-task_ukernels.vmfb

The tracy profile is also provided here: opt128_ukernels.tracy


bjacob commented 1 year ago

The likely reason why microkernels decrease performance here is the f16 element type --- I have not yet added the optimized code paths for f16 microkernels, so it's currently very slow generic code. I plan to do so this week though, so we can retry then.

Just checking - is it intentional that f16 was the element type here? The benchmarks discussed above were f32 and the last comment even says:

Here is an updated set of artifacts and tracy profile for the case where sequence length (M dimension) = 128, at fp32 precision.

monorimet commented 1 year ago

I am also confused to see the f16 precision. Let me make sure this .mlir is fp32 precision and I will update accordingly.

Edit: Yes, my mistake, this .mlir is in half-precision. I'll post again with the fp32 profiles in a few minutes.

monorimet commented 1 year ago

Running the same flags as above with the fp32 OPT.mlir results in a segfault in iree-benchmark-module.

I will be removing flags to see if any specific ones are the culprit.

Here are the commands I'm using where the segfaults occur (these worked in fp16):

/home/ean/iree-build/tools/iree-compile ./opt-1_3b_causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-flow-enable-data-tiling --iree-llvmcpu-target-cpu=cascadelake --iree-llvmcpu-stack-allocation-limit=256000 --iree-llvmcpu-enable-microkernels -o opt-1_3b_causallm_128_torch_cpu-task_ukernels.vmfb
TRACY_NO_EXIT=1 /home/ean/iree-build/tools/iree-benchmark-module --module=opt-1_3b_causallm_128_torch_cpu-task.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=10 --task_topology_max_group_count=16 --device=local-task
2023-07-11T18:34:15+00:00
Running /home/ean/iree-build/tools/iree-benchmark-module
Run on (16 X 2800.27 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 33792 KiB (x1)
Load Average: 21.13, 25.39, 20.56
Segmentation fault (core dumped)

The same occurs without the microkernels flag. I am also noticing stack allocations requiring the --iree-llvmcpu-stack-allocation-limit flag to be set to at least 131072 (iree-compile fails if it is set lower).

monorimet commented 1 year ago

It seems --iree-flow-enable-data-tiling was the flag causing the segfault in f32. I will upload results without this flag shortly.

MaheshRavishankar commented 1 year ago

That is the flag that triggers everything. Without that flag, the other flags don't do anything. I will take a look, but it will take me some time (a day or so) to get around to it. We need to track down the stack allocation issue as well.

monorimet commented 1 year ago

That is the flag that triggers everything. Without that flag, the other flags don't do anything. I will take a look, but it will take me some time (a day or so) to get around to it. We need to track down the stack allocation issue as well.

OK. In the meantime I will find out whether the sequence length (the M dimension in the matmuls) is relevant to the segfaults -- 128 is a somewhat arbitrary choice, so if we can find some other M > 16 that works with data tiling, we can be at least temporarily unblocked.
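
One way to run such a sweep is sketched below. This is only a sketch: it assumes one .mlir export per sequence length, and the per-length file names (opt-1_3b_causallm_${LEN}_torch.mlir) are hypothetical, following the naming pattern used elsewhere in this issue.

    # Hypothetical sweep over sequence lengths with data tiling enabled; report
    # which lengths compile and benchmark cleanly and which ones crash.
    for LEN in 8 16 32 64 128; do
      iree-compile ./opt-1_3b_causallm_${LEN}_torch.mlir \
        --iree-hal-target-backends=llvm-cpu \
        --iree-llvmcpu-target-cpu-features=host \
        --iree-flow-enable-data-tiling \
        --iree-llvmcpu-stack-allocation-limit=140000 \
        -o opt_${LEN}_tiled.vmfb || { echo "compile failed at seqlen ${LEN}"; continue; }
      iree-benchmark-module --module=opt_${LEN}_tiled.vmfb --function=forward \
        --input=1x${LEN}xi64 --input=1x${LEN}xi64 --device=local-task \
        || echo "benchmark crashed or failed at seqlen ${LEN}"
    done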

monorimet commented 1 year ago

It seems the data tiling flag now causes this segfault for all sequence lengths. I will see what I can do to bisect the problem and, if possible, isolate the problematic dispatches.

MaheshRavishankar commented 1 year ago

@monorimet I can take a look at this this week. I have to finish a few things before I can get to it, so I'll post here when I do.

monorimet commented 1 year ago

OK. I have tracked the segfault down to dispatch 25 (correct me if the numbering could be wrong -- I'm not sure whether the number in the function signature matches the dispatch index, but the segfault begins to occur with flow-level data tiling starting at index 25).

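For reference, per-dispatch benchmark modules like the one used in the next comment can be generated while compiling the full model. This is a sketch, assuming IREE's executable-benchmark dump option; the dump directory name is arbitrary.

    # Emit a standalone benchmark .mlir per dispatch during compilation; a suspect
    # dispatch (e.g. dispatch 25) can then be compiled and benchmarked on its own,
    # as shown in the next comment.
    iree-compile ./opt-1_3b_causallm_128_torch.mlir \
      --iree-hal-target-backends=llvm-cpu \
      --iree-llvmcpu-target-cpu-features=host \
      --iree-flow-enable-data-tiling \
      --iree-llvmcpu-stack-allocation-limit=140000 \
      --iree-hal-dump-executable-benchmarks-to=opt-1_3b-causallm_cpu_dispatches/ \
      -o /dev/null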

monorimet commented 1 year ago

I was able to compile and run dispatch 25 with the following commands:

(shark.venv) ean@ean-highmem:~/SHARK/tank/examples/opt$ /home/ean/iree-build/tools/iree-compile ./opt-1_3b-causallm_cpu_dispatches/module_forward_dispatch_25_embedded_elf_x86_64_benchmark.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-llvmcpu-target-cpu=cascadelake --iree-llvmcpu-stack-allocation-limit=131072 --iree-flow-enable-data-tiling -o opt-1_3b_causallm_128_dispatch_25_tiled.vmfb 
(shark.venv) ean@ean-highmem:~/SHARK/tank/examples/opt$ /home/ean/iree-build/tools/iree-benchmark-module --module=./opt-1_3b_causallm_128_dispatch_25_tiled.vmfb

Yielding the following results:

2023-07-12T23:56:22+00:00
Running /home/ean/iree-build/tools/iree-benchmark-module
Run on (16 X 2800.27 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 33792 KiB (x1)
Load Average: 0.00, 0.00, 0.00
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                              Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
BM_forward_dispatch_25_embedded_elf_x86_64_forward_dispatch_25_matmul_32x8192x2048_f32/process_time/real_time      10150 us        76696 us           52 items_per_second=98.5238/s

This behavior is replicated at sequence lengths 8, 16, 32, and 128 -- it isn't very surprising, as the data tiling happens at the flow level.

I will back up to the flow level, do some more poking around, and try a few other cases in f16 tomorrow / compare with PyTorch in the meantime -- please let me know if there's anywhere I can focus my efforts to help, @MaheshRavishankar.

MaheshRavishankar commented 1 year ago

I was able to compile and run dispatch 25 with the following commands: [...]

Could you just create an issue on IREE with the dispatch itself? If you look at the IR after iree-flow-outline-dispatch-region, you will see a bunch of flow.executable ops. You can take the func.func within the offending dispatch and pretty easily recreate a small repro. If you post even just that func.func from the flow.executable, that should be a good enough starting point. I think the issue is with respect to pack and unpack fusion. We might be fusing too aggressively.

bjacob commented 1 year ago

Also, --mlir-elide-elementsattrs-if-larger=10 (or some such value) helps generate test cases that are not bloated by large constant data, and --iree-util-zero-fill-elided-attrs makes those elided constant buffers be treated as filled with zeros.
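
Putting those together, one possible repro-extraction flow is sketched below. This is a sketch only: the pass name passed to --mlir-print-ir-after is my rendering of the pass mentioned above (it may be spelled iree-flow-outline-dispatch-regions in your build), and the file names are arbitrary.

    # 1. Dump the IR after dispatch-region outlining, eliding large constants so
    #    the dump stays small (the IR dump goes to stderr):
    iree-compile ./opt-1_3b_causallm_128_torch.mlir \
      --iree-hal-target-backends=llvm-cpu \
      --iree-llvmcpu-target-cpu-features=host \
      --iree-flow-enable-data-tiling \
      --iree-llvmcpu-stack-allocation-limit=140000 \
      --mlir-print-ir-after=iree-flow-outline-dispatch-regions \
      --mlir-elide-elementsattrs-if-larger=10 \
      -o /dev/null 2> flow_dump_elided.mlir
    # 2. Copy the offending flow.executable's func.func into a small repro.mlir,
    #    then compile it with the elided constants treated as zero-filled:
    iree-compile repro.mlir --iree-hal-target-backends=llvm-cpu \
      --iree-util-zero-fill-elided-attrs -o repro.vmfb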

monorimet commented 1 year ago

OK, I have filed the issue in IREE. For the segfaults, we can move the discussion to that issue until it is resolved.

monorimet commented 1 year ago

I have adapted a script for running a perf comparison (SHARK/IREE vs. PyTorch) for OPT-1.3b causal LM inference.

Without data tiling (using SHARK's default IREE CPU flags -- I can dig these out if anyone is interested), we achieve the following:

(shark.venv) ean@sharkbox:~/SHARK/tank/examples/opt$ python opt_perf_comparison.py
Loading flatbuffer at opt_1-3b_causallm_8_torch_cpu.vmfb as a mmapped file
[DEBUG] setting iree runtime flags for cpu:
--task_topology_max_group_count=14
--- Took 0.2379467487335205 seconds to load Shark.
prompt: What is the meaning of life?
prompt: Tell me something you don't know.
prompt: What does Xilinx do?
prompt: What is the mass of earth?
prompt: What is a poem?
prompt: What is recursion?
prompt: Tell me a one line joke.
prompt: Who is Gilgamesh?
prompt: Tell me something about cryptocurrency.
prompt: How did it all begin?
--- Took 4.178373336791992 seconds to run Shark.
--- Took 5.567362308502197 seconds to load Huggingface.
prompt: What is the meaning of life?
prompt: Tell me something you don't know.
prompt: What does Xilinx do?
prompt: What is the mass of earth?
prompt: What is a poem?
prompt: What is recursion?
prompt: Tell me a one line joke.
prompt: Who is Gilgamesh?
prompt: Tell me something about cryptocurrency.
prompt: How did it all begin?
--- Took 3.4269025325775146 seconds to run Huggingface.
--- Took 4.178373336791992 seconds to run Shark.
--- Took 3.4269025325775146 seconds to run Huggingface.

bjacob commented 1 year ago

@monorimet - With https://github.com/openxla/iree/issues/14398 now fixed, here are some benchmark results to give a flavor of the performance to expect. Note: testing on an Intel Skylake Xeon CPU with AVX-512, compiling with --iree-llvmcpu-target-cpu=skylake-avx512. Command lines are as in the original description above.

So, data-tiling alone is a ~8x speedup. Ukernels alone are not yet good, but I'll get to that now, and they will be at least as fast as non-ukernels and in some cases faster. What's almost certainly happening here is that this particular model is f32, and f32 matmuls on ISAs like AVX-512 are what the default codegen is good at. As soon as we depart from that, e.g. f16, things are more challenging for default codegen and the ukernels become more of a win.
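
For reference, the compile and benchmark steps described here would look roughly as follows. This is a sketch assuming the same flags used earlier in this issue, with the CPU pinned to skylake-avx512 instead of host feature detection; file names are arbitrary.

    # Data-tiled build explicitly targeting Skylake AVX-512:
    iree-compile ./opt-1_3b_causallm_128_torch.mlir \
      --iree-hal-target-backends=llvm-cpu \
      --iree-llvmcpu-target-cpu=skylake-avx512 \
      --iree-flow-enable-data-tiling \
      --iree-llvmcpu-stack-allocation-limit=140000 \
      -o opt_128_skylake_tiled.vmfb
    # Benchmark as before:
    iree-benchmark-module --module=opt_128_skylake_tiled.vmfb --function=forward \
      --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=10 \
      --task_topology_max_group_count=16 --device=local-task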

jpienaar commented 1 year ago

Nice! Are these the narrow shapes case? E.g., mostly I've seen ukernels doing rather well :)

bjacob commented 1 year ago

Nice! Are these the narrow shapes case? E.g., mostly I've seen ukernels doing rather well :)

Oh, I get it now - no, it's not narrow (it's opt-1_3b_causallm_128_torch.mlir, which has sequence length 128), but contrary to what I assumed in my previous comment, it is an f16 model. So neither the default codegen nor the ukernels are good at the moment, but the ukernels are even worse, as they are running completely generic, slow code. OK, I didn't remember we were already looking at f16 --- I'll reorder my queue so f16 ukernels come sooner.

MaheshRavishankar commented 1 year ago

Nice! Are these the narrow shapes case? E.g., mostly I've seen ukernels doing rather well :)

Oh, I get it now - no, it's not narrow (it's opt-1_3b_causallm_128_torch.mlir, which has sequence length 128), but contrary to what I assumed in my previous comment, it is an f16 model. So neither the default codegen nor the ukernels are good at the moment, but the ukernels are even worse, as they are running completely generic, slow code. OK, I didn't remember we were already looking at f16 --- I'll reorder my queue so f16 ukernels come sooner.

What's the target here? With data tiling but without ukernels, this is at 72 us (maybe 515 us is just so bad that this is an unfair comparison).

bjacob commented 1 year ago

I'll let Nod decide if there's a specific target; all I know is we still have plenty of room to run :-)

monorimet commented 1 year ago

Nice! Are these the narrow shapes case? E.g., mostly I've seen ukernels doing rather well :)

Oh, I get it now - no, it's not narrow (it's opt-1_3b_causallm_128_torch.mlir, which has sequence length 128), but contrary to what I assumed in my previous comment, it is an f16 model. So neither the default codegen nor the ukernels are good at the moment, but the ukernels are even worse, as they are running completely generic, slow code. OK, I didn't remember we were already looking at f16 --- I'll reorder my queue so f16 ukernels come sooner.

The first case where we wanted to meet or beat PyTorch performance was OPT-1.3b in fp32 precision. With data tiling, I do see a significant improvement in e2e execution time (tested on this IREE SHA).

The following are e2e benchmarks on OPT-1.3b in fp32 precision, at sequence length 128, with AVX-512 instructions enabled. I will preface each result with the reproduction commands.

Link to opt_1-3b_causallm_128_torch.mlir

Case 1: No data tiling, no microkernels

iree-compile ./opt_1-3b_causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-llvmcpu-stack-allocation-limit=140000 -o opt_1-3b_causallm_128_torch_cpu_base.vmfb

iree-benchmark-module --module=./opt_1-3b_causallm_128_torch_cpu_base.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=10 --task_topology_max_group_count=16 --device=local-task
2023-07-18T11:06:19-07:00
Running /home/ean/SHARK/shark.venv/lib/python3.11/site-packages/iree/runtime/scripts/iree_benchmark_module/../../iree-benchmark-module
Run on (16 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 0.33, 0.38, 0.46
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time              2072 ms        15921 ms            1 items_per_second=0.482637/s
BM_forward/process_time/real_time              2060 ms        15901 ms            1 items_per_second=0.485382/s
BM_forward/process_time/real_time              2068 ms        16012 ms            1 items_per_second=0.483572/s
BM_forward/process_time/real_time              2052 ms        15923 ms            1 items_per_second=0.4873/s
BM_forward/process_time/real_time              2089 ms        16011 ms            1 items_per_second=0.478638/s
BM_forward/process_time/real_time              2177 ms        15987 ms            1 items_per_second=0.459379/s
BM_forward/process_time/real_time              2058 ms        15968 ms            1 items_per_second=0.485893/s
BM_forward/process_time/real_time              2042 ms        15825 ms            1 items_per_second=0.489682/s
BM_forward/process_time/real_time              2049 ms        15889 ms            1 items_per_second=0.488093/s
BM_forward/process_time/real_time              2053 ms        15936 ms            1 items_per_second=0.487093/s
BM_forward/process_time/real_time_mean         2072 ms        15937 ms           10 items_per_second=0.482767/s
BM_forward/process_time/real_time_median       2059 ms        15930 ms           10 items_per_second=0.485638/s
BM_forward/process_time/real_time_stddev       39.2 ms         58.9 ms           10 items_per_second=8.79873m/s
BM_forward/process_time/real_time_cv           1.89 %          0.37 %            10 items_per_second=1.82%

Case 2: Data tiling, no ukernels:

iree-compile ./opt_1-3b_causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-flow-enable-data-tiling --iree-llvmcpu-stack-allocation-limit=140000 -o opt_1-3b_causallm_128_torch_cpu_tiled.vmfb

iree-benchmark-module --module=./opt_1-3b_causallm_128_torch_cpu_tiled.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=10 --task_topology_max_group_count=14 --device=local-task
2023-07-18T12:29:20-07:00
Running /home/ean/SHARK/shark.venv/lib/python3.11/site-packages/iree/runtime/scripts/iree_benchmark_module/../../iree-benchmark-module
Run on (16 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 2.18, 1.57, 0.80
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time               762 ms         5892 ms            1 items_per_second=1.31257/s
BM_forward/process_time/real_time               764 ms         5918 ms            1 items_per_second=1.30817/s
BM_forward/process_time/real_time               764 ms         5911 ms            1 items_per_second=1.3083/s
BM_forward/process_time/real_time               778 ms         5893 ms            1 items_per_second=1.28507/s
BM_forward/process_time/real_time               766 ms         5936 ms            1 items_per_second=1.30487/s
BM_forward/process_time/real_time               764 ms         5934 ms            1 items_per_second=1.30868/s
BM_forward/process_time/real_time               763 ms         5926 ms            1 items_per_second=1.30978/s
BM_forward/process_time/real_time               765 ms         5929 ms            1 items_per_second=1.30763/s
BM_forward/process_time/real_time               766 ms         5924 ms            1 items_per_second=1.30607/s
BM_forward/process_time/real_time               762 ms         5895 ms            1 items_per_second=1.31302/s
BM_forward/process_time/real_time_mean          765 ms         5916 ms           10 items_per_second=1.30642/s
BM_forward/process_time/real_time_median        764 ms         5921 ms           10 items_per_second=1.30824/s
BM_forward/process_time/real_time_stddev       4.70 ms         17.1 ms           10 items_per_second=7.91746m/s
BM_forward/process_time/real_time_cv           0.61 %          0.29 %            10 items_per_second=0.61%

Case 3: Data tiling and ukernels:

iree-compile ./opt_1-3b_causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-flow-enable-data-tiling --iree-llvmcpu-stack-allocation-limit=140000 --iree-llvmcpu-enable-microkernels -o opt_1-3b_causallm_128_torch_cpu_tiled_ukernels.vmfb

iree-benchmark-module --module=./opt_1-3b_causallm_128_torch_cpu_tiled_ukernels.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=10 --task_topology_max_group_count=14 --device=local-task
2023-07-18T12:33:04-07:00
Running /home/ean/SHARK/shark.venv/lib/python3.11/site-packages/iree/runtime/scripts/iree_benchmark_module/../../iree-benchmark-module
Run on (16 X 2093.89 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 0.85, 1.23, 0.84
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time               751 ms         5786 ms            1 items_per_second=1.33228/s
BM_forward/process_time/real_time               750 ms         5808 ms            1 items_per_second=1.33304/s
BM_forward/process_time/real_time               750 ms         5814 ms            1 items_per_second=1.33299/s
BM_forward/process_time/real_time               748 ms         5775 ms            1 items_per_second=1.33752/s
BM_forward/process_time/real_time               753 ms         5824 ms            1 items_per_second=1.32875/s
BM_forward/process_time/real_time               753 ms         5773 ms            1 items_per_second=1.32838/s
BM_forward/process_time/real_time               747 ms         5768 ms            1 items_per_second=1.33878/s
BM_forward/process_time/real_time               750 ms         5791 ms            1 items_per_second=1.33385/s
BM_forward/process_time/real_time               752 ms         5801 ms            1 items_per_second=1.32931/s
BM_forward/process_time/real_time               750 ms         5803 ms            1 items_per_second=1.33333/s
BM_forward/process_time/real_time_mean          750 ms         5794 ms           10 items_per_second=1.33282/s
BM_forward/process_time/real_time_median        750 ms         5796 ms           10 items_per_second=1.33301/s
BM_forward/process_time/real_time_stddev       1.95 ms         18.8 ms           10 items_per_second=3.46296m/s
BM_forward/process_time/real_time_cv           0.26 %          0.32 %            10 items_per_second=0.26%

So in my case the tiled + ukernels mode seems to produce the best results.

Examining the performance delta vs. PyTorch is a bit tricky -- we have to feed an input of 127 tokens for the PyTorch model to be comparable to our sequence length 128 model. Evidently, padding with the tokenizer doesn't stop PyTorch from using the smallest possible model for optimal performance. This is generally equivalent in behavior to the dynamic path in the torch-mlir/IREE stack.

Since we are looking at sequence length 128, I've run the performance comparison with excerpts of the Declaration of Independence, 127 words each, for 5 iterations:

(shark.venv) ean@sharkbox:~/SHARK/tank/examples/opt$ python opt_perf_comparison.py
Loading flatbuffer at opt_1-3b_causallm_128_torch_cpu_tiled_ukernels.vmfb as a mmapped file
[DEBUG] setting iree runtime flags for cpu:
--task_topology_max_group_count=14
--- Took 0.23573708534240723 seconds to load Shark.
prompt: We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness.--That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, --That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient causes
prompt: <truncated>
prompt: <truncated>
prompt: <truncated>
prompt: <truncated>
--- Took 4.245878458023071 seconds to run Shark.
--- Took 5.405649662017822 seconds to load Huggingface.
prompt: <truncated>
prompt: <truncated>
prompt: <truncated>
prompt: <truncated>
prompt: <truncated>
--- Took 3.9855310916900635 seconds to run Huggingface.

4.25 s vs. 3.99 s is really quite good! Is there anything I've missed in the latest IREE, given these reproducers, that could push us past PyTorch performance?

MaheshRavishankar commented 1 year ago

https://github.com/openxla/iree/pull/13822 might help a bit

bjacob commented 1 year ago

So in my case the tiled + ukernels mode seems to produce the best results.

This is because this is an f32 model, and the sequence length is high enough that this doesn't depend on the microkernels' handling of narrow cases. So there, as long as you're data-tiling, both pure codegen and microkernels perform well, with microkernels being even a little faster --- that's consistent with what we've observed on other models.

When the data type is not f32, pure codegen tends to struggle a bit, and microkernels can shine if they have a dedicated code path for that case, or be extremely slow if they don't --- that must sound scary, but it's not, because it's a problem that we fix once and for all per element type and then it applies to all models.

When the shapes are narrow (when the sequence length is small), codegen tends to adapt gracefully, but microkernels don't currently have fast code for narrow cases, so that's the other thing I want to fix very soon in microkernels.

Further performance gains beyond that point will come from:

Examining the performance delta vs. PyTorch is a bit tricky

So to understand this log correctly: Huggingface is the PyTorch value?

monorimet commented 1 year ago

openxla/iree#13822 might help a bit

Thanks. I am building the latest IREE with Tracy to get a trace of the tiled + ukernels case, so I will share results with --iree-llvmcpu-reassociate-fp-reductions soon.
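
For anyone reproducing this, a Tracy-instrumented IREE build is configured roughly as below. This is only a sketch; the exact CMake options and paths may differ by IREE version, and a non-debug build type avoids the "built as DEBUG" timing warning that shows up in a later log.

    # Hypothetical configure/build of an IREE runtime with Tracy instrumentation:
    cmake -G Ninja -S . -B ../iree-build \
      -DCMAKE_BUILD_TYPE=RelWithDebInfo \
      -DIREE_ENABLE_RUNTIME_TRACING=ON
    cmake --build ../iree-build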

MaheshRavishankar commented 1 year ago

openxla/iree#13822 might help a bit

Thanks. I am building the latest IREE with Tracy to get a trace of the tiled + ukernels case, so I will share results with --iree-llvmcpu-reassociate-fp-reductions soon.

With the latest ToT, that flag is on by default.

monorimet commented 1 year ago

So to understand this log correctly: Huggingface is the PyTorch value?

Yes, that is the value we get from the PyTorch runtime.

monorimet commented 1 year ago

Sorry to say I seem to get better performance with --iree-llvmcpu-reassociate-fp-reductions=False:

--iree-llvmcpu-reassociate-fp-reductions=True (default)

(shark.venv) ean@sharkbox:~/SHARK/tank/examples/opt$ /home/ean/iree-build/tools/iree-compile ./opt_1-3b_causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-flow-enable-data-tiling --iree-llvmcpu-stack-allocation-limit=140000 --iree-llvmcpu-enable-microkernels -o opt_1-3b_causallm_128_torch_cpu_tiled_ukernels_fp_reassoc.vmfb

(shark.venv) ean@sharkbox:~/SHARK/tank/examples/opt$ iree-benchmark-module --module=./opt_1-3b_causallm_128_torch_cpu_tiled_ukernels_fp_reassoc.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=10 --task_topology_max_group_count=14 --device=local-task
2023-07-18T13:32:23-07:00
Running /home/ean/SHARK/shark.venv/lib/python3.11/site-packages/iree/runtime/scripts/iree_benchmark_module/../../iree-benchmark-module
Run on (16 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 0.67, 3.14, 7.56
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time               798 ms         5921 ms            1 items_per_second=1.25344/s
BM_forward/process_time/real_time               759 ms         5879 ms            1 items_per_second=1.31727/s
BM_forward/process_time/real_time               762 ms         5892 ms            1 items_per_second=1.31168/s
BM_forward/process_time/real_time               754 ms         5837 ms            1 items_per_second=1.32597/s
BM_forward/process_time/real_time               759 ms         5862 ms            1 items_per_second=1.31826/s
BM_forward/process_time/real_time               760 ms         5873 ms            1 items_per_second=1.31539/s
BM_forward/process_time/real_time               761 ms         5879 ms            1 items_per_second=1.31371/s
BM_forward/process_time/real_time               762 ms         5895 ms            1 items_per_second=1.3116/s
BM_forward/process_time/real_time               762 ms         5897 ms            1 items_per_second=1.3127/s
BM_forward/process_time/real_time               763 ms         5898 ms            1 items_per_second=1.3103/s
BM_forward/process_time/real_time_mean          764 ms         5883 ms           10 items_per_second=1.30903/s
BM_forward/process_time/real_time_median        761 ms         5886 ms           10 items_per_second=1.31321/s
BM_forward/process_time/real_time_stddev       12.1 ms         22.9 ms           10 items_per_second=0.0200596/s
BM_forward/process_time/real_time_cv           1.59 %          0.39 %            10 items_per_second=1.53%

--iree-llvmcpu-reassociate-fp-reductions=False:

(shark.venv) ean@sharkbox:~/SHARK/tank/examples/opt$ /home/ean/iree-build/tools/iree-compile ./opt_1-3b_causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-flow-enable-data-tiling --iree-llvmcpu-stack-allocation-limit=140000 --iree-llvmcpu-enable-microkernels --iree-llvmcpu-reassociate-fp-reductions=False -o opt_1-3b_causallm_128_torch_cpu_tiled_ukernels.vmfb

(shark.venv) ean@sharkbox:~/SHARK/tank/examples/opt$ iree-benchmark-module --module=./opt_1-3b_causallm_128_torch_cpu_tiled_ukernels.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=10 --task_topology_max_group_count=14 --device=local-task
2023-07-18T13:27:49-07:00
Running /home/ean/SHARK/shark.venv/lib/python3.11/site-packages/iree/runtime/scripts/iree_benchmark_module/../../iree-benchmark-module
Run on (16 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 0.92, 6.57, 9.84
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time               751 ms         5790 ms            1 items_per_second=1.33219/s
BM_forward/process_time/real_time               754 ms         5833 ms            1 items_per_second=1.32702/s
BM_forward/process_time/real_time               756 ms         5843 ms            1 items_per_second=1.32273/s
BM_forward/process_time/real_time               758 ms         5836 ms            1 items_per_second=1.31952/s
BM_forward/process_time/real_time               752 ms         5799 ms            1 items_per_second=1.32904/s
BM_forward/process_time/real_time               754 ms         5821 ms            1 items_per_second=1.32564/s
BM_forward/process_time/real_time               754 ms         5808 ms            1 items_per_second=1.32647/s
BM_forward/process_time/real_time               755 ms         5821 ms            1 items_per_second=1.32446/s
BM_forward/process_time/real_time               756 ms         5832 ms            1 items_per_second=1.32258/s
BM_forward/process_time/real_time               757 ms         5846 ms            1 items_per_second=1.32073/s
BM_forward/process_time/real_time_mean          755 ms         5823 ms           10 items_per_second=1.32504/s
BM_forward/process_time/real_time_median        755 ms         5826 ms           10 items_per_second=1.32505/s
BM_forward/process_time/real_time_stddev       2.20 ms         18.8 ms           10 items_per_second=3.86125m/s
BM_forward/process_time/real_time_cv           0.29 %          0.32 %            10 items_per_second=0.29%

With debug iree-runtime builds, the results are a bit more sporadic, so the two cases seem quite similar (I can share those numbers if desired).

Tracy profile for the fastest configuration (latest IREE, seqlen 128, fp32, avx512) (link)

Dispatch list by total time.

bjacob commented 1 year ago

Do the matrix multiplications in this model involve a matrix of constant weights (as is the case in many NN inference workloads), as opposed to runtime values being multiplied by runtime values (as in some recent NN architectures like transformers)?

If some matmul operands are constant data, then the corresponding set_encoding dispatches are running on constant data and are prime candidates for being constant-evaluated (--iree-opt-const-eval).

bjacob commented 1 year ago

Hmm, yes, it really is lots of big constant matrices (I'm just looking at the f16 model here, but that should be the same).

  func.func @forward(%arg0: tensor<1x128xi64>, %arg1: tensor<1x128xi64>) -> tensor<1x128x50272xf16> {
    %cst = arith.constant dense_resource<__elided__> : tensor<2048xf16>
    %cst_0 = arith.constant dense_resource<__elided__> : tensor<2048xf16>
    %cst_1 = arith.constant dense_resource<__elided__> : tensor<2048x8192xf16>
    %cst_2 = arith.constant dense_resource<__elided__> : tensor<8192xf16>
    %cst_3 = arith.constant dense_resource<__elided__> : tensor<8192x2048xf16>
    %cst_4 = arith.constant dense_resource<__elided__> : tensor<2048xf16>
    %cst_5 = arith.constant dense_resource<__elided__> : tensor<2048xf16>

For example, %cst_1 is used by a linalg.generic performing a transposition,

 %2040 = linalg.generic {indexing_maps = [#map4, #map17], iterator_types = ["parallel", "parallel"]} ins(%cst_1 : tensor<2048x8192xf16>) outs(%153 : tensor<8192x2048xf16>) {
    ^bb0(%in: f16, %out: f16):
      linalg.yield %in : f16
    } -> tensor<8192x2048xf16>

and that in turn becomes an operand to matmul:

    %2041 = linalg.matmul ins(%2039, %2040 : tensor<128x8192xf16>, tensor<8192x2048xf16>) outs(%73 : tensor<128x2048xf16>) -> tensor<128x2048xf16>

So when IREE data-tiles that matmul, it creates a set_encoding dispatch consuming %2040 and running set_encoding on it, which in codegen (MaterializeEncodingPass) becomes a tensor.pack op. That dispatch is running on constant data (and clearly there are many more like it in this model), so I really expect that you'll see a benefit from --iree-opt-const-eval.

monorimet commented 1 year ago

Do the matrix multiplications in this model involve a matrix of constant weights (as is the case in many NN inference workloads), as opposed to runtime values being multiplied by runtime values (as in some recent NN architectures like transformers)?

If some matmul operands are constant data, then the corresponding set_encoding dispatches are running on constant data and are prime candidates for being constant-evaluated (--iree-opt-const-eval).

Is the flag all that's necessary to constant-evaluate the set_encoding dispatches? I got slightly worse results with --iree-opt-const-eval:

(shark.venv) ean@sharkbox:~/SHARK/tank/examples/opt$ iree-compile ./opt_1-3b_causallm_128_torch.mlir --iree-hal-target-backends=llvm-cpu --iree-llvmcpu-target-cpu-features=host --iree-flow-enable-data-tiling --iree-llvmcpu-stack-allocation-limit=140000 --iree-llvmcpu-enable-microkernels --iree-llvmcpu-reassociate-fp-reductions=False --iree-opt-const-eval -o opt_1-3b_causallm_128_torch_cpu_tiled_ukernels.vmfb
(shark.venv) ean@sharkbox:~/SHARK/tank/examples/opt$ TRACY_NO_EXIT=1 /home/ean/iree-build/tools/iree-benchmark-module --module=./opt_1-3b_causallm_128_torch_cpu_tiled_ukernels.vmfb --function="forward" --input=1x128xi64 --input=1x128xi64 --benchmark_repetitions=10 --task_topology_max_group_count=14 --device=local-task
2023-07-18T14:13:25-07:00
Running /home/ean/iree-build/tools/iree-benchmark-module
Run on (16 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 1024 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 0.12, 0.16, 0.67
***WARNING*** Library was built as DEBUG. Timings may be affected.
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
BM_forward/process_time/real_time               813 ms         6037 ms            1 items_per_second=1.23032/s
BM_forward/process_time/real_time               775 ms         6013 ms            1 items_per_second=1.28964/s
BM_forward/process_time/real_time               774 ms         5997 ms            1 items_per_second=1.2922/s
BM_forward/process_time/real_time               775 ms         5990 ms            1 items_per_second=1.29072/s
BM_forward/process_time/real_time               775 ms         5998 ms            1 items_per_second=1.29076/s
BM_forward/process_time/real_time               771 ms         5954 ms            1 items_per_second=1.29677/s
BM_forward/process_time/real_time               774 ms         5970 ms            1 items_per_second=1.29181/s
BM_forward/process_time/real_time               773 ms         5966 ms            1 items_per_second=1.29397/s
BM_forward/process_time/real_time               774 ms         5976 ms            1 items_per_second=1.29195/s
BM_forward/process_time/real_time               775 ms         5988 ms            1 items_per_second=1.28976/s
BM_forward/process_time/real_time_mean          778 ms         5989 ms           10 items_per_second=1.28579/s
BM_forward/process_time/real_time_median        774 ms         5989 ms           10 items_per_second=1.29129/s
BM_forward/process_time/real_time_stddev       12.3 ms         24.3 ms           10 items_per_second=0.0196042/s
BM_forward/process_time/real_time_cv           1.58 %          0.41 %            10 items_per_second=1.52%

bjacob commented 1 year ago

Hmm, nothing off the top of my head. I need to look into this.

MaheshRavishankar commented 1 year ago

There are a couple of things we need to do to get const eval to work here; it's not a simple flag flip. One, we need a way for the MaterializeEncodingPass to use target information other than what is specified on the dispatch (basically, have an "override" attribute somewhere and set that attribute during const eval, so that const eval can run those dispatches). Once that is done, we just need const-eval hoisting to hoist out the const -> set_encoding.

bjacob commented 1 year ago

@MaheshRavishankar thanks for the explanation; if you file an issue with a ~4x expanded version of that to get me started, I might be able to try.

I realized meanwhile that we also needed --iree-opt-const-expr-hoisting for part of what you're saying here, but I was missing that there would be something specific to set_encoding here.
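
Once the set_encoding hoisting work described above lands, the compile command would presumably just add these two flags to the earlier ones. A sketch only, building on the commands above; per the discussion, flipping the flags today is not expected to help.

    # Hypothetical: data tiling + ukernels + const-eval + const-expr hoisting.
    iree-compile ./opt_1-3b_causallm_128_torch.mlir \
      --iree-hal-target-backends=llvm-cpu \
      --iree-llvmcpu-target-cpu-features=host \
      --iree-flow-enable-data-tiling \
      --iree-llvmcpu-enable-microkernels \
      --iree-llvmcpu-stack-allocation-limit=140000 \
      --iree-opt-const-eval \
      --iree-opt-const-expr-hoisting \
      -o opt_1-3b_causallm_128_torch_cpu_tiled_ukernels_consteval.vmfb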

MaheshRavishankar commented 1 year ago

@MaheshRavishankar thanks for the explanation; if you file an issue with a ~4x expanded version of that to get me started, I might be able to try.

I don't know what the full solution is, but it's definitely worth starting an issue, describing what I know, and getting Ben/Stella's help on the rest. Stay tuned.

I realized meanwhile that we also needed --iree-opt-const-expr-hoisting for part of what you're saying here, but I was missing that there would be something specific to set_encoding here.

MaheshRavishankar commented 1 year ago

There is already an issue for this: https://github.com/openxla/iree/issues/11360. I'll add some things there.