luau-lang / luau

A fast, small, safe, gradually typed embeddable scripting language derived from Lua
https://luau-lang.org
MIT License
3.8k stars 352 forks source link

CodeGen: Preserve known tags for LOAD_TVALUE synthesized from LOADK #1201

Closed zeux closed 3 months ago

zeux commented 3 months ago

When lowering LOADK for booleans/numbers/nils, we deconstruct the operation using STORE_TAG which informs the rest of the optimization pipeline about the tag of the value. This is helpful to remove various tag checks.

When the constant is a string or a vector, we just use LOAD_TVALUE/STORE_TVALUE. For strings, this could be replaced by pointer load/store, but for vectors there's no great alternative using current IR ops; in either case, the optimization needs to be carefully examined for profitability as simply copying constants into registers for function calls could become more expensive.

However, there are cases where it's still valuable to preserve the tag. For vectors, doing any math with vector constants contains tag checks that could be removed. For both strings and vectors, storing them into a table has a barrier that for vectors could be elided, and for strings could be simplified as there's no need to confirm the tag.

With this change we now carry the optional tag of the value with LOAD_TVALUE. This has no performance effect on existing benchmarks but does reduce the generated code for benchmarks by ~0.1%, and it makes vector code more efficient (~5% lift on X64 log1p approximation).

zeux commented 3 months ago

This change results in ~4% lift on A64 and ~5% lift on X64 for vector approximation which is the main goal here, but it seems like it doesn't hurt for strings (the code size reduction on benchmark code is due to strings exclusively - I don't have access to a large Luau codebase so can't measure broad impact but assume it's a slight codesize reduction as well) so it's applied unconditionally.

zeux commented 3 months ago

fwiw assembly diff for A64 for vector approximation:

     8:     local r: vector = vector(0, 0, 0)
 LOADK R1 K0 [0, 0, 0]
 vector <- vector, any
-#   %4 = LOAD_TVALUE K0                                       ; useCount: 2, lastUse: %14
+#   %4 = LOAD_TVALUE K0, 0i, tvector                          ; useCount: 1, lastUse: %14
  ldr         q31,[x22]
-#   STORE_TVALUE R1, %4                                       ; %5
- str         q31,[x25,#16]
     9:     local a: vector = x
 MOVE R2 R0
 vector <- vector, any
 .L22:
-#   %6 = LOAD_TVALUE R0                                       ; useCount: 8, lastUse: %123
+#   %6 = LOAD_TVALUE R0                                       ; useCount: 7, lastUse: %123
  ldr         q30,[x25]
-#   STORE_TVALUE R2, %6                                       ; %7
- str         q30,[x25,#32]
    10:     r += a -- todo: * 1 doesn't fold yet
 ADD R1 R1 R2
 vector <- vector, vector
 .L23:
-#   %8 = LOAD_TAG R1                                          ; useCount: 1, lastUse: %9
- ldr         w17,[x25,#28]
-#   CHECK_TAG %8, tvector, exit(2)                            ; %9
- cmp         w17,#4
- b.ne        .L24
 #   %14 = ADD_VEC %4, %6                                      ; useCount: 1, lastUse: %39
  fadd        v31.4s,v31.4s,v30.4s
    11:     a *= x
 MUL R2 R2 R0
 vector <- vector, vector
vegorov-rbx commented 3 months ago

I think this change triggers a bug in another optimization and causes tests on Ubuntu to fail. I believe it will be fixed by the sync PR today, so will have to be rebased on that.

zeux commented 3 months ago

My prediction is that the sync will hit the same issue but we'll see :)

The error shows up in a slightly different run every time afaict - the first run of this PR actually failed the main test execution (no codegen). A rerun just now failed in a flags=false no-codegen run:

image

The failure happens before any tests actually run (no doctest banner which is printed first). This is an ASAN issue with ASLR that I also have locally on my system, it's spuriously failing test runs on master as well as on this branch. It's been reported here https://github.com/google/sanitizers/issues/1614 and fixed in some versions of clang/gcc but if you don't have the latest build then you're going to run into issues...

Locally, sudo sysctl -w vm.mmap_rnd_bits=28 fixes the issue completely - if the sync still sees it we may need to add this to the build steps before running any tests for the time being...

zeux commented 3 months ago

Judging from the references that are starting to appear on the linked issue, this started happening on GHA recently. Perhaps the number of randomized bits set to 32 is a recent GHA-side or Ubuntu-side config change.

zeux commented 3 months ago

https://github.com/luau-lang/luau/pull/1203

vegorov-rbx commented 3 months ago

Thanks for figuring that out.