luau-lang / luau

A fast, small, safe, gradually typed embeddable scripting language derived from Lua
https://luau.org
MIT License

Float4 vectors #196

Closed petrihakkinen closed 2 years ago

petrihakkinen commented 2 years ago

Are there any plans to support float4 natively? I understand float4 does not fit into the tagged value currently. However, there are major use cases for float4s like HDR RGBA colors and quaternions (for representing 3D rotations), which could be much faster with a native float4 type.

I've previously expanded tagged values in Lua 5.1 and 5.4 to contain a float4 vector without significant performance penalty on x64. Eliminating the GC pressure outweighs the slowdown of a few percent that we noticed in our benchmarks. This change increased the tagged value from 16 bytes to 24 bytes. Naturally, Luau perf profile might be different because it is so highly optimized compared to vanilla Lua...

Another alternative is to implement float4s as userdata but that will have potentially heavy GC cost in some applications such as games.

If adding native float4 is not desirable for Roblox, are you open to the idea of supporting it optionally via compile time conditional (i.e. some ifdef)?

LoganDark commented 2 years ago

It's not currently possible to store float4s as values because of the size of TValue, and increasing the size by 50% (!) is probably not an option considering what they decided to do for float3s (pack the extra into some spare padding). Even if the performance penalty is not 'significant' in your testing.

Opinion

I wouldn't say quaternions are a particularly interesting format for rotations, you're much better off using matrices (even though they do use more space), like Roblox. Just like a 3D isometry can be represented by a 4x4 matrix (think CFrame), a 3D rotation can be represented by a 3x3 matrix. Though if your chosen engine does not support matrices I suppose there's no reason to convert back and forth.

As for colors, what stops you from storing RGB in a vector and A as a separate number? Abuse the fact that Lua(u) can pass around multiple values at a time.

Edit: It's probably possible to use NaN boxing for the last component in a Vector4, but this would complicate every piece of code that wants to check tt.

petrihakkinen commented 2 years ago

I've used game engines that use matrices and some that use quaternions. I think quaternions are well suited in many cases and nicer to work with, because of their compactness.

Along the same lines, splitting colors into RGB and A components is not ideal. Code that deals with RGBA colors gets more complex and you need to pass two args everywhere. This gets particularly nasty if you have functions that take more than one RGBA color as arguments.

I understand the proposal would increase the size of the tagged value, but the benefits are major in applications that deal with a lot of float4s (many games do).

I've read through the Luau VM code and it seems adding float4 support as an option (ifdef) could be done relatively cleanly. We could probably start working on a pull request in the near future, but we'd like to first hear if this would be likely accepted or not.

LoganDark commented 2 years ago

> we'd like to first hear if this would be likely accepted or not.

It's likely you'd have to keep a forked version of Luau as this is probably not really an acceptable tradeoff for Roblox since they use CFrames instead of quaternions. But we'll have to wait until @zeux gets a chance to put down his thoughts before we can be sure.

zeux commented 2 years ago

So here are some rough thoughts; don't treat this as a design or anything, or a promise that we'll support this, but this may be a reasonable path forward.

As an aside, Roblox doesn't use quaternions for transforms, and that's pretty unfortunate. It's not consequential here since we'd go with cframe=vec3+quat which is probably too large for TValue anyway, but if I could roll the clock back 15 years this is one of the things I'd change.

Increasing the size of Value unconditionally is out of the question. There are important performance benefits to keeping TValue power of two as it improves codegen in places where we need to convert between pointers and indices; there are important performance benefits to keeping TValue as small as possible because it reduces cache pressure; but, most importantly, the size of TValue is a significant factor that determines the overall memory footprint, and increasing it is not acceptable when running on low-end mobile devices.

When we were looking into native support for 3-float vectors, there was an option to implement a dedicated vector type, as well as an option to implement more generic support for larger userdata. We ended up going with a vector type because embedding a type tag into the existing value while keeping its size and leaving 12 bytes of storage was awkward to combine with some performance and memory tradeoffs I won't go into; what sealed the deal was that first-class support for vector type in the VM allowed us to implement much faster arithmetics and indexing as we can define most of the core semantics in the VM.

float4 doesn't truly need this, if only for the reason that storing something like a quaternion in a float4 requires a different set of operations. So I think the real question here is "how do we implement quaternion/color/etc. types efficiently without GC pressure".

I'm optimistic that our future work on improving GC throughput, including generational+incremental GC and other tweaks, will make larger structures more sustainable to keep as heap userdata. That said, it's never going to be as fast, and it could be the case that for some applications, having access to slightly larger inline userdata would be valuable.

Which brings me to the alternative for 3-float vectors that we considered and ended up not pursuing, that is likely apt here. Today Luau exposes light userdata (one pointer worth of data, no metatable) and userdata (arbitrary data + metatable, heap allocated). Let's say there was a third type of userdata - with a metatable (crucial to support multiple different types), but stored inline in TValue. Let's call it fat userdata although a better name might be chosen.

TValue today is stored as Value, int extra, int tt. extra is used for Vector3::z storage. Let's say instead we store int extra[LUA_FATUSERDATA_WORDS], where LUA_FATUSERDATA_WORDS defaults to 1 but is configurable. Everything else is not compiled conditionally.
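A minimal sketch of the layout described above, assuming a 64-bit target. The names `SketchValue`, `SketchTValue` and `tvalueSize` are hypothetical stand-ins, not the actual Luau definitions; the template parameter plays the role of `LUA_FATUSERDATA_WORDS`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified, hypothetical stand-in for the Value union (8 bytes).
union SketchValue
{
    void* p;    // pointer payload (GC object, light userdata, ...)
    double n;   // number payload
    float v[2]; // first two vector components
};

// Simplified, hypothetical stand-in for TValue: Value, extra words, type tag.
// Words corresponds to LUA_FATUSERDATA_WORDS in the discussion.
template <std::size_t Words>
struct SketchTValue
{
    SketchValue value;
    std::int32_t extra[Words]; // with Words == 1 this is the slot used for the z component
    std::int32_t tt;           // type tag
};

// Size of the whole tagged value for a given number of extra words.
template <std::size_t Words>
constexpr std::size_t tvalueSize()
{
    return sizeof(SketchTValue<Words>);
}

// Restates the sizes quoted in the discussion; only meaningful on 64-bit targets.
inline void checkQuotedSizes()
{
    if (sizeof(void*) == 8)
    {
        assert(tvalueSize<1>() == 16); // default: 4 bytes of extra
        assert(tvalueSize<3>() == 24); // 12 bytes of extra
        assert(tvalueSize<5>() == 32); // 20 bytes of extra
    }
}
```

With this layout only the array length changes between configurations, which is the sense in which nothing else needs to be compiled conditionally.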

We add a new type tag, LUA_TFATUSERDATA; when the value is using this type tag, Value stores a pointer to the metatable for this object. We then change existing code that works with userdata objects in a mostly straightforward way. The biggest source of ambiguity is raw equality and hashing - here we can either decide to treat the user data as a byte blob for this purpose, or to prohibit these operations.
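To make the "byte blob" option for raw equality and hashing concrete, here is a rough sketch. Everything in it (`FatUserdata`, `rawEqualBlob`, `hashBlob`, `makeFloat4`, the choice of FNV-1a) is an assumption for illustration and not part of Luau.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr int kWords = 5; // hypothetical LUA_FATUSERDATA_WORDS

// Hypothetical inline userdata: a metatable pointer plus raw payload words.
struct FatUserdata
{
    const void* metatable;      // identifies the user-defined type
    std::int32_t extra[kWords]; // inline payload
};

// Raw equality as a byte blob: same metatable and bit-identical payload.
inline bool rawEqualBlob(const FatUserdata& a, const FatUserdata& b)
{
    return a.metatable == b.metatable && std::memcmp(a.extra, b.extra, sizeof(a.extra)) == 0;
}

// FNV-1a hash over the payload bytes; equal blobs produce equal hashes.
inline std::uint32_t hashBlob(const FatUserdata& u)
{
    unsigned char bytes[sizeof(u.extra)];
    std::memcpy(bytes, u.extra, sizeof(bytes));

    std::uint32_t h = 2166136261u;
    for (unsigned char c : bytes)
    {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Packs four floats into the payload; unused words stay zero so that
// comparisons and hashes are deterministic.
inline FatUserdata makeFloat4(const void* mt, float x, float y, float z, float w)
{
    FatUserdata u{};
    u.metatable = mt;
    const float f[4] = {x, y, z, w};
    std::memcpy(u.extra, f, sizeof(f));
    return u;
}
```

One consequence of comparing bits rather than values is that `0.0f` and `-0.0f` compare unequal, which may be one reason prohibiting these operations was floated as the alternative.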

This allows the users to sacrifice memory for reduced GC pressure. Today TValue is 16 bytes, 4 of which are extra; a 32-byte value will hold 5 words of extra data, which is enough for a float4 type, or possibly another type that's 20 bytes or less. In Roblox, for example - not that we'd ever use this option for memory concerns - there's a handful of types that would have fit, like Color3 (12 bytes), UDim2 (16 bytes), Rect2D (16 bytes), etc.

The behavior of this object would be customizable through metatable as per usual. One caveat is that this object would not be collectable (in the sense that the data isn't heap allocated), but still needs to be marked (because it contains a metatable pointer). This may require small tweaks to a few conditions in GC.

A variant of this idea is to require usage of tags. This would allow it to be more space-efficient - instead of storing an 8-byte metatable, we'd use 8 bytes of storage in Value plus LUA_FATUSERDATA_WORDS-1 words to store the value, and a single integer to store userdata tag plus other bits if need be. This removes the need to customize GC (the value is simply a value type at this point), adds a metatable array to global state, and allows a 24-byte TValue to store 16 bytes worth of data, which is more memory friendly.
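The tagged variant can be sketched roughly as follows. All names here (`TaggedValue`, `GlobalState`, `udatamt`, `storePayload`, `kTagFatUserdata`, the tag limit of 128) are hypothetical and chosen for illustration; the point is only the arithmetic: 8 bytes of Value plus two payload words give 16 bytes of data, one more word holds the tag, and the total is 24 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr int kFatWords = 3;       // hypothetical LUA_FATUSERDATA_WORDS giving a 24-byte value
constexpr int kMaxTags = 128;      // hypothetical size of the per-tag metatable array
constexpr int kTagFatUserdata = 9; // hypothetical type tag value

struct Table { const char* name; }; // stand-in for a metatable

struct GlobalState
{
    const Table* udatamt[kMaxTags]; // metatable array indexed by userdata tag
};

struct TaggedValue
{
    alignas(8) unsigned char value[8]; // the 8 bytes normally used by Value
    std::int32_t extra[kFatWords];     // extra[0..1]: payload, extra[2]: userdata tag
    std::int32_t tt;                   // type tag
};

static_assert(sizeof(TaggedValue) == 24, "8 (value) + 12 (extra) + 4 (tt)");

constexpr std::size_t kPayloadBytes = 8 + (kFatWords - 1) * 4; // 16 bytes of user data

// Copies up to 16 bytes of user data into the value and records the tag.
inline void storePayload(TaggedValue& v, int tag, const void* data, std::size_t size)
{
    assert(size <= kPayloadBytes && tag >= 0 && tag < kMaxTags);
    unsigned char bytes[kPayloadBytes] = {};
    std::memcpy(bytes, data, size);
    std::memcpy(v.value, bytes, 8);
    std::memcpy(v.extra, bytes + 8, kPayloadBytes - 8);
    v.extra[kFatWords - 1] = tag;
    v.tt = kTagFatUserdata;
}

// Reads the user data back out.
inline void loadPayload(const TaggedValue& v, void* out, std::size_t size)
{
    assert(size <= kPayloadBytes);
    unsigned char bytes[kPayloadBytes];
    std::memcpy(bytes, v.value, 8);
    std::memcpy(bytes + 8, v.extra, kPayloadBytes - 8);
    std::memcpy(out, bytes, size);
}

// The metatable comes from global state, so the value holds no GC pointer.
inline const Table* metatableOf(const GlobalState& g, const TaggedValue& v)
{
    return g.udatamt[v.extra[kFatWords - 1]];
}
```

Because the value carries only an integer tag, nothing in it needs to be marked by the collector, which matches the "simply a value type" remark above.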

A final variant is to require usage of tags, but instead of inventing a whole new type, repurpose lightuserdata for this. Essentially this would add tag support for lightuserdata (which is nice as it allows cheap pointer+metatable reflection without inflating TValue in the default configuration, as we'd have 8 bytes of storage by default!).

In either case we'd compile all of this code in all the time, we'd just vary the storage size. This in general would be the path that we'd want to take if we do this - extra configuration options are brittle and to the extent possible I'd like to avoid them.

LoganDark commented 2 years ago

> add tag support for lightuserdata

This can be done as-is with extra, I can even open a PR for it. Interested?

petrihakkinen commented 2 years ago

@zeux thanks for the detailed reply. I can definitely see use cases for the proposed fatuserdata feature. However, for vectors, which are heavily used by games, we'd be leaving a lot of performance on the table. Accessing all operations through a metatable (indexing with X/Y/Z/W, __add, __sub, __mul, etc.) is costly [*].

([*] Unless of course types are known at compile time and the compiler could somehow optimize away metatable accesses in this case, which would probably be tricky since contents of metatables can be tweaked at runtime.)

The way I see it, vectors should be a fundamental primitive type akin to numbers, and higher level constructs such as rectangles, colors, rays, etc. should be implemented on top of the vector type outside the VM (either as tables or using the fatuserdata & modules). If float4s were second class citizens, they would be inelegantly asymmetric with float3 vectors, since only float3s would be supported natively by the VM.

When it comes to vector/quat operations, I believe in a minimal set of operator overloading for the basic stuff like componentwise +, -, * and indexing. Anything else like dot and cross products, quaternion multiplication, normalization etc. is better implemented as pure functions. For example, dot2, dot3, dot4, normalize2, normalize3, normalize4, cross3, quat_mul, quat_normalize. Many SIMD C++ vector math libraries take this approach and have just one vector type. One benefit is that you can use e.g. dot2/dot3 on a 4-component vector, which would ignore the extra components. This eliminates the need to convert between different vector types. Another advantage is that the code is very explicit and therefore easier to reason about, both for the reader of the code and for the compiler. A typical case where a conversion can be avoided, if everything is just a "vector" as opposed to float2/3/4 etc., is projecting a homogeneous 4D vector to 3D by dividing by W.
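As an illustration of the "one vector type, explicit pure functions" style described above, here is a small sketch. The function names `dot2`, `dot3`, `dot4` and `cross3` come from the comment; the `Vec` struct and `projectHomogeneous` are hypothetical, and this is plain scalar code rather than SIMD.

```cpp
#include <cassert>

// A single 4-component vector type; narrower uses simply ignore trailing components.
struct Vec
{
    float x, y, z, w;
};

// Operators are limited to componentwise arithmetic.
inline Vec operator+(Vec a, Vec b) { return {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w}; }
inline Vec operator-(Vec a, Vec b) { return {a.x - b.x, a.y - b.y, a.z - b.z, a.w - b.w}; }
inline Vec operator*(Vec a, Vec b) { return {a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w}; }

// Everything else is an explicitly named pure function.
inline float dot2(Vec a, Vec b) { return a.x * b.x + a.y * b.y; }
inline float dot3(Vec a, Vec b) { return dot2(a, b) + a.z * b.z; }
inline float dot4(Vec a, Vec b) { return dot3(a, b) + a.w * b.w; }

// 3D cross product; the w component of the result is set to zero.
inline Vec cross3(Vec a, Vec b)
{
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x, 0.0f};
}

// Projects a homogeneous 4D vector to 3D by dividing by w, with no type conversion.
inline Vec projectHomogeneous(Vec v)
{
    return {v.x / v.w, v.y / v.w, v.z / v.w, 1.0f};
}
```

Calling `dot3` on two vectors whose `w` components hold unrelated data gives the same result as if `w` were zero, which is the "ignore the extra components" property mentioned above.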

Now, I'm not suggesting these auxiliary functions (dot2, dot3, etc.) should be part of Luau language. They can easily be implemented outside the VM by a module. But I believe first class support for 4-component vector would still be the right way.

Here's a thought experiment for you: if fatuserdata becomes a thing, would you be willing to remove vector type from the VM and implement 3D vectors as fatuserdata?

I agree that fewer #ifdefs is better generally, but in this case I think it could still be the right tradeoff. But there's a way around those ifdefs, if they're deemed unacceptable: we could add a new template argument, the width of vector values, to the VM function. What do you think about this?

To summarize, I'm proposing the following:

Note that I don't disagree with the usefulness of fatuserdata. I can see fast, native float4s (or to be more precise, configurable vector width) and fatuserdatas (for fast user defined types) co-existing peacefully.

Final point about naming, I think LUA_TFATUSERDATA gives the wrong impression that they're somehow fatter or more costly than regular userdata. A better name could perhaps be fastuserdata or userdatavalue, suggesting they're faster (no GC pressure), or that they are values instead of objects.

LoganDark commented 2 years ago

> would you be willing to remove vector type from the VM and implement 3D vectors as fatuserdata?

That would inflate every TValue by 8 bytes just for the same vector3 support that already exists. Every vector would gain the ability to have a metatable set, though, which could define separate operations for colors and other things.

> we could add a new template argument, the width of vector values, to the VM function.

That would not make much sense. In order for a 4-wide vector to exist, there has to be space for it in TValue. TValue is the union of all possible Lua values, and it must be able to contain anything in order for dynamically typed variables to exist. You're just asking for the size of TValue to be configurable, which is... what @zeux is proposing? But adding a "template argument" would overcomplicate things extremely, as suddenly every function which touches anything would have to be a template function, meaning compile times go waay up, FFI support goes out the window, etc.

> Note that I don't disagree with the usefulness of fatuserdata. I can see fast, native float4s (or to be more precise, configurable vector width) and fatuserdatas (for fast user defined types) co-existing peacefully.

You seem to think these two things are a difference of night and day, when in reality they are not. Metamethod access is fast. Luau does some really exotic stuff to discover the metamethod very quickly. The performance documentation even has a section on it. If your metamethods are C functions, the performance hit is already not that far off from being embedded in the interpreter directly (although it won't be as fast as fastcall).


The only way to support vector4s as native value types is to expand the size of TValue, which is exactly what fatuserdata is. Fatuserdata would just be a more general implementation that would support setting individual metatables of each new value type you want to add, so it would support vector4s, RGBA colors, quaternions, whatever (since those two should already have separate multiply operations).

Fatuserdata is honestly the only way to implement what you've asked without overcomplicating everything.

petrihakkinen commented 2 years ago

@LoganDark Please re-read my previous comment. I think you're misunderstanding what I'm proposing. A float3 can be embedded in the tagged value with no increase to current size regardless of whether the operations are implemented inside or outside the VM. A float4 does not currently fit into the tagged value, I'm well aware of that and I've communicated it in my original post, and the tagged value would expand by 4 bytes (or 8 depending on alignment). This expansion would only happen if the code is conditionally compiled to enable that (either using template or preprocessor), so it would not affect Roblox or any other user of Luau unless they specifically configure it like that.

LoganDark commented 2 years ago

> @LoganDark Please re-read my previous comment. I think you're misunderstanding what I'm proposing.

I read your comment perfectly fine.

> A float3 can be embedded in the tagged value with no increase to current size regardless of whether the operations are implemented inside or outside the VM.

Correct. However, if they were to move float3 onto the fatuserdata system as you propose in your previous comment:

> would you be willing to remove vector type from the VM and implement 3D vectors as fatuserdata?

they would waste the size of a metatable pointer, as vector3s only need one single global metatable: the one defining vector3 methods. This is what I said would waste space.

> A float4 does not currently fit into the tagged value, I'm well aware of that and I've communicated it in my original post, and the tagged value would expand by 4 bytes (or 8 depending on alignment). This expansion would only happen if the code is conditionally compiled to enable that (either using template or preprocessor), so it would not affect Roblox or any other user of Luau unless they specifically configure it like that.

You proposed templates specifically and I explained why specifically templates are not feasible. Nothing I said applies to preprocessor directives. Preprocessor directives can absolutely be used to expand TValue without widespread code changes, as @zeux proposed and as I literally have an open PR partially implementing.


You seem to be arguing specifically for a native vector4 type with a single global metatable similar to vector3 and other value types. Fatuserdata would kill 2 birds with one stone (or rather, infinite birds with one stone) and it does not waste much extra space (given the benefits & flexibility) so it is obviously the better choice here. But if you don't see that then feel free to continue arguing your point - to @zeux.

I however don't find it fun or constructive to argue with you over this so I'm going to turn notifications off. Good luck with your issue.

zeux commented 2 years ago

> When it comes to vector/quat operations, I believe in a minimal set of operator overloading for the basic stuff like componentwise +, -, * and indexing. Anything else like dot and cross products, quaternion multiplication, normalization etc. is better implemented as pure functions

Componentwise operations don't make sense for quaternions though. I'd expect a * for quaternions to implement quaternion product - however, the same operation on vectors or colors isn't meaningful either. This is part of the attraction of vector3 vs vector4.

So that I understand what you're proposing, you'd expect a define that changes the builtin vector type from 3 floats to 4 floats while preserving existing optimizations around first-class component wise operations and no GC, at the cost of an extra 8 bytes per TValue - not introducing a new vector type?

petrihakkinen commented 2 years ago

Re: componentwise operations on quaternions.

Yeah, you're right that they don't make mathematical sense (although they could still be used for lerping if the quaternions are close enough). But the operations do make sense if they are just raw float4s.

I have a feeling we're looking at vectors from slightly different POVs. In my mind vectors are low level primitives, closer to what they are in shading languages than to the strongly typed objects that come up in this discussion ("colors", "rectangles", "quaternions", etc.).

Note that I'm not suggesting we add a notion of quaternions to the core language. But if there were a way to implement fast float4s, it could be the perfect building block for anyone who wants to use raw float4s to get the highest possible efficiency. Strongly typed vectors, quaternions, colors or other types could still be implemented using other means if the user prefers to pay the higher perf cost.

I'm currently testing the various approaches to get some data. It could very well be that metatables do not have the huge impact they have in vanilla Lua but I haven't tested it yet. Today I tried expanding tagged value to float4 (24 bytes) and the overall perf hit seems to be 5%, which is the same I got with Lua 5.4. I'll try the fatuserdata approach with metatables tomorrow. I'll report back when I get some numbers. I understand there are slim chances of this ever getting into the mainline, but I hope this data will still be useful for others having similar needs.

Btw. one of the ideas I've toyed with in the past is to have different register sets for vectors and other types. This way the base tagged value type could be made very small (maybe 8 bytes by using NaN tagging). Fatter data types could be stored in a different place. At one extreme the tags would be stored in a packed array (maybe less than a byte per tag), and each data type would have separate arrays. This would of course imply two memory accesses per value but the upside could be better cache coherency. A bigger downside is having to somehow box values when storing them to tables, so it would be a major overhaul. Have you thought about this?
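A toy sketch of the "tags and data stored separately" idea, purely for illustration. The class `SplitRegisters`, the `Tag` enum and the per-type arrays are all hypothetical, this uses one byte per tag rather than a sub-byte packing, and it ignores the table-boxing problem mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Tag : std::uint8_t { Nil, Number, Float4 };

struct Float4
{
    float x, y, z, w;
};

// Hypothetical register file: one packed tag array plus one array per data type.
class SplitRegisters
{
public:
    explicit SplitRegisters(std::size_t count)
        : tags(count, Tag::Nil), numbers(count, 0.0), vectors(count, Float4{})
    {
    }

    void setNumber(std::size_t reg, double value)
    {
        tags[reg] = Tag::Number;
        numbers[reg] = value; // 8 bytes, no NaN tagging required
    }

    void setFloat4(std::size_t reg, Float4 value)
    {
        tags[reg] = Tag::Float4;
        vectors[reg] = value; // 16 bytes, stored outside the small slots
    }

    // Reading a value is two memory accesses: the tag, then the typed payload.
    Tag tagOf(std::size_t reg) const { return tags[reg]; }

    double number(std::size_t reg) const
    {
        assert(tags[reg] == Tag::Number);
        return numbers[reg];
    }

    Float4 float4(std::size_t reg) const
    {
        assert(tags[reg] == Tag::Float4);
        return vectors[reg];
    }

private:
    std::vector<Tag> tags;       // one byte per register in this sketch
    std::vector<double> numbers; // register set for numbers
    std::vector<Float4> vectors; // register set for float4 values
};
```

If a compiler already knows a register's type, it could skip the `tagOf` lookup and read the typed array directly.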

Sorry for the long comment!

petrihakkinen commented 2 years ago

"So that I understand what you're proposing, you'd expect a define that changes the builtin vector type from 3 floats to 4 floats while preserving existing optimizations around first-class component wise operations and no GC, at the cost of an extra 8 bytes per TValue - not introducing a new vector type?"

Yep. I'll send you a diff tomorrow so you get a better idea of all the places I had to touch.

petrihakkinen commented 2 years ago

Oh, one more thought before I go (sorry!), if tags and data would be stored separately, and if we had a compiler that knows the actual types like you have, we wouldn't need to even read the tags in many cases. So doubles could be packed in 8 bytes without NaN tagging, float4 into 16 bytes, etc. But I'm sure you're well ahead of me in this with your JIT plans.

zeux commented 2 years ago

> if tags and data would be stored separately

Yeah this is always considered but it's difficult to go here (or to a separate register set). There's a complex balance of various considerations around the soundness of type checking, complexity of compiler/runtime, etc. Today we still live in the world of unsound types and arbitrary modifications of environments that the calling module depends on (e.g. Roblox exposes vector types and makes it possible to construct them with Vector3.new, but it's always possible to substitute the Vector3 global via getfenv/setfenv, at which point the value returned by Vector3.new can be of arbitrary type). It's also very difficult to introduce custom layouts due to the pervasive use of the TValue type - this greatly simplifies the runtime while also constraining it.

So we'll see where exactly we'll evolve but some of these decisions that are seemingly obviously good ("of course it makes more sense to store tags and values separately") end up being very nuanced.

petrihakkinen commented 2 years ago

Ok, here are the numbers. I used a Lua port of smallpt path tracer for benchmarking. All benchmarks were run on i7-9700K on Windows 10, repeating the benchmarks ten times and picking the lowest time.

Execution times for different versions:

version            time     relative to baseline
original           15.8s    baseline
float4             16.6s    5% slower
fatuserdata 16B    23.8s    50% slower
fatuserdata 24B    24.8s    57% slower

Notes:

Proof of concept for float4s here: https://github.com/Roblox/luau/compare/master...petrihakkinen:float4?diff=unified

Implementation notes for float4:

Very rough proof of concept for fatuserdata (by Mikko Kallinen): https://github.com/Roblox/luau/compare/master...petrihakkinen:fatuserdata?diff=unified

Implementation notes for fatuserdata:

Conclusion:

The perf hit with fatuserdata is significant and it would not be acceptable for us to implement float4s with them. My opinion is that the float4 branch is the best tradeoff (at least from the options we have explored so far). Sure, there are some preprocessor conditionals involved, but the VM itself is clean. Maybe with a little bit of work some #ifdefs could be still eliminated.

Even though Roblox would not benefit from the fatter float4 vectors, other users of Luau probably would. At least for us the 5% perf hit is not a problem, so we'll happily trade it for fast, garbage-free float4s.

What do you think?

petrihakkinen commented 2 years ago

I noticed a few bugs in float4 branch. luaV_doarith does not set tag for vector properly and vector indexing does not work with uppercase W. I'll fix them tomorrow. Please ignore these issues for now.

petrihakkinen commented 2 years ago

Fixed luaV_doarith and uppercase W bugs in float4 branch.

zeux commented 2 years ago

Thanks! I agree that the float4 changes seem pretty minimal. Since they aren't introducing a new type and are just extending the builtin vector type it's not as much extra code as I was thinking about.

Thoughts about changing setvvalue to always accept 4 arguments but ignore the last argument by default? This would maybe clean up a bunch of code in VM/luaV_ without the use of loops (I'd like to avoid loops since the codegen in debug is likely not going to be great).
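A rough sketch of what "always accept 4 arguments but ignore the last one by default" could look like. This is not the actual `setvvalue` macro; `SKETCH_VECTOR_SIZE`, `SketchTValue`, `setVector`, `vectorW`, `addVectors` and `kTagVector` are hypothetical names used only to show that call sites can stay loop-free and free of conditionals.

```cpp
#include <cassert>

// Hypothetical compile-time vector width: 3 by default, 4 when opted in.
#ifndef SKETCH_VECTOR_SIZE
#define SKETCH_VECTOR_SIZE 3
#endif

constexpr int kTagVector = 4; // hypothetical type tag value

// Simplified stand-in for a tagged value that holds a vector.
struct SketchTValue
{
    float v[SKETCH_VECTOR_SIZE];
    int tt;
};

// Always takes four components; w is dropped in the 3-wide configuration.
inline void setVector(SketchTValue* obj, float x, float y, float z, float w)
{
    obj->v[0] = x;
    obj->v[1] = y;
    obj->v[2] = z;
#if SKETCH_VECTOR_SIZE == 4
    obj->v[3] = w;
#else
    (void)w; // ignored by default
#endif
    obj->tt = kTagVector;
}

// Reads the fourth component, or zero when the build has no fourth component.
inline float vectorW(const SketchTValue* obj)
{
#if SKETCH_VECTOR_SIZE == 4
    return obj->v[3];
#else
    (void)obj;
    return 0.0f;
#endif
}

// A call site written once for both widths, with no loop and no #if.
inline void addVectors(SketchTValue* out, const SketchTValue* a, const SketchTValue* b)
{
    setVector(out, a->v[0] + b->v[0], a->v[1] + b->v[1], a->v[2] + b->v[2], vectorW(a) + vectorW(b));
}
```

The conditional compilation is confined to the two small helpers, so arithmetic code such as `addVectors` reads the same in either configuration.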

petrihakkinen commented 2 years ago

Good point! setvvalue always with 4 args is a great idea. I can make a new version tomorrow.

petrihakkinen commented 2 years ago

Cleaned up the code and submitted a pull request. Please let me know if there's anything else.
