terralang / terra

Terra is a low-level system programming language that is embedded in and meta-programmed by the Lua programming language.
terralang.org
Other
2.73k stars 201 forks source link

In place update of vector element error #386

Open aiverson opened 5 years ago

aiverson commented 5 years ago

Trying to write into a field of a vector like v[0] = 0 produces the error message argument to operator must be an lvalue

Why isn't this supported?

elliottslaughter commented 5 years ago

Probably because vectors are supposed to map to the underlying hardware instructions. Presumably, the code you want to write is something like:

v[0] = first_value
v[1] = second_value
...

I.e. you're setting all the elements. So in the best case, this would turn into vector gather instruction or similar. Worst case it turns into a series of repeated load/store calls that the compiler might not be smart enough to combine. Better to explicitly construct the value all at once and then load it, or use the platform's native support for gathers---that way you don't run into unexpected performance degradation.

If you really do want just a convenient abstraction for a set of elements and don't specifically care about vectorization, arrays seem like a better fit.

Is the support for constructing vector values missing something, or am I mischaracterizing your use case?

aiverson commented 5 years ago

That isn't the usecase. The usecase is using vectors for geometric points. They get stored in a field of a point struct that implements all the appropriate math operations and uses entrymissing to allow accessing them by name. Points will typically be constructed properly using a vector constructor from all the components and most arithmetic operations will leave them in vector form. But sometimes only one field needs to be modified, and requiring it to load both, extract both, regather the vector, then store both would be less efficient than doing a direct-to-memory write of the single element when the vector doesn't need to stay in register for arithmetic. It could be implemented by using setentry to create a vector construction that grabs every field from the existing vector to build the new vector with the one substituted for the new value.

elliottslaughter commented 5 years ago

Right, so this doesn't seem like a good use case for vectors since if you want a 3D point there is no machine that will offer a vector type with 3 elements. (And even in say, 2D or 4D, those may or may not map to underlying vector type, e.g. on an x86 machine mixing SSE and AVX is very expensive so you don't just want to use them blindly.)

I would use vectors if and only if you're trying to use the underlying machine's vector type, i.e. when you have a conceptually large data structure and you're willing to break it into pieces depending on what the machine's vector size is. For cases where you have a specific, small constant number of elements, use arrays.

ErikMcClure commented 5 years ago

This is not how SSE instructions are used. Realtime rendering engines routinely use SSE instructions for 3D points by either filling the final element with a 0 value or padding the vector and simply loading the uninitialized padded bits into SSE instructions and ignoring the result. 2D compositors sometimes operate on two 2D points at once, loading one into the bottom 2 registers and one into the top 2. Even then, If you're using doubles, SSE instructions only operate on 2 at once, which can still give a significant performance boost for large expressions. This, of course, is just for SSE instructions, AVX instructions are even more complex, but the principle is the same.

While loading values into an SSE register for a single operation and then storing them is usually a pretty bad idea, that's not what this represents. Vectors cannot be stored as just arrays - unaligned loads and stores are much more expensive than aligned loads and stores, so if Terra is going to have a vector class, it's assumed that using a vector instead of an array guarantees 16-byte alignment (or 32-byte for AVX). It also signals intent, because a vector is supposed to use SSE vectorizations for chained operations, otherwise you are relying entirely on the LLVM optimizer to vectorize your operations. It also helps resolve data dependencies and helps with aliasing, because LLVM knows that the elements and independent of each other. One of the only reasons WebAssembly currently can't reach native speeds is precisely because of vectorization information that cannot be reconstructed due to pointer aliasing.

Even worse, Terra itself doesn't actually expose any of these operations, forcing you to either use assembly, declare LLVM intrinsics directly, or use vectors. This means vectors are the high level interface to vectorization, so saying use vectors if and only if you're trying to use the underlying machine's vector type makes no sense. A simple "vector type" cannot even come close to exposing all the SSE operations in a sane manner.

The goal here is that if you have a long chain of operations being done on say, a rectangle in a 2D compositor, you want the chain of operations to be loaded into SSE vectors and then stored once the operation is finished. You can do this manually, by creating an intermediate type representation for arithmetic expressions, but this should really be handled by the optimizer. The optimizer should be working with maximal information, so you would start by declaring a vector of a type, and then if it is not optimal to use SSE instructions on it, lower it to an array and perform the operations manually.

Terra needs to make up it's mind - is vector a low-level or a high-level abstraction? Because right now it's basically a very poor low-level and also a poor high-level abstraction. It has no way of representing precise SSE operations, like a packed horizontal add, but also can't do the high level work of raising expressions to SSE registers during an expression and lowering them to array operations when appropriate.

So Terra can either do the hard work at the high level of knowing when to optimize vectors to SSE or AVX registers, or it can directly expose the underlying operations and let the programmer use it as a low-level interface. Given that this kind of vectorization is something the LLVM optimizer is very good at, provided you tell it what can be vectorized, it seems to me like terra should work on having better high-level support for vectorized operations.

elliottslaughter commented 5 years ago

Hi @blackhole12, I appreciate the perspective. I do think there are some points you're confused on though.

Terra exposes a low-level vector interface: in fact, it exposes LLVM intrinsics to the user. (See e.g.: https://github.com/zdevito/terra/blob/master/tests/sgemm.t#L15) So you can in fact call everything directly (at least everything LLVM knows about).

It does also provide a very, very small amount of higher-level support. And this may be part of what is confusing. This is really just some simple sugar around the lower-level interface, and it's really not that sophisticated.

Your own example is actually a great demonstration of the compromises, and the care, that has to be taken to work with a vector type correctly (e.g. padding the elements of the vector to get it to fit the machine's vector size). You mention alignment, but actually this isn't fixed by using a vector type because e.g. malloc is outside of Terra's control and will not align AVX vectors correctly by default. And you didn't even address portability: if things are ugly handling just SSE and AVX, how much more so if you want to also support Altivec for PPC and Neon for ARM? This gets hairy really quickly. So using vectors correctly takes a fair bit of care (even more so when you need intrinsics), which is why I'm not sure a high-level interface is really viable.

My own perspective is that higher-level interfaces tend to be brittle. It's easy to have code that works, until something you do confuses the alias analysis pass, and then all the optimizations start failing. I don't necessarily object to improving the high-level aspect of Terra's vector interface, but I'm reluctant to add features that will be difficult to optimize or prone to making the optimizer fail.

If you or someone wants to suggest a better higher-level interface, I'm happy to consider those, but probably where we'd want to start is with an existing prototype so that we know things actually work. In theory this should be able to be developed on top of Terra's existing low-level vector support.

Anyway, as for the issue originally under discussion, it's not clear to me that a straightforward implementation of this would get optimized the way we want. It seems like exactly the sort of brittle operation that will be sensitive to what LLVM understands, and if you tweak your code in just the wrong way you'll get bad performance. I'm happy to be proven wrong but we're pretty much at the mercy of LLVM here so I'm not sure how we could be sure this would always optimize correctly.

aiverson commented 5 years ago

One of the main stated goals of Terra is to be able to write performance portable code efficiently.

If making the code performant requires manually writing a binding of the vector instructions for every supported platform, that is a problem.

Does Terra have any convenient way to make the code depend on the target architecture so that a local vectorized operation can be conditionalized at compile time for which platform it is to either generate a vectorized version with the platform specific intrinsic implementation, or to fall back to nonvectorized implementations? It can depend on the native architecture via terralib.nativetarget, but that isn't as good as being able to target vectorizations to the actual target for things like compiling a function to both CPU and GPU.

Because if Terra is going to try to be performance portable, there needs to be a way to conveniently adjust to a specific architecture during build, a way to expose doing custom optimization passes locally, a higher level vector interface that is at least no worse than doing it without vectorization, or something.

This relates to a question I asked on the zulipchat a while ago about custom operator fusion.

I can try to write a testbed implementation of this kind of thing, but my backlog is already very long and I won't get to it for a while unless something changes in priority.

elliottslaughter commented 5 years ago

As far as I'm aware, this is an open research problem, and I'm not sure anyone has a perfect solution at the moment. Though it would be good to do a literature review to actually confirm that, and not take my word for it.

The good news is, I think this can evolve in a library. Between exotypes and Terra's existing support for low-level vectorization, all the building blocks are there. If we get to the point where we're happy with what we've got, we can start to think about pulling pieces of it back into either Terra itself or the standard library.

It's not personally a high priority for me but I'm happy to interact with anyone who wants to work on this.

ErikMcClure commented 5 years ago

While this would need to be implemented as a library, it seems there has been significant progress recently in making portable vectorized operations: https://arxiv.org/pdf/1902.02816.pdf

This suggests that it is, in fact, at least theoretically possible to implement high level vector support that is portable and effective. Should someone attempt to take this on, a separate issue should probably be raised to track it.