Performance: Pallene checks in for-loops are preventing good compiler optimizations from being applied (loop unrolling & autovectorization)

This is a bit of a follow up to this other perfomance thread I posted at https://github.com/pallene-lang/pallene/discussions/561.

I've been looking a little deeper into the compiler's ability to unroll loops and autovectorize Pallene generated C code. For context, the era of instruction level parallelism, aka SIMD, is already well upon us, and this is one of the few ways to harvest out-sized gains on CPUs that are no longer ramping up clock speeds. This trend only seems to be continuing. So we should be on the constant look out for ways to take advantage of the hardware, when there are opportunities (especially if the C compiler/optimizer is willing to do most of the hardwork for us if we just do the "right" things.)

TLDR: I believe the checks generated in Pallene for-loops are preventing the compiler/optimizer from generating much better code, and thus sacrificing non-trivial amounts of performance gains we otherwise might be able to get for free. I'm hoping we can find a way to change things, such that the compiler can apply its much better optimizations.

So I experimented with a very simple for-loop that iterates through an array of doubles and adds up all the values in the array. I wrote 2 reference programs in C, and a progam in Pallene.

The first reference program is the most basic, ideal C program that iterates through an array of doubles.
The second progam is the same as above, except that instead of an array of doubles, it is an array of structs:
```
struct ContainerElement
{
double number;
int type;
};
```

This is my simplified interpretation of what the internal Lua array looks like. (The real Lua uses a union as the first field, so this is simplified to reduce the union down to a double.) So I believe this case should represent the theoretical max of what we hope the emitted Pallene code can achieve.

The third is a Pallene program which I experiment with by either changing the compiler optimization flags, or by directly hand-editing the generated C code to try to get better performance.

In all the results, the "loop time" is the important part. Startup time doesn't count.

Base C program:


$ gcc -Wall -O3 -ftree-vectorize -funroll-loops base.c 
$ time ./a.out 
startup time (doesn't count)    1.937357
result: 576460751296790656.000000
c loop time 0.824896

real 0m3.035s user 0m1.311s sys 0m1.723s


3. Container C program:

$ gcc -O3 -ftree-vectorize -funroll-loops container.c $ time ./a.out startup time (doesn't count) 3.736625 result: 576460751296790656.000000 c loop time 0.843595

real 0m5.075s user 0m1.905s sys 0m3.170s


5. Pallene:

3a. pallenec, no extra optimizations

$ pallenec basic_pln.pln $ time lua basic_lua.lua startup time (doesn't count) 17.048125 result: 5.7646075237053e+17 pln loop time 1.068943

real 0m19.432s user 0m15.983s sys 0m3.443s


3b. pallenec with CFLAGS:

$ export CFLAGS='-O3 -ftree-vectorize -funroll-loops' $ pallenec basic_pln.pln $ time lua basic_lua.lua startup time (doesn't count) 17.031442 result: 5.7646075237053e+17 pln loop time 1.063149

real 0m19.425s user 0m15.915s sys 0m3.506s


3c. hand modified --emit-c file removing both checks inside for-loop:

// pallene_renormalize_array(L, x1, x4, PALLENE_SOURCE_FILE, 9); / if (l_unlikely(!ttisfloat(slot))) { pallene_runtime_tag_check_error(L, "basic_pln.pln", 9, LUA_VNUMFLT, rawtt(slot), "array element"); } /

$ pallenec --emit-c basic_pln.pln $ cp basic_pln.c basic_pln_handmodify.c $ gcc -O3 -funroll-loops -ftree-vectorize -fpic -c -o basic_pln.o basic_pln_handmodify.c $ gcc -shared -fpic -o basic_pln.so basic_pln.o $ time lua basic_lua.lua startup time (doesn't count) 17.113667 result: 5.7646075237053e+17 pln loop time 0.84382

real 0m19.298s user 0m15.777s sys 0m3.520s


3d. hand modified --emit-c, removing only first check:

// pallene_renormalize_array(L, x1, x4, PALLENE_SOURCE_FILE, 9); $ pallenec --emit-c basic_pln.pln $ cp basic_pln.c basic_pln_handmodify.c $ gcc -O3 -funroll-loops -ftree-vectorize -fpic -c -o basic_pln.o basic_pln_handmodify.c $ gcc -shared -fpic -o basic_pln.so basic_pln.o $ time lua basic_lua.lua startup time (doesn't count) 17.080507 result: 5.7646075237053e+17 pln loop time 0.928175

real 0m19.333s user 0m15.920s sys 0m3.412s


3e. hand modified --emit-c, removing only second check:

/ if (l_unlikely(!ttisfloat(slot))) { pallene_runtime_tag_check_error(L, "basic_pln.pln", 9, LUA_VNUMFLT, rawtt(slot), "array element"); } / $ gcc -O3 -funroll-loops -ftree-vectorize -fpic -c -o basic_pln.o basic_pln_handmodify.c $ gcc -shared -fpic -o basic_pln.so basic_pln.o $ time lua basic_lua.lua startup time (doesn't count) 17.226762 result: 5.7646075237053e+17 pln loop time 1.066411

real 0m19.626s user 0m16.137s sys 0m3.488s



**To sum up**, _**after I remove both checks**_ inside the for-loop in the Pallene generated code, **_we get performance that matches the pure C counterpart._** **_But leaving in the checks, the optimizer is not able to unroll and vectorize._**
So we are looking at **_~23% slow down caused by the two checks inside the loop_**.

*Additional note: I didn't benchmark with more aggressive processor optimization flags, such as -mavx2, -mavx512f, but I did look at the generated assembly, and you can definitely see an impactful change in how it uses the SIMD registers. I think gcc by default is using something that looks like -msse2 on my computer. This makes sense because every x86-64 CPU can do SSE2.

Some possible ideas to improve this:

1. Add a compiler optimization flag (e.g. "--unchecked-loops") that removes these checks. This would be only for users who know what they are doing to opt into.

2. Is there a way we can generate these checks that don't cause the compiler to give up on the optimizations?

    a. Assuming there isn't a different way to write the same exact behavior that doesn't cause the optimizer to give up, one idea is that maybe for the check, we just keep track of a "found_an_error" boolean through the loop. Then after the end of the loop, we check the flag, and then outside the loop, we finally trigger the error. It comes a little late, but maybe this can be part of an optional compiler flag.
    b. Additionally, with this deferred error idea, maybe we only apply it to certain types that are less likely to cause dangerous errors until the end of the loop. For example, floats and integers might be able to defer the error, but other types we always keep the checks. The idea is that the program will eventually trigger an error anyway, so garbage numbers won't hopefully be able to do real damage, especially with an opted-in compiler flag.

3. Is there a way we can hoist these checks outside the loop?
    a. Is there something clever we can do before or after the loop to check?
    b. Maybe we can iterate through the loop twice. The first time is just the checks. The second time is the code without the checks. We'll need to benchmark, but the hope/presumption is that the performance gains won from the optimizer will vastly outweight the cost of looping through the array twice.

6. This will probably require cooperation with Lua, so down the line, if Lua decides Pallene will be an official feature moving forward, maybe an internal flag can be set somewhere that tracks whether an array is homogeneous or not. If the array is known to be homogeneous, then checks can be skipped. If Lua adds or changes a value, which results in a heterogenous array, it can set a flag that Pallene can read. This can allow Pallene to choose between two different code paths, with-checks and without-checks. Presumably, once Lua determines the array is no longer homogeneous, it can't unset the flag unless the array is emptied (or down to 1 element).

7. Pallene compiler/optimizer manually unrolls loops itself. Rather than completely relying on the underlying C compiler, Pallene could attempt to unroll some loops itself and try to generate output that might make it easier for the C autovectorizer to find operations it can combine into SIMD registers.
Most modern CPUs (both desktop and mobile) support at least 128-bit wide instructions. So assuming Pallene/Lua is using 64-bit ints and floats, I think an unrolling of 2 might be sufficient. Some flexibility for more powerful architectures or Lua compiled with using 32-bit sizes or just performance experimentation, to unroll more loops (e.g. 4 or 8 or 16) might be a nice to have (controlled by compiler flag options). Even if the C compiler isn't able to fully vectorize, sometimes the loop unrolling itself gives a big enough performance boost by itself to make it worth it. Since Lua is also used for embedded hardware which may benefit more from small code size and lack SIMD, compiler flags to turn on/off unrolling is desirable.

I'm attaching my example files, plus the assembly output of the Pallene code, with and without the checks removed.

You can use the -S flag to generate the assembly output.
https://godbolt.org is also great, and I recommend pasting in the C programs there. (The Pallene program is harder to use with Compiler Explorer because of the extra headers needed for lua.h.)

[UNROLL_VEC.zip](https://github.com/pallene-lang/pallene/files/10926240/UNROLL_VEC.zip)

Wow, these are very interesting numbers. Thanks for measuring them!

One thing I want to add to Pallene is a separate array datatype that's backed by a C buffer instead of by a Lua table. Similarly to how we have two kinds of records: Pallene records implemented as userdata and Lua records implemented as Lua tables.

A separate type for arrays would avoid run-time tag checks and would also use less RAM (which also speeds things up). The catch is that such arrays cannot be directly manipulated from Lua. If Lua wants to read or write to these arrays it would need to invoke a method or metamethod. What do you think? Would this do what you're looking for?

Add a compiler optimization flag that removes these checks

I'd rather not make Pallene unsafe; it's asking for trouble.

Is there a way we can hoist these checks outside the loop?

Maybe! In theory we could split things into two loops. First we check all the type tags and then the second time around we can speed through the loop without any type checks. However, it's possible that having to go through the array twice will cost more than going through it only once.

maybe an internal flag can be set somewhere that tracks whether an array is homogeneous or not.

Cool idea! And it might be more practical than you think. Roberto told me he's considering changing the implementation of arrays so that type tags are stored separately from the values. Instead of TVTVTVTV, it would be TTTVVVV. (The current implementation wastes some RAM due to padding, because tags are smaller than values). If Lua ends up doing that, maybe we could represent homogenous arrays as just TVVVVVVV, which would be even more of a space save.

Pallene compiler/optimizer manually unrolls loops itself

That's an interesting theory. If you would be willing to test some manual loop unrolling, it would be valuable info.

And it might be more practical than you think

Thinking a bit more, there are some challenges due to tag variants. Lua uses a 8-bit tag with 4 bits for the type and 4 bits for the "variant". Although lists often have an homogenous type, often the variants might differ. For example, boolean true and false are two different variants, as are Lua funciton closures vs C function closures. Another tricky case are arrays with optional or missing values (nil or false). But maybe there's a way to do it...

I meant to respond a week ago, but after doing a bunch of experiments, it took me way longer to clean everything up and write up the results:

Trying 2 separate loops (first check, then work): This yielded poor results. Unfortunately, I don't think this will be viable.


real    0m24.924s
user    0m16.623s
sys 0m5.808s
$ gcc -O3 -funroll-loops -ftree-vectorize -msse2  -fpic -c -o basic_pln.o basic_pln_handmodify_separateloops.c 
$ gcc  -shared -fpic -o basic_pln.so basic_pln.o 
$ time lua basic_lua.lua 
startup time (doesn't count)    17.327992
result: 5.7646075237053e+17
pln loop time   1.866526

real 0m20.586s user 0m17.003s sys 0m3.565s


2. Manual unrolling

This didn't help as much as I hoped. You can see the manual unroll 2 is not much different, but the unroll 4 does get a small performance boost. But my conclusion from this is that the checks themselves are the real bottleneck. Removing those checks yields a ton more speed, and seems to also remove the need to manually unroll the loops since the compiler seems to do it for us.
I think manual loop unrolling may be worth a future revisit down the road when looking for additional performance gains, but for now, I feel the lesson is we should be looking at how to remove these checks from inside the loops.

$ gcc -O3 -funroll-loops -ftree-vectorize -msse2 -fpic -c -o basic_pln.o basic_pln_handmodify_unroll2.c $ gcc -shared -fpic -o basic_pln.so basic_pln.o $ time lua basic_lua.lua startup time (doesn't count) 17.221831 result: 5.7646075129679e+17 pln loop time 1.058078

real 0m19.631s user 0m16.092s sys 0m3.535s

$ gcc -O3 -funroll-loops -ftree-vectorize -msse2 -fpic -c -o basic_pln.o basic_pln_handmodify_unroll4.c $ gcc -shared -fpic -o basic_pln.so basic_pln.o $ time lua basic_lua.lua startup time (doesn't count) 17.402829 result: 5.7646074914931e+17 pln loop time 1.007499

real 0m19.788s user 0m16.286s sys 0m3.483s


3. I did play around with a few different ideas inlining the checks, not inlining the checks, simplifying the checks (including just setting a boolean to be checked at the end of the loop). The problem seems to be that the conditionals themselves slow down the loops considerably and there doesn't seem to be a way to mitigate that.
I also did take note that you used __builtin_expect. I admit I was hoping this would make these checks a freebee, but in reality, I think all they can do is guide the compiler from generating even slower code.

P.S. I *think* pallene_renormalize_array can be simplified from:

static void pallene_renormalize_array( lua_State L, Table arr, lua_Integer i, const char* file,int line ){ lua_Unsigned ui = (lua_Unsigned) i - 1; if (l_unlikely(ui >= arr->alimit)) { pallene_grow_array(L, file, line, arr, ui); } }

to:

inline static void pallene_renormalize_array( lua_State L, Table arr, lua_Integer i, const char* file,int line ){ if (l_unlikely((lua_Unsigned)i > arr->alimit)) { pallene_grow_array(L, file, line, arr, (lua_Unsigned)i-1); } }



I did not see any noticeable impact on performance with this change. It was just one of the things that I tried when trying to manually inline to make things shorter. Thought I would mention it though.

- A. Dedicated Pallene arrays 
In general, I think I like the idea and was something I was thinking about asking for down the road, trying to think through some of the issues (I wrote notes in another document somewhere, which I thought I might eventually share in the future if think the ideas are viable), but there are some high-level caveats:

1. First, a big pain point for me right now with Records is that there is no way to share them across different Pallene modules.
As a simplied example, I might have a common "Point" record with floats of x,y,z. I have a ton of different functions that do things with Points, but for various good reasons, I don't want them in a giant single module. Right now, there isn't a good way to do this.
I kind of want this record problem solved, and don't want to have a repeat problem with the native array type.

2. I actually have not been using records much, because for my use case, I want arrays of records, not just individual records. But because I think I know how this will be laid out in memory (array of pointers, and thus pointer chasing for every element), I have been avoiding them, and keeping multiple arrays of floats for everything. My APIs kind of suck because of that, but I at least can reason about and get reasonable performance.
My hopes would be a native Pallene array would allow for actual contigous arrays of structs, and thus alleviate my performance concerns. If so, this would be a welcome thing. (I try working out some of this in my notes.)

2b. A C API to interact with these Pallene arrays would be very desirable. I would love the ability to say, memcpy, between C and Pallene arrays to transfer data between them really fast. And in extreme cases, being able to get access to the array address in C and know the size and offsets so it can be directly read or written to would be amazing. (My thinking about such a feature is somewhat influenced by how we copy memory buffers from the CPU to the GPU in APIs like OpenGL.)

3. Being able to access Pallene arrays transparently in Lua, and without a major performance hit (i.e. avoid the C/FFI hit we get today with the standard C/Lua API), is highly desirable. I know there might be trade-offs that have to be made, but I really think mostly transparent access between Lua/Pallene is important. Perhaps we can be more restrictive with the arrays of structs, but for arrays of primitives like numbers, I think transparent access becomes more desirable.

3a. Already, because of my previous other crashing bugs, I am frequently switching between Pallene and Lua. If my code had to care about the array type and wasn't transparent, I would be in big trouble.

3b. Particularly for arrays of primitives, it is very nice working with others who are not advanced coders, and only deal with Lua. Allowing them to write code as if it was just a normal float array in Lua makes things so easy.

3c. Similarly, it is nice to be able to pull off the shelf existing functions/modules that operate on regular Lua arrays (of numbers). 

3d. My hope is for at least arrays of primitives, you can both provide the index accessor metamethods, e.g. a[i], and also make these fast. My take-aways from the Lua/C-API are making C-arrays accessible to Lua with index accessors is really slow to cross the FFI bridge. So in the end, we lose more performance than we gain if the user is frequently accessing individual elements on the scripting-side. The problem is this message is not conveyed well in general, and entire eco-systems sometime get built-up around complete misunderstnadings of how things work and hand-wavy beliefs that "C/native is always faster". The Lua communitiy has been better at this than most, 
but I can pick on Python easily here since one of the things I'm writing in Pallene is most popularly done in Python. So far my Pallene version is hundreds, if not thousands, times faster than the usual Python counterpart. This not because Pallene is necessarily fast (but I hope to get there), but because Python is *so* slow. 

Most people using Python don't really understand what's going on under the hood and they cling to tropes that if they use X (substitute X, with NumPy, Pandas, PyPy, C module, etc.), for all their arrays, everything will get magically fast (which it doesn't).
They completely miss the nuiances and important details about how things actually work, for instance, that they are not supposed to access individual elements of the arrays in Python if they are using NumPy/Pandas/C arrays because the FFI overhead kills all the gains they hoped to get by using these in the first place.
And due to the nature of these programs, indivdual element access in Python is *always* done, and frequently, and in loops. And this trope/belief is so ingrained now, that you can't tell anybody differenty.
And from this, there is almost a religion that has formed around Python, because that to make anything fast, you must use these crazy libaries like NumPy/Pandas/etc., so you should never use another language because you won't get access to these kinds of libraries in other langauges, and they are too hard to build. They have in some sense, created a programming tautology, that condemns everybody to use Python and can't get anything faster ("unless you rewrite everything in C++ which is too hard"). And it's a real shame that the scientific community has been writing so much basic math stuff in Python, and been subject to all this non-sensicle slowness.

So with Pallene, I really hope we can make the plain and simple for-loop on arrays of numbers as fast as what you would expect with an optimizing C compiler on the backend. This will avoid recreating all the strange mystiscism seen in Python about needing magical libraries, writing C modules, or needing JIT, to make things fast.
And maybe with Pallene, many scientists will wise up and discover that they can magnitudes faster performance with the simplest, most straightforward code, instead of needing to use layers upon layers of libraries in Python.

So back to Pallene/Lua arrays, so what I hope you can do is use your private Pallene-internal access to Lua-internals to make individual element access on a Pallene array about on par with a regular Lua array.
Then we can avoid the Python nightmare entirely, and provide the world with an option that doesn't compromise simple with fast.
So I think it is fine to limit/restrict customizability to make these as fast as possible, i.e., people can't define custom metamethods for Pallene arrays, if this helps performance.

(I think it is reasonable to treat arrays of records differently than arrays of primitives, or even arrays of numbers, if this can't easily be extended universally.)

3e. This is my first time using LuaLanes. Lanes has limitations about how it handles userdata across lanes, and requires people to implement a custom clone metamethod for their userdata if they want it to be handled by Lanes. Because Lanes is popular enough, I think some modules implement this metamethod. But this made me really appreciate how I was using native Lua data types with Pallene and didn't have to worry about this. It's not to say I don't want better arrays for Pallene, but this is one of those subtle trade-offs I've only recently come to appreciate. (And I can already see the desire for "Pallene Power Patches" to add the clonable metamethod to Pallene arrays so thay can be copied across Lanes :P)

Bottom line is I think these checks inside the for-loop have an outsized negative performance impact, and we should try to find ways to hoist them out of the loops. I believe it is reasonable to assume:
1. If somebody is using Pallene, they want performance.
2. If there is a loop, it is potentially a hot loop.
3. If there is a loop, it could also be an inner loop.

So Pallene loops are a potential source of magnified gains or losses, since these are one of the key places where modern CPUs are designed to extract outsized gains if you use them right.
Thus, I think amortizing checks at insertion or removal with a stored state flag, instead of checking at every read, may be worth it.

One psuedo-algorithm I was thinking of, was that I presume the Lua-internals have a common function point for adding or changing an element in the array part. If so, then if we create a homogeneous bit flag somewhere to track the state, then this can be used exclusively for true homogenity (and thus weed out the subvariants). I'm also just focusing on the "array part" of the Lua table. My thesis is we only care about the array part.

if adding_to_array_part
{
    if array_was_homogeneous and (new_element_type == previous_element_type)
    {
        // still homogenous
    }
    else
    {
        set_not_homogeneous();
    }
}
elseif replacing_array_element
{
    if existing_element_type == new_element_type
    {
        // still homogeneous
    }
    else
    {
        set_not_homogeneous();
    }
}

if removing_array_element
{
    // I don't know what the Lua implemenatation does about holes, or how Pallene is supposed to deal with them
    if makes_a_hole
    {
        set_not_homogeneous();
        do_extra_pallene_prep_stuff_to_deal_with_holes_now_if_possible();
    }
    // also could be 0
    elseif element_count == 1   
    {
        set_homogeneous();
    }
}

Also, for pointer types, maybe the checks aren't as bad since there will be pointer chasing which will probably kill performance anyway. But for numerical arrays/loops, I presume this is where we can get outsized performance gains if we keep the code lean.

I am attaching more source code examples which include some of the experiments I ran.
Also, there are a couple of special files with the *_for_godbolt.c suffix.
I copy and pasted all the Lua headers into these files so you can paste them into Compiler Explorer, so you can easily view the assembly generation.

As a reminder, in Compiler Explorer, set the compiler flags appropritately. As a starting point:
```-O3 -funroll-loops -ftree-vectorize```

Also
-msse2 (default)
or
-mavx2

can be interesting.

Also switching to arm64, I believe automatically utilizes NEON SIMD instructions.

[SET2.zip](https://github.com/pallene-lang/pallene/files/11010055/SET2.zip)

I meant to respond a week ago, but after doing a bunch of experiments, it took me way longer to clean everything up and write up the results:

Sorry, I also had to wait a week to replay :) I was really busy moving to my new apartment and this discussion deserved a well-reasoned response.

Trying 2 separate loops (first check, then work): This yielded poor results. Unfortunately, I don't think this will be viable.

Is it 24 seconds for the two loops (about 12 seconds per loop)? If so, it would be an impressive speedup regardless, albeit not the 2x speedup we'd need to blindly apply this code transformation everywhere.

2. Manual unrolling
This didn't help as much as I hoped. You can see the manual unroll 2 is not much different, but the unroll 4 does get a small performance boost. But my conclusion from this is that the checks themselves are the real bottleneck. Removing those checks yields a ton more speed, and seems to also remove the need to manually unroll the loops since the compiler seems to do it for us. I think manual loop unrolling may be worth a future revisit down the road when looking for additional performance gains, but for now, I feel the lesson is we should be looking at how to remove these checks from inside the loops.

Had the compiler been able to vectorize the loop, I would have expected a larger performance boost. Maybe it's still not being able to vectorize it? Perhaps there is a way to refactor those checks so the loop can be vectorized. That is, some way to check all 4 checks at once instead of doing four checks in sequence. However, there's a good chance that this would only be possible with compiler vector intrinsics because because a complex loop body might be too much for the auto-vectorizer.

3. I did play around with a few different ideas inlining the checks, not inlining the checks, simplifying the checks (including just setting a boolean to be checked at the end of the loop). The problem seems to be that the conditionals themselves slow down the loops considerably and there doesn't seem to be a way to mitigate that.
   I also did take note that you used __builtin_expect. I admit I was hoping this would make these checks a freebee, but in reality, I think all they can do is guide the compiler from generating even slower code.

Typically __builtin_expect doesn't make a big difference. However, __builtin_unreachable and __attribute__(noreturn) sometimes do. Pay attention to those. In theory our error functions should have been annotated with them though so I don't see what we could change.

P.S. I think pallene_renormalize_array can be simplified from: lua_Unsigned ui = (lua_Unsigned) i - 1;

At a first glance I would expect that the compiler would produce identical code. Am I missing something here?

* A. Dedicated Pallene arrays
  In general, I think I like the idea and was something I was thinking about asking for down the road, trying to think through some of the issues (I wrote notes in another document somewhere, which I thought I might eventually share in the future if think the ideas are viable), but there are some high-level caveats:

1. First, a big pain point for me right now with Records is that there is no way to share them across different Pallene modules.

The lack of module system bites us again. This is not an all a fundamental limitation of Pallene, just that we haven't had the time to properly implement this feature. (If you want to help with this, it would be very much appreciated; someone who's already writing multi-file pallene programs should have very useful insights about how the module system should work)

2. I actually have not been using records much, because for my use case, I want arrays of records, not just individual records. But because I think I know how this will be laid out in memory (array of pointers, and thus pointer chasing for every element), I have been avoiding them, and keeping multiple arrays of floats for everything. My APIs kind of suck because of that, but I at least can reason about and get reasonable performance.>     2. I actually have not been using records much, because for my use case, I want arrays of records, not just individual records. But because I think I know how

I feel that this in an argument in favor of at least adding support for arrays of simple non-pointer types, as a first step.

   My hopes would be a native Pallene array would allow for actual contigous arrays of structs, and thus alleviate my performance concerns. If so, this would be a welcome thing. (I try working out some of this in my notes.)

Let me know! My main question is whether special arrays fit better as an FFI thing (because one might want to use arrays of structs made by C) or as a core part of Pallene language (like records).

2b. A C API to interact with these Pallene arrays would be very desirable. I would love the ability to say, memcpy, between C and Pallene arrays to transfer data between them really fast. And in extreme cases, being able to get access to the array address in C and know the size and offsets so it can be directly read or written to would be amazing. (My thinking about such a feature is somewhat influenced by how we copy memory buffers from the CPU to the GPU in APIs like OpenGL.)

Good point. Although it sounds more like a Pallene API to interact with C arrays.

3. Being able to access Pallene arrays transparently in Lua, and without a major performance hit (i.e. avoid the C/FFI hit we get today with the standard C/Lua API), is highly desirable. I know there might be trade-offs that have to be made, but I really think mostly transparent access between Lua/Pallene is important. Perhaps we can be more restrictive with the arrays of structs, but for arrays of primitives like numbers, I think transparent access becomes more desirable.

You hit the nail on the head! This might be the biggest trade-offs between Lua tables and malloc-ed arrays. With records we do not expose an __index metamethod, so the Lua side is warned that accessing a field will cost a function call.

3a. Already, because of my previous other crashing bugs, I am frequently switching between Pallene and Lua. If my code had to care about the array type and wasn't transparent, I would be in big trouble.

3b. Particularly for arrays of primitives, it is very nice working with others who are not advanced coders, and only deal with Lua. Allowing them to write code as if it was just a normal float array in Lua makes things so easy.

3c. Similarly, it is nice to be able to pull off the shelf existing functions/modules that operate on regular Lua arrays (of numbers).

For sure. But we do have to live with the fact that some optimizations are harder to do on Lua arrays...

3d. My hope is for at least arrays of primitives, you can both provide the index accessor metamethods, e.g. a[i], and also make these fast. My take-aways from the Lua/C-API are making C-arrays accessible to Lua with index accessors is really slow to cross the FFI bridge. So in the end, we lose more performance than we gain if the user is frequently accessing individual elements on the scripting-side.

I'm afraid that fast metamethods might be impossible. The issue is less about the FFI bridge and more about the metamethod call itself. A Lua function call to implement the metamethod is more expensive than a raw GETINDEX operation.

The problem is this message is not conveyed well in general, and entire eco-systems sometime get built-up around complete misunderstnadings of how things work and hand-wavy beliefs that "C/native is always faster". The Lua communitiy has been better at this than most, but I can pick on Python easily here since one of the things I'm writing in Pallene is most popularly done in Python. So far my Pallene version is hundreds, if not thousands, times faster than the usual Python counterpart. This not because Pallene is necessarily fast (but I hope to get there), but because Python is so slow.

So with Pallene, I really hope we can make the plain and simple for-loop on arrays of numbers as fast as what you would expect with an optimizing C compiler on the backend. This will avoid recreating all the strange mystiscism seen in Python about needing magical libraries, writing C modules, or needing JIT, to make things fast. And maybe with Pallene, many scientists will wise up and discover that they can magnitudes faster performance with the simplest, most straightforward code, instead of needing to use layers upon layers of libraries in Python. So back to Pallene/Lua arrays, so what I hope you can do is use your private Pallene-internal access to Lua-internals to make individual element access on a Pallene array about on par with a regular Lua array.

In theory, Pallene for loops can be fast when accessing a Lua table and even faster if we add a custom array type. Lua loops will be regular speed for Lua tables but much slower for a custom array type.

3e. This is my first time using LuaLanes. Lanes has limitations about how it handles userdata across lanes, and requires people to implement a custom clone metamethod for their userdata if they want it to be handled by Lanes.

Noted.

Bottom line is I think these checks inside the for-loop have an outsized negative performance impact, and we should try to find ways to hoist them out of the loops. I believe it is reasonable to assume:

This is a laudable goal, although for now I'm not sure if it's something we can do...

So Pallene loops are a potential source of magnified gains or losses, since these are one of the key places where modern CPUs are designed to extract outsized gains if you use them right. Thus, I think amortizing checks at insertion or removal with a stored state flag, instead of checking at every read, may be worth it.

One problem is that if at any point the array "escapes" or we call a Lua function, then we can no longer optimize the tag checks because Lua might have modified the array behind our backs. This can make the optimization fragile, because small changes to the program might prevent the optimization from kicking in.

One psuedo-algorithm I was thinking of, was that I presume the Lua-internals have a common function point for adding or changing an element in the array part. If so, then if we create a homogeneous bit flag somewhere to track the state, then this can be used exclusively for true homogenity (and thus weed out the subvariants). I'm also just focusing on the "array part" of the Lua table. My thesis is we only care about the array part.

Just to be clear, this change would be inside the Lua interpreter or inside the code emitted by the Pallene compiler?

Also, for pointer types, maybe the checks aren't as bad since there will be pointer chasing which will probably kill performance anyway. But for numerical arrays/loops, I presume this is where we can get outsized performance gains if we keep the code lean.

Right.

pallene-lang / pallene

Performance: Pallene checks in for-loops are preventing good compiler optimizations from being applied (loop unrolling & autovectorization) #564