ewmailing opened 1 year ago
Wow, these are very interesting numbers. Thanks for measuring them!
One thing I want to add to Pallene is a separate array datatype that's backed by a C buffer instead of by a Lua table, similar to how we have two kinds of records: Pallene records implemented as userdata and Lua records implemented as Lua tables.
A separate type for arrays would avoid run-time tag checks and would also use less RAM (which also speeds things up). The catch is that such arrays cannot be directly manipulated from Lua. If Lua wants to read or write to these arrays it would need to invoke a method or metamethod. What do you think? Would this do what you're looking for?
> Add a compiler optimization flag that removes these checks
I'd rather not make Pallene unsafe; it's asking for trouble.
> Is there a way we can hoist these checks outside the loop?
Maybe! In theory we could split things into two loops. First we check all the type tags and then the second time around we can speed through the loop without any type checks. However, it's possible that having to go through the array twice will cost more than going through it only once.
> maybe an internal flag can be set somewhere that tracks whether an array is homogeneous or not.
Cool idea! And it might be more practical than you think. Roberto told me he's considering changing the implementation of arrays so that type tags are stored separately from the values. Instead of TVTVTVTV, it would be TTTTVVVV. (The current implementation wastes some RAM due to padding, because tags are smaller than values.) If Lua ends up doing that, maybe we could represent homogeneous arrays as a single T followed by VVVV, which would save even more space.
> Pallene compiler/optimizer manually unrolls loops itself
That's an interesting theory. If you would be willing to test some manual loop unrolling, it would be valuable info.
> And it might be more practical than you think
Thinking a bit more, there are some challenges due to tag variants. Lua uses an 8-bit tag with 4 bits for the type and 4 bits for the "variant". Although lists often have a homogeneous type, the variants might still differ. For example, boolean true and false are two different variants, as are Lua function closures vs C function closures. Another tricky case is arrays with optional or missing values (nil or false). But maybe there's a way to do it...
I meant to respond a week ago, but after doing a bunch of experiments, it took me way longer to clean everything up and write up the results:
1. Trying 2 separate loops (first check, then work)

This yielded poor results. Unfortunately, I don't think this will be viable.

```
real 0m24.924s
user 0m16.623s
sys  0m5.808s
```

```
$ gcc -O3 -funroll-loops -ftree-vectorize -msse2 -fpic -c -o basic_pln.o basic_pln_handmodify_separateloops.c
$ gcc -shared -fpic -o basic_pln.so basic_pln.o
$ time lua basic_lua.lua
startup time (doesn't count) 17.327992
result: 5.7646075237053e+17
pln loop time 1.866526

real 0m20.586s
user 0m17.003s
sys  0m3.565s
```
2. Manual unrolling
This didn't help as much as I hoped. You can see the manual unroll-by-2 is not much different, while the unroll-by-4 does get a small performance boost. My conclusion is that the checks themselves are the real bottleneck: removing them yields far more speed, and it also seems to remove the need to manually unroll the loops, since the compiler then does it for us.
I think manual loop unrolling may be worth revisiting down the road when looking for additional performance gains, but for now, I feel the lesson is that we should be looking at how to remove these checks from inside the loops.
```
$ gcc -O3 -funroll-loops -ftree-vectorize -msse2 -fpic -c -o basic_pln.o basic_pln_handmodify_unroll2.c
$ gcc -shared -fpic -o basic_pln.so basic_pln.o
$ time lua basic_lua.lua
startup time (doesn't count) 17.221831
result: 5.7646075129679e+17
pln loop time 1.058078

real 0m19.631s
user 0m16.092s
sys  0m3.535s
```
```
$ gcc -O3 -funroll-loops -ftree-vectorize -msse2 -fpic -c -o basic_pln.o basic_pln_handmodify_unroll4.c
$ gcc -shared -fpic -o basic_pln.so basic_pln.o
$ time lua basic_lua.lua
startup time (doesn't count) 17.402829
result: 5.7646074914931e+17
pln loop time 1.007499

real 0m19.788s
user 0m16.286s
sys  0m3.483s
```
3. I played around with a few different ideas: inlining the checks, not inlining the checks, and simplifying the checks (including just setting a boolean to be checked at the end of the loop). The problem seems to be that the conditionals themselves slow down the loops considerably, and there doesn't seem to be a way to mitigate that.
I also took note that you used __builtin_expect. I admit I was hoping this would make the checks a freebie, but in reality, I think all it can do is keep the compiler from generating even slower code.
P.S. I *think* pallene_renormalize_array can be simplified from:
```c
static void pallene_renormalize_array(
    lua_State *L, Table *arr, lua_Integer i,
    const char *file, int line
){
    lua_Unsigned ui = (lua_Unsigned) i - 1;
    if (l_unlikely(ui >= arr->alimit)) {
        pallene_grow_array(L, file, line, arr, ui);
    }
}
```
to:
```c
inline static void pallene_renormalize_array(
    lua_State *L, Table *arr, lua_Integer i,
    const char *file, int line
){
    if (l_unlikely((lua_Unsigned)i > arr->alimit)) {
        pallene_grow_array(L, file, line, arr, (lua_Unsigned)i - 1);
    }
}
```
I did not see any noticeable impact on performance with this change. It was just one of the things that I tried when trying to manually inline to make things shorter. Thought I would mention it though.
- A. Dedicated Pallene arrays
In general, I like the idea; it was something I was thinking about asking for down the road while trying to think through some of the issues. (I wrote notes in another document somewhere, which I might eventually share if I think the ideas are viable.) But there are some high-level caveats:
1. First, a big pain point for me right now with Records is that there is no way to share them across different Pallene modules.
As a simplified example, I might have a common "Point" record with float fields x, y, z. I have a ton of different functions that do things with Points, but for various good reasons, I don't want them all in one giant module. Right now, there isn't a good way to do this.
I kind of want this record problem solved, and don't want to have a repeat problem with the native array type.
2. I actually have not been using records much, because for my use case, I want arrays of records, not just individual records. But because I think I know how this will be laid out in memory (array of pointers, and thus pointer chasing for every element), I have been avoiding them, and keeping multiple arrays of floats for everything. My APIs kind of suck because of that, but at least I can reason about them and get reasonable performance.
My hope is that a native Pallene array would allow for actual contiguous arrays of structs, and thus alleviate my performance concerns. If so, this would be a welcome thing. (I tried working out some of this in my notes.)
2b. A C API to interact with these Pallene arrays would be very desirable. I would love the ability to, say, memcpy between C and Pallene arrays to transfer data between them really fast. And in extreme cases, being able to get the array's address in C, and know its size and offsets so it can be directly read or written, would be amazing. (My thinking about such a feature is somewhat influenced by how we copy memory buffers from the CPU to the GPU in APIs like OpenGL.)
3. Being able to access Pallene arrays transparently in Lua, and without a major performance hit (i.e. avoid the C/FFI hit we get today with the standard C/Lua API), is highly desirable. I know there might be trade-offs that have to be made, but I really think mostly transparent access between Lua/Pallene is important. Perhaps we can be more restrictive with the arrays of structs, but for arrays of primitives like numbers, I think transparent access becomes more desirable.
3a. Already, because of my previous other crashing bugs, I am frequently switching between Pallene and Lua. If my code had to care about the array type and wasn't transparent, I would be in big trouble.
3b. Particularly for arrays of primitives, it is very nice working with others who are not advanced coders, and only deal with Lua. Allowing them to write code as if it was just a normal float array in Lua makes things so easy.
3c. Similarly, it is nice to be able to pull existing off-the-shelf functions/modules that operate on regular Lua arrays (of numbers).
3d. My hope is that, at least for arrays of primitives, you can both provide the index accessor metamethods, e.g. a[i], and also make them fast. My takeaway from the Lua/C API is that making C arrays accessible to Lua with index accessors is really slow because of crossing the FFI bridge. So in the end, we lose more performance than we gain if the user frequently accesses individual elements on the scripting side. The problem is that this message is not conveyed well in general, and entire ecosystems sometimes get built up around complete misunderstandings of how things work and hand-wavy beliefs that "C/native is always faster". The Lua community has been better at this than most,
but I can pick on Python easily here since one of the things I'm writing in Pallene is most popularly done in Python. So far my Pallene version is hundreds, if not thousands, of times faster than the usual Python counterpart. This is not because Pallene is necessarily fast (but I hope to get there), but because Python is *so* slow.
Most people using Python don't really understand what's going on under the hood, and they cling to tropes that if they use X (substitute X with NumPy, Pandas, PyPy, a C module, etc.) for all their arrays, everything will get magically fast (which it doesn't).
They completely miss the nuances and important details about how things actually work; for instance, that they are not supposed to access individual elements of the arrays in Python if they are using NumPy/Pandas/C arrays, because the FFI overhead kills all the gains they hoped to get by using these in the first place.
And due to the nature of these programs, individual element access in Python is *always* done, and frequently, and in loops. And this trope/belief is so ingrained now that you can't tell anybody differently.
And from this, almost a religion has formed around Python: to make anything fast, you must use these crazy libraries like NumPy/Pandas/etc., so you should never use another language because you won't get access to these kinds of libraries in other languages, and they are too hard to build. They have, in some sense, created a programming tautology that condemns everybody to use Python and never get anything faster ("unless you rewrite everything in C++, which is too hard"). And it's a real shame that the scientific community has been writing so much basic math stuff in Python and been subject to all this nonsensical slowness.
So with Pallene, I really hope we can make the plain and simple for-loop on arrays of numbers as fast as what you would expect from an optimizing C compiler on the backend. This will avoid recreating all the strange mysticism seen in Python about needing magical libraries, writing C modules, or needing a JIT, to make things fast.
And maybe with Pallene, many scientists will wise up and discover that they can get orders of magnitude faster performance with the simplest, most straightforward code, instead of needing to use layers upon layers of libraries in Python.
So back to Pallene/Lua arrays: what I hope you can do is use your private Pallene-internal access to Lua internals to make individual element access on a Pallene array about on par with a regular Lua array.
Then we can avoid the Python nightmare entirely, and provide the world with an option that doesn't force a choice between simple and fast.
So I think it is fine to limit/restrict customizability to make these as fast as possible, i.e., people can't define custom metamethods for Pallene arrays, if this helps performance.
(I think it is reasonable to treat arrays of records differently than arrays of primitives, or even arrays of numbers, if this can't easily be extended universally.)
3e. This is my first time using LuaLanes. Lanes has limitations about how it handles userdata across lanes, and requires people to implement a custom clone metamethod for their userdata if they want it to be handled by Lanes. Because Lanes is popular enough, I think some modules implement this metamethod. But this made me really appreciate how I was using native Lua data types with Pallene and didn't have to worry about this. It's not that I don't want better arrays for Pallene, but this is one of those subtle trade-offs I've only recently come to appreciate. (And I can already see the desire for "Pallene Power Patches" to add a clonable metamethod to Pallene arrays so they can be copied across Lanes :P)
Bottom line is I think these checks inside the for-loop have an outsized negative performance impact, and we should try to find ways to hoist them out of the loops. I believe it is reasonable to assume:
1. If somebody is using Pallene, they want performance.
2. If there is a loop, it is potentially a hot loop.
3. If there is a loop, it could also be an inner loop.
So Pallene loops are a potential source of magnified gains or losses, since these are one of the key places where modern CPUs are designed to extract outsized gains if you use them right.
Thus, I think amortizing checks at insertion or removal with a stored state flag, instead of checking at every read, may be worth it.
One pseudo-algorithm I was thinking of: I presume the Lua internals have a common function point for adding or changing an element in the array part. If so, and if we create a homogeneous bit flag somewhere to track the state, then this flag can be used exclusively for true homogeneity (and thus weed out the subvariants). I'm also focusing only on the "array part" of the Lua table. My thesis is we only care about the array part.
```
if adding_to_array_part {
    if array_was_homogeneous and (new_element_type == previous_element_type) {
        // still homogeneous
    } else {
        set_not_homogeneous()
    }
} else if replacing_array_element {
    if existing_element_type == new_element_type {
        // still homogeneous
    } else {
        set_not_homogeneous()
    }
} else if removing_array_element {
    // I don't know what the Lua implementation does about holes,
    // or how Pallene is supposed to deal with them
    if makes_a_hole {
        set_not_homogeneous()
        do_extra_pallene_prep_stuff_to_deal_with_holes_now_if_possible()
    } else if element_count == 1 {   // could also be 0
        set_homogeneous()
    }
}
```
Also, for pointer types, maybe the checks aren't as bad since there will be pointer chasing which will probably kill performance anyway. But for numerical arrays/loops, I presume this is where we can get outsized performance gains if we keep the code lean.
I am attaching more source code examples which include some of the experiments I ran.
Also, there are a couple of special files with the *_for_godbolt.c suffix.
I copied and pasted all the Lua headers into these files so you can paste them into Compiler Explorer and easily view the assembly generation.
As a reminder, in Compiler Explorer, set the compiler flags appropriately. As a starting point:
```-O3 -funroll-loops -ftree-vectorize```
Also, `-msse2` (the default) or `-mavx2` can be interesting.
Also, switching to arm64 will, I believe, automatically utilize NEON SIMD instructions.
[SET2.zip](https://github.com/pallene-lang/pallene/files/11010055/SET2.zip)
> I meant to respond a week ago, but after doing a bunch of experiments, it took me way longer to clean everything up and write up the results:
Sorry, I also had to wait a week to reply :) I was really busy moving to my new apartment and this discussion deserved a well-reasoned response.
> - Trying 2 separate loops (first check, then work): This yielded poor results. Unfortunately, I don't think this will be viable.
Is it 24 seconds for the two loops (about 12 seconds per loop)? If so, it would be an impressive speedup regardless, albeit not the 2x speedup we'd need to blindly apply this code transformation everywhere.
> 2. Manual unrolling
> This didn't help as much as I hoped. You can see the manual unroll-by-2 is not much different, but the unroll-by-4 does get a small performance boost. But my conclusion from this is that the checks themselves are the real bottleneck. Removing those checks yields a ton more speed, and seems to also remove the need to manually unroll the loops, since the compiler seems to do it for us. I think manual loop unrolling may be worth revisiting down the road when looking for additional performance gains, but for now, I feel the lesson is we should be looking at how to remove these checks from inside the loops.
Had the compiler been able to vectorize the loop, I would have expected a larger performance boost. Maybe it's still not able to vectorize it? Perhaps there is a way to refactor those checks so the loop can be vectorized. That is, some way to do all 4 checks at once instead of doing four checks in sequence. However, there's a good chance that this would only be possible with compiler vector intrinsics, because a complex loop body might be too much for the auto-vectorizer.
> 3. I did play around with a few different ideas: inlining the checks, not inlining the checks, simplifying the checks (including just setting a boolean to be checked at the end of the loop). The problem seems to be that the conditionals themselves slow down the loops considerably and there doesn't seem to be a way to mitigate that. I also took note that you used __builtin_expect. I admit I was hoping this would make these checks a freebie, but in reality, I think all they can do is keep the compiler from generating even slower code.
Typically `__builtin_expect` doesn't make a big difference. However, `__builtin_unreachable` and `__attribute__((noreturn))` sometimes do. Pay attention to those. In theory our error functions should already be annotated with them, though, so I don't see what we could change.
> P.S. I think pallene_renormalize_array can be simplified from: `lua_Unsigned ui = (lua_Unsigned) i - 1;`
At a first glance I would expect that the compiler would produce identical code. Am I missing something here?
> A. Dedicated Pallene arrays: In general, I think I like the idea and was something I was thinking about asking for down the road, trying to think through some of the issues, but there are some high-level caveats: 1. First, a big pain point for me right now with Records is that there is no way to share them across different Pallene modules.
The lack of a module system bites us again. This is not at all a fundamental limitation of Pallene; we just haven't had the time to properly implement this feature. (If you want to help with this, it would be very much appreciated; someone who's already writing multi-file Pallene programs should have very useful insights about how the module system should work.)
> 2. I actually have not been using records much, because for my use case, I want arrays of records, not just individual records. But because I think I know how this will be laid out in memory (array of pointers, and thus pointer chasing for every element), I have been avoiding them, and keeping multiple arrays of floats for everything. My APIs kind of suck because of that, but I at least can reason about and get reasonable performance.
I feel that this is an argument in favor of at least adding support for arrays of simple non-pointer types, as a first step.
> My hope is that a native Pallene array would allow for actual contiguous arrays of structs, and thus alleviate my performance concerns. If so, this would be a welcome thing. (I tried working out some of this in my notes.)
Let me know! My main question is whether special arrays fit better as an FFI thing (because one might want to use arrays of structs made by C) or as a core part of Pallene language (like records).
> 2b. A C API to interact with these Pallene arrays would be very desirable. I would love the ability to, say, memcpy between C and Pallene arrays to transfer data between them really fast. And in extreme cases, being able to get the array's address in C and know its size and offsets so it can be directly read or written would be amazing. (My thinking about such a feature is somewhat influenced by how we copy memory buffers from the CPU to the GPU in APIs like OpenGL.)
Good point. Although it sounds more like a Pallene API to interact with C arrays.
> 3. Being able to access Pallene arrays transparently in Lua, and without a major performance hit (i.e. avoid the C/FFI hit we get today with the standard C/Lua API), is highly desirable. I know there might be trade-offs that have to be made, but I really think mostly transparent access between Lua/Pallene is important. Perhaps we can be more restrictive with the arrays of structs, but for arrays of primitives like numbers, I think transparent access becomes more desirable.
You hit the nail on the head! This might be the biggest trade-off between Lua tables and malloc-ed arrays. With records we do not expose an __index metamethod, so the Lua side is warned that accessing a field will cost a function call.
> 3a. Already, because of my previous other crashing bugs, I am frequently switching between Pallene and Lua. If my code had to care about the array type and wasn't transparent, I would be in big trouble.
> 3b. Particularly for arrays of primitives, it is very nice working with others who are not advanced coders, and only deal with Lua. Allowing them to write code as if it was just a normal float array in Lua makes things so easy.
> 3c. Similarly, it is nice to be able to pull existing off-the-shelf functions/modules that operate on regular Lua arrays (of numbers).
For sure. But we do have to live with the fact that some optimizations are harder to do on Lua arrays...
> 3d. My hope is that, at least for arrays of primitives, you can both provide the index accessor metamethods, e.g. a[i], and also make them fast. My takeaway from the Lua/C API is that making C arrays accessible to Lua with index accessors is really slow because of crossing the FFI bridge. So in the end, we lose more performance than we gain if the user frequently accesses individual elements on the scripting side.
I'm afraid that fast metamethods might be impossible. The issue is less about the FFI bridge and more about the metamethod call itself. A Lua function call to implement the metamethod is more expensive than a raw GETINDEX operation.
> The problem is that this message is not conveyed well in general, and entire ecosystems sometimes get built up around complete misunderstandings of how things work and hand-wavy beliefs that "C/native is always faster". The Lua community has been better at this than most, but I can pick on Python easily here since one of the things I'm writing in Pallene is most popularly done in Python. So far my Pallene version is hundreds, if not thousands, of times faster than the usual Python counterpart. This is not because Pallene is necessarily fast (but I hope to get there), but because Python is so slow.
> So with Pallene, I really hope we can make the plain and simple for-loop on arrays of numbers as fast as what you would expect from an optimizing C compiler on the backend. This will avoid recreating all the strange mysticism seen in Python about needing magical libraries, writing C modules, or needing a JIT, to make things fast. And maybe with Pallene, many scientists will wise up and discover that they can get orders of magnitude faster performance with the simplest, most straightforward code, instead of needing layers upon layers of libraries in Python. So back to Pallene/Lua arrays: what I hope you can do is use your private Pallene-internal access to Lua internals to make individual element access on a Pallene array about on par with a regular Lua array.
In theory, Pallene for loops can be fast when accessing a Lua table and even faster if we add a custom array type. Lua loops will be regular speed for Lua tables but much slower for a custom array type.
> 3e. This is my first time using LuaLanes. Lanes has limitations about how it handles userdata across lanes, and requires people to implement a custom clone metamethod for their userdata if they want it to be handled by Lanes.
Noted.
> Bottom line is I think these checks inside the for-loop have an outsized negative performance impact, and we should try to find ways to hoist them out of the loops. I believe it is reasonable to assume:
This is a laudable goal, although for now I'm not sure if it's something we can do...
> So Pallene loops are a potential source of magnified gains or losses, since these are one of the key places where modern CPUs are designed to extract outsized gains if you use them right. Thus, I think amortizing checks at insertion or removal with a stored state flag, instead of checking at every read, may be worth it.
One problem is that if at any point the array "escapes" or we call a Lua function, then we can no longer optimize the tag checks because Lua might have modified the array behind our backs. This can make the optimization fragile, because small changes to the program might prevent the optimization from kicking in.
> One pseudo-algorithm I was thinking of: I presume the Lua internals have a common function point for adding or changing an element in the array part. If so, then if we create a homogeneous bit flag somewhere to track the state, it can be used exclusively for true homogeneity (and thus weed out the subvariants). I'm also just focusing on the "array part" of the Lua table. My thesis is we only care about the array part.
Just to be clear, this change would be inside the Lua interpreter or inside the code emitted by the Pallene compiler?
> Also, for pointer types, maybe the checks aren't as bad since there will be pointer chasing which will probably kill performance anyway. But for numerical arrays/loops, I presume this is where we can get outsized performance gains if we keep the code lean.
Right.
This is a bit of a follow-up to this other performance thread I posted at https://github.com/pallene-lang/pallene/discussions/561.
I've been looking a little deeper into the compiler's ability to unroll loops and autovectorize Pallene-generated C code. For context, the era of instruction-level parallelism and SIMD is well upon us, and this is one of the few ways to harvest outsized gains on CPUs that are no longer ramping up clock speeds. This trend only seems to be continuing. So we should be on constant lookout for ways to take advantage of the hardware when there are opportunities (especially since the C compiler/optimizer is willing to do most of the hard work for us if we just do the "right" things).
TLDR: I believe the checks generated in Pallene for-loops are preventing the compiler/optimizer from generating much better code, and thus sacrificing non-trivial amounts of performance gains we otherwise might be able to get for free. I'm hoping we can find a way to change things, such that the compiler can apply its much better optimizations.
So I experimented with a very simple for-loop that iterates through an array of doubles and adds up all the values in the array. I wrote two reference programs in C, and a program in Pallene.
This is my simplified interpretation of what the internal Lua array looks like. (The real Lua uses a union as the first field, so this is simplified to reduce the union down to a double.) So I believe this case should represent the theoretical max of what we hope the emitted Pallene code can achieve.
In all the results, the "loop time" is the important part. Startup time doesn't count.
```
real 0m3.035s
user 0m1.311s
sys  0m1.723s
```

```
$ gcc -O3 -ftree-vectorize -funroll-loops container.c
$ time ./a.out
startup time (doesn't count) 3.736625
result: 576460751296790656.000000
c loop time 0.843595

real 0m5.075s
user 0m1.905s
sys  0m3.170s
```

```
$ pallenec basic_pln.pln
$ time lua basic_lua.lua
startup time (doesn't count) 17.048125
result: 5.7646075237053e+17
pln loop time 1.068943

real 0m19.432s
user 0m15.983s
sys  0m3.443s
```

```
$ export CFLAGS='-O3 -ftree-vectorize -funroll-loops'
$ pallenec basic_pln.pln
$ time lua basic_lua.lua
startup time (doesn't count) 17.031442
result: 5.7646075237053e+17
pln loop time 1.063149

real 0m19.425s
user 0m15.915s
sys  0m3.506s
```
```c
// pallene_renormalize_array(L, x1, x4, PALLENE_SOURCE_FILE, 9);
/* if (l_unlikely(!ttisfloat(slot))) {
       pallene_runtime_tag_check_error(L, "basic_pln.pln", 9, LUA_VNUMFLT, rawtt(slot), "array element");
   } */
```
```
$ pallenec --emit-c basic_pln.pln
$ cp basic_pln.c basic_pln_handmodify.c
$ gcc -O3 -funroll-loops -ftree-vectorize -fpic -c -o basic_pln.o basic_pln_handmodify.c
$ gcc -shared -fpic -o basic_pln.so basic_pln.o
$ time lua basic_lua.lua
startup time (doesn't count) 17.113667
result: 5.7646075237053e+17
pln loop time 0.84382

real 0m19.298s
user 0m15.777s
sys  0m3.520s
```
```c
// pallene_renormalize_array(L, x1, x4, PALLENE_SOURCE_FILE, 9);
```

```
$ pallenec --emit-c basic_pln.pln
$ cp basic_pln.c basic_pln_handmodify.c
$ gcc -O3 -funroll-loops -ftree-vectorize -fpic -c -o basic_pln.o basic_pln_handmodify.c
$ gcc -shared -fpic -o basic_pln.so basic_pln.o
$ time lua basic_lua.lua
startup time (doesn't count) 17.080507
result: 5.7646075237053e+17
pln loop time 0.928175

real 0m19.333s
user 0m15.920s
sys  0m3.412s
```
```c
/* if (l_unlikely(!ttisfloat(slot))) {
       pallene_runtime_tag_check_error(L, "basic_pln.pln", 9, LUA_VNUMFLT, rawtt(slot), "array element");
   } */
```

```
$ gcc -O3 -funroll-loops -ftree-vectorize -fpic -c -o basic_pln.o basic_pln_handmodify.c
$ gcc -shared -fpic -o basic_pln.so basic_pln.o
$ time lua basic_lua.lua
startup time (doesn't count) 17.226762
result: 5.7646075237053e+17
pln loop time 1.066411

real 0m19.626s
user 0m16.137s
sys  0m3.488s
```