
Slow string performance #505

Open · ewmailing opened this issue 3 years ago

ewmailing commented 3 years ago

Discussed in https://github.com/pallene-lang/pallene/discussions/504

Originally posted by **ewmailing** November 11, 2021

Hi, I'm not doing any serious benchmarking yet, but I have been spot-checking here and there to see what Pallene's performance is like.

For some context, I've been writing a little stock backtester. The basic idea is that you load a bunch of historical stock data and iterate through it. For each tick, you run some custom rules depending on what ideas you want to test. Usually you compute various technical indicators here (which typically involves running through arrays and doing math operations) and make decisions (branches) to buy and sell. I speculated that Pallene might do well in this type of use case, particularly the iterating through arrays and doing math. I was hoping that native execution, plus lower overhead (helping things like cache locality and prefetching), might yield good results in Pallene.

As a quick and dirty benchmark reference, I used the Python-based Backtesting.py library with a stripped-down version of their hello-world tutorial, which makes a simple buy/sell decision based on Simple Moving Average (SMA) cross-overs. I then implemented my approximate Pallene equivalent from scratch. I haven't done much Python development, so I immediately got first-hand experience of how slow Python is. Wow. And in this hello world, the SMA arrays are pre-computed at init, so there is no serious math crunching done in the main loop. So this isn't just an issue with Python big nums being slow; many other things in Python are slow too. Anyway, I digress.

To make measuring a little clearer, I put a for-loop of 1000 around the main loops of the programs to extend the run times. This helps mitigate the effect of Python's slow loading time. Python took 58 seconds, Pallene took 0.3 seconds, and Lua (using Pallene's --emit-lua) took 0.9 seconds. So Pallene was about 3 times faster than pure Lua, and both were an order of magnitude faster than Python.

I then wrote a second, math-heavy experiment centered around options trading, which has a lot more number crunching (and uses heavy instructions like math.ln). Unfortunately, Backtesting.py doesn't have any options support, so I could not compare to Python. But Pallene was about 4x faster than Lua in this case.

After that, I needed to implement some graphs so I could visualize my backtester's results. I am doing a fast and dirty thing, so I am generating HTML/JS graphs that can be opened in a web browser. Basically, my code needs to take all the data in the program and write out a web page containing a JavaScript program that generates the desired graph. So there is a lot of string work here. Since there are few string built-ins in Pallene right now, my code is pretty dumb and involves hard-coded strings in my source concatenated with on-the-fly strings, using the .. operator.

What has surprised me about the performance is that Pallene seems to be slower than Lua (via --emit-lua) in this case. I compiled with -O2 and -O3 to make sure it wasn't just a compiler optimization problem, but Pallene was still slower. I didn't investigate further, so I don't know if the bottleneck is the actual string code or Pallene's io.write function. In Pallene, it took 1.12 seconds; in Lua, it took 0.27 seconds. (I did not loop these 1000 times, and this measurement covers only the graph generation function, not the entire program, whereas earlier I was timing the whole program from launch to exit.)

Also, I think Pallene's start-up time may be a little slower, although I didn't do a clean measurement of start-up times; in any case, that start-up time is negligible compared to the string handling difference. I'm guessing this means loading the .so takes longer than loading/interpreting the emitted Lua script.

Anyway, I was wondering if this performance issue was already known, and I thought I would mention it in case it wasn't. I'm thinking of rewriting/moving the graph generation code to Lua anyway, because I think I want to use string.gsub to make it easier to insert code into my JavaScript template.
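Roughly, the "loop of 1000" measurement approach looks like the following minimal Lua sketch (placeholder data and function names, not the actual benchmark code):

```lua
-- Minimal sketch of the measurement harness described above (hypothetical
-- names and data): repeat the main backtest loop 1000 times so the run time
-- dwarfs start-up costs, then report the elapsed time.
local function main_loop(prices)
    -- stand-in for the real per-tick rules and indicator math
    local acc = 0.0
    for i = 1, #prices do
        acc = acc + prices[i]
    end
    return acc
end

local prices = {}
for i = 1, 10000 do prices[i] = i * 0.5 end

local start = os.clock()
for _ = 1, 1000 do
    main_loop(prices)
end
io.write(string.format("elapsed time: %.2f\n", os.clock() - start))
```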
ewmailing commented 3 years ago

I have prepared a stripped-down version of my program that reproduces this problem. I placed these files in pallene/experiments/stocks3: stringperf.tar.gz (attached)

hugomg commented 3 years ago

Thanks! Could you please clarify how long it takes to run the Lua and Pallene versions of this benchmark? I'm not sure whether this is the one where Pallene took 0.3 seconds or 1.2 seconds, or if it is yet another number.

Another thing I wonder is whether there is a way to build a smaller example. Does this behavior only appear in large Pallene files?

ewmailing commented 2 years ago

No, the timings are different for these files. This is actually a stripped-down version of my program. I originally tried to write a much simpler program from scratch, but it did not reproduce the behavior, so I had to go back to my original program and start ripping things out. This all took me a lot of time to do, but I feel the program submitted here is a reasonable compromise.

I don't think it is as bad as it seems. I believe the performance bottleneck is contained in the function GenerateBackTestPlotForChartJS. This is the function with all the .. operations; all the other stuff is just filling in stub data so this function can do work. The function also has another fake-out: there is a very long string constant of JavaScript code that makes it look scarier than it is. It is just a string constant you can skim past to find the real body of code being run. (Interesting tidbit: the string constant is so long that it triggers a C compiler warning in ISO C89/C90 mode.)
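For orientation, the shape of that function is roughly the following minimal Lua sketch (invented names and template text, not the attached code): a long hard-coded HTML/JS template plus a loop that splices in data points with the .. operator.

```lua
-- Rough sketch (hypothetical names and template text, not the attached code)
-- of the shape of GenerateBackTestPlotForChartJS: a long hard-coded HTML/JS
-- template plus a loop that splices data into it with the .. operator.
local HTML_HEADER = "<html><body><canvas id='plot'></canvas><script>\nconst points = [\n"
local HTML_FOOTER = "];\n/* ...chart setup elided... */\n</script></body></html>\n"

local function generate_plot(dates, prices)
    local body = ""
    for i = 1, #dates do
        body = body .. "{ x: '" .. dates[i] .. "', y: " .. tostring(prices[i]) .. " },\n"
    end
    return HTML_HEADER .. body .. HTML_FOOTER
end

io.write(generate_plot({"2021-11-10", "2021-11-11"}, {151.0, 153.5}))
```

Since Lua strings are immutable, each .. in that inner loop allocates a new intermediate string, so most of the work in this function is string allocation and copying rather than arithmetic.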

So the timings for these attached files on my computer are as follows:

Pallene: (made sure .lua was deleted)

```
$ ./pallenec -O2 experiments/stocks3/stringperf.pln
$ time ./vm/src/lua experiments/stocks3/strbug.lua
SavePlot elapsed time: 7.74

real    0m7.904s
user    0m4.576s
sys     0m3.328s
```

Lua: (made sure .so file was deleted)

```
$ ./pallenec --emit-lua experiments/stocks3/stringperf.pln
$ time ./vm/src/lua experiments/stocks3/strbug.lua
SavePlot elapsed time: 2.73

real    0m2.733s
user    0m2.216s
sys     0m0.516s
```

P.S. My test program here adds some extra loops to pad out the run time because my real program had a lot more data.