oracle / graalpython

A Python 3 implementation built on GraalVM

GraalPython performance is terrible when Pandas is used #334

Open subashatreya opened 1 year ago

subashatreya commented 1 year ago

I installed GraalVM 22.3.1, Python Language, and NumPy and Pandas as described in: https://www.graalvm.org/latest/reference-manual/python/

Here is my sample test.py

import pandas as pd
import time
start = time.time() * 1000
data_dict = []
for i in range(100):
    data_dict.append({
            'name': 'User'+str(i),
            'score1': i,
            'score2': i
    })

df = pd.DataFrame(data_dict)
mean = df['score1'].mean()

end = time.time() * 1000

print("Time Taken(ms): " + str(end-start))


$JAVA_HOME/languages/python/bin/python test.py
Time Taken(ms): 1139.0

If I run the same script with default CPython, it finishes in under 5 ms. The difference between 5 ms and 1139 ms is so large that the performance is not even comparable.

subashatreya commented 1 year ago

I tried a slightly different version with a loop, in an attempt to "warm up" GraalPython, but CPython is still a LOT faster. Also, note that I am not measuring "total script time"; I am measuring time after the script has already started running, just for the Pandas function calls. So, per my understanding, compilation should have already happened before I even start measuring, and hence it should not have caused this delay.

import pandas as pd
import time

data_dict = []
for i in range(100):
    data_dict.append({
            'name': 'User'+str(i),
            'score1': i,
            'score2': i
    })

for i in range(10):
    start = time.time() * 1000
    df = pd.DataFrame(data_dict)
    mean = df['score1'].mean()
    end = time.time() * 1000
    print("Time Taken(ms): " + str(end-start))

Here are the numbers from CPython:

Time Taken(ms): 4.468994140625
Time Taken(ms): 0.679443359375
Time Taken(ms): 0.52490234375
Time Taken(ms): 0.4150390625
Time Taken(ms): 0.39501953125
Time Taken(ms): 0.4150390625
Time Taken(ms): 0.385009765625
Time Taken(ms): 0.447021484375
Time Taken(ms): 0.378173828125
Time Taken(ms): 0.36279296875

And here are the numbers from GraalPython:

Time Taken(ms): 536.0
Time Taken(ms): 292.0
Time Taken(ms): 294.0
Time Taken(ms): 183.0
Time Taken(ms): 172.0
Time Taken(ms): 260.0
Time Taken(ms): 151.0
Time Taken(ms): 140.0
Time Taken(ms): 285.0
Time Taken(ms): 167.0
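As an aside, `time.time() * 1000` has coarse resolution on some platforms and can jump if the wall clock is adjusted; `time.perf_counter()` is the usual choice for micro-benchmarks because it is monotonic and high-resolution. A minimal sketch of a timing helper (the `timed_ms` name is just for illustration, not from the thread):

```python
import time

def timed_ms(fn, *args):
    # Time a single call using a monotonic, high-resolution clock
    # and return both the call's result and the elapsed milliseconds.
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

result, ms = timed_ms(sum, range(1000))
print(result, ms)
```

This does not change the conclusions here (the gap is hundreds of milliseconds, far above clock resolution), but it makes sub-millisecond CPython numbers more trustworthy.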

subashatreya commented 1 year ago

Third attempt to warm up GraalPython. I added sleep() inside the loop and made the loop run 100 times instead of 10.

import pandas as pd
import time

data_dict = []
for i in range(100):
    data_dict.append({
            'name': 'User'+str(i),
            'score1': i,
            'score2': i
    })

for i in range(100):
    start = time.time() * 1000
    df = pd.DataFrame(data_dict)
    mean = df['score1'].mean()
    end = time.time() * 1000
    print("Time Taken(ms): " + str(end-start))
    time.sleep(1)

The numbers now are slightly better, but still far worse than CPython:

Time Taken(ms): 16.0
Time Taken(ms): 18.0
Time Taken(ms): 19.0
Time Taken(ms): 18.0
Time Taken(ms): 17.0
Time Taken(ms): 16.0
Time Taken(ms): 19.0
Time Taken(ms): 20.0
Time Taken(ms): 18.0
Time Taken(ms): 20.0

Here are the numbers from CPython:

Time Taken(ms): 0.679443359375
Time Taken(ms): 0.52490234375
Time Taken(ms): 0.4150390625
Time Taken(ms): 0.39501953125
Time Taken(ms): 0.4150390625
Time Taken(ms): 0.385009765625
Time Taken(ms): 0.447021484375
Time Taken(ms): 0.378173828125
Time Taken(ms): 0.36279296875

timfel commented 1 year ago

Yes, this is about what is to be expected from this benchmark. There is a lot of boundary crossing between Python code and C code. If this is truly representative of what your workload does, GraalPy has little chance of beating CPython here.

timfel commented 1 year ago

For reference, I modified your script as such:

import pandas as pd
import time

def run():
    data_dict = []
    for i in range(100):
        data_dict.append({
            'name': 'User'+str(i),
            'score1': i,
            'score2': i
        })
    df = pd.DataFrame(data_dict)
    mean = df['score1'].mean()
    return mean

total = 0
while True:
    for i in range(50):
        start = time.time() * 1000
        run()
        end = time.time() * 1000
    total += 50
    print(total, ": 50 iterations took ", end - start, "ms")

CPython averages around 1ms/50 iterations here for me. GraalPy 22.3 takes quite some time to warm up. After about 7000 iterations it is pretty steady for me at ~24ms/50 iterations. GraalPy 23.0 with our new native execution mode warms up faster, but peaks at 33ms/50 iterations after 400 iterations total.

Overall, this workload is simply too small for us to offset the additional data copies and boundary crossings we incur for each element of the dict when creating the pandas frame.
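One way to reduce per-element boundary crossings, at least during frame construction, is to build the DataFrame from whole columns instead of 100 small dicts. Column-wise construction is standard pandas usage; whether it meaningfully reduces GraalPy's boundary-crossing cost in this workload is an assumption that would need measuring:

```python
import pandas as pd

n = 100
# Columnar construction: one sequence per column, rather than a list of
# n small dicts, so pandas receives the data as three bulk sequences.
df = pd.DataFrame({
    "name": ["User" + str(i) for i in range(n)],
    "score1": list(range(n)),
    "score2": list(range(n)),
})
mean = df["score1"].mean()
```

The resulting frame is identical to the one built from the list of dicts, so it drops into the benchmark unchanged.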

subashatreya commented 1 year ago

Well, my actual application workload is a lot more complex, but I just used this workload as a simple reproducible example.

I can try running my workload with a few thousand dummy values to see if that helps it warm up. My requirement is not necessarily to beat CPython, but to execute one Pandas-based computation within 10 ms. The number I am observing now is about 400 ms; it must come down to about 10 ms for the application to be practically viable.

Also, in this experiment, I called $JAVA_HOME/languages/python/bin/python directly. However, my real application is a Java-based server application running on GraalVM, with the Python code embedded as a polyglot guest language. So there is even more data conversion between the Python guest and the Java host.

timfel commented 1 year ago

If you can, I would recommend you run your actual workload with --cpusampler to see where time is actually spent. Then I could check if that is dominated by something we can fix or not.
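For reference, the CPU sampler is a standard Truffle tool and is enabled via a launcher flag; the exact option names can differ between GraalVM versions, so it is worth checking the tools help output. A sketch using the launcher path from this thread:

```shell
# List the available Truffle tool options for this GraalVM build
$JAVA_HOME/languages/python/bin/python --help:tools

# Run the benchmark with the CPU sampler enabled (prints a profile on exit)
$JAVA_HOME/languages/python/bin/python --cpusampler test.py

# Optionally choose an output format, e.g. a per-call-target histogram
$JAVA_HOME/languages/python/bin/python --cpusampler --cpusampler.Output=histogram test.py
```

The resulting profile shows how much time is spent in Python code versus native (C extension) call targets, which is exactly the split in question here.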

timfel commented 1 year ago

Ah, I just found something. We actually have a deopt loop in this benchmark - the following two deopts repeat 100s of times per second:

[Deoptimization initiated
    name: PyErr_GetExcInfo (SubstrateCompilation-6793)
    sp: 0x7ffe1e11f440  ip: 0x7f669fbec397
    reason: TransferToInterpreter  action: InvalidateReprofile
    debugId: 13  speculation: jdk.vm.ci.meta.SpeculationLog$NoSpeculationReason@c2584d3
    stack trace that triggered deoptimization:
        at com.oracle.graal.python.nodes.util.ExceptionStateNodes$GetCaughtExceptionNode.executeFromNative(ExceptionStateNodes.java:125)
        at com.oracle.graal.python.builtins.modules.cext.PythonCextErrBuiltins$PyTruffleErr_GetExcInfo.info(PythonCextErrBuiltins.java:364)

and

[Deoptimization initiated
    name: PyTruffle_ByteArrayToNative (SubstrateCompilation-6794)
    sp: 0x7ffe1e11ea20  ip: 0x7f669fbd3b33
    reason: BoundsCheckException  action: InvalidateReprofile
    debugId: 17  speculation: jdk.vm.ci.meta.SpeculationLog$NoSpeculationReason@c2584d3
    stack trace that triggered deoptimization:
        at com.oracle.truffle.host.HostObject$ArrayGet.doByte(HostObject.java:3268)
        at com.oracle.truffle.host.HostObjectFactory$ArrayGetNodeGen$Inlined.execute(HostObjectFactory.java:473)
        at com.oracle.truffle.host.HostObject$ReadArrayElement.doArray(HostObject.java:800)
        at com.oracle.truffle.host.HostObjectGen$InteropLibraryExports$Cached.readArrayElement(HostObjectGen.java:2304)
        at com.oracle.truffle.api.interop.InteropLibraryGen$CachedDispatch.readArrayElement(InteropLibraryGen.java:8819)
        at com.oracle.truffle.llvm.runtime.nodes.intrinsics.interop.LLVMPolyglotRead$LLVMPolyglotGetArrayElement.doIntrinsic(LLVMPolyglotRead.java:124)

I will try to take a look (/cc fyi @fangerer)

subashatreya commented 1 year ago

There is a small bug in the code @timfel posted above: it still measures the time for one iteration, not 50 iterations as intended. The start and end time measurements must be placed outside the inner loop.

Here is the new version of the script. I also added some more statements to the run() function to better match my real-world workload:


import pandas as pd
import pandas_ta as ta
import time

def run():
    data_dict = []
    for i in range(100):
        data_dict.append({
            'name': 'User' + str(i),
            'score1': i,
            'score2': i
        })
    df = pd.DataFrame(data_dict)
    sma = ta.sma(df["score1"], length=10)
    ema = ta.ema(df['score1'], length=10)
    lastSma = sma.iat[-1].item()
    lastEma = ema.iat[-1].item()
    return (lastSma + lastEma) / 2

total = 0
while True:
    start = time.time() * 1000
    for i in range(50):
        run()
    end = time.time() * 1000
    total += 50
    print(total, ": 50 iterations took ", end - start, "ms")

First, here is the result for CPython:

3150 : 50 iterations took 49.900146484375 ms
3200 : 50 iterations took 49.91796875 ms
3250 : 50 iterations took 56.847900390625 ms
3300 : 50 iterations took 51.123779296875 ms
3350 : 50 iterations took 49.0888671875 ms
3400 : 50 iterations took 44.664794921875 ms
3450 : 50 iterations took 49.506591796875 ms
3500 : 50 iterations took 47.31103515625 ms
3550 : 50 iterations took 52.6630859375 ms
3600 : 50 iterations took 45.040771484375 ms
3650 : 50 iterations took 43.5322265625 ms
3700 : 50 iterations took 45.740966796875 ms
3750 : 50 iterations took 48.583251953125 ms
3800 : 50 iterations took 48.193115234375 ms
3850 : 50 iterations took 43.912109375 ms
3900 : 50 iterations took 43.459716796875 ms
3950 : 50 iterations took 54.65380859375 ms
4000 : 50 iterations took 51.01416015625 ms
4050 : 50 iterations took 47.460693359375 ms

Now, here is the result for GraalPy after 14,000 warm-up iterations:

14450 : 50 iterations took 2208.0 ms
14500 : 50 iterations took 2091.0 ms
14550 : 50 iterations took 2121.0 ms
14600 : 50 iterations took 2016.0 ms
14650 : 50 iterations took 2105.0 ms
14700 : 50 iterations took 2121.0 ms
14750 : 50 iterations took 8746.0 ms
14800 : 50 iterations took 1976.0 ms
14850 : 50 iterations took 2429.0 ms
14900 : 50 iterations took 2044.0 ms
14950 : 50 iterations took 2225.0 ms
15000 : 50 iterations took 2146.0 ms

Then I also ran the same script from Java, which is what really happens in my app.


    public void testPythonFunction() throws Exception{
        String pythonCode = """
    import pandas as pd
    import pandas_ta as ta
    import polyglot
    def run():
        data_dict = []
        for i in range(100):
            data_dict.append({
                'name': 'User' + str(i),
                'score1': i,
                'score2': i
            })
        df = pd.DataFrame(data_dict)
        sma = ta.sma(df["score1"], length=5)
        ema = ta.ema(df['score1'], length=5)
        lastSma = sma.iat[-1].item()
        lastEma = ema.iat[-1].item()
        return (lastSma + lastEma) / 2
    polyglot.export_value("run", run)
                """;

        try (Context context = Context.newBuilder()
                .allowAllAccess(true)
                .option("python.Executable", VENV_EXECUTABLE)
                .option("python.ForceImportSite", "true")
                .build()) {

            long parsingStart = System.currentTimeMillis();
            Source source = Source.newBuilder("python", pythonCode, "script.py").build();
            Value func = context.eval(source);
            long parsingEnd = System.currentTimeMillis();
            System.out.println("Parsing Script took: " + (parsingEnd - parsingStart) + " ms" );

            int total = 0;
            while(true) {

                long start = System.currentTimeMillis();
                for(int i=0; i < 50; i++) {
                    func.execute().asDouble();                  
                }
                long end = System.currentTimeMillis();
                total += 50;
                System.out.println(total+ ": 50 iterations took " + (end - start) + " ms");

            }

        }
    }

These are the results from Java after 7,000 warm-up iterations:

7000: 50 iterations took 4978 ms
7050: 50 iterations took 4552 ms
7100: 50 iterations took 4382 ms
7150: 50 iterations took 4660 ms
7200: 50 iterations took 7241 ms
7250: 50 iterations took 4610 ms
7300: 50 iterations took 4784 ms
7350: 50 iterations took 4697 ms
7400: 50 iterations took 4496 ms
7450: 50 iterations took 4707 ms
7500: 50 iterations took 4100 ms
7550: 50 iterations took 4267 ms
7600: 50 iterations took 4208 ms
7650: 50 iterations took 4196 ms
7700: 50 iterations took 4125 ms
7750: 50 iterations took 4187 ms
7800: 50 iterations took 3847 ms
7850: 50 iterations took 4319 ms
7900: 50 iterations took 3847 ms
7950: 50 iterations took 3850 ms
8000: 50 iterations took 4116 ms

So if we compare time from Java (about 4000ms) to CPython (40ms), that is about 100 times slower.

In addition, the time for initial code parsing and context initialization was also much longer than expected:

Parsing Script took: 139267 ms

timfel commented 1 year ago

Yes, I suspect that with this benchmark we won't get much better than what I had above (20-30x slower than CPython; maybe with some work it will be only 10x slower). This particular benchmark is dominated by copying data from Python to native memory.

subashatreya commented 1 year ago

Thanks for the comments. Please note that when observed from Java, it was 100x slower than CPython, not just 20-30x slower: the time for CPython was about 40 ms, while the time from Java was about 4000 ms.

Also, I could not completely understand why this benchmark is dominated by copying data from Python to native memory. The code above calculates a simple moving average (SMA) and an exponential moving average (EMA) with a rolling window of 5 data points on a Pandas dataframe with 100 rows. Besides creating the dummy data and loading it into the Pandas dataframe, I don't see data being copied anywhere else in the code.

What would be the example of code that is not dominated by copying data from Python to native memory?

Anyway, let me know when there is a new build to try this again. In the meantime, I will have to start exploring other ways to run user-supplied, untrusted Python code in a Java server app.

timfel commented 1 year ago

Also, I could not completely understand why this benchmark is dominated by copying data from Python to native memory

The calculation in this benchmark is nothing compared to loading all the data into a pandas dataframe on GraalPy. I measured the time around creating the dummy data, creating the dataframe, and then calculating the mean.

On GraalPy:
Creating data took: 0.0 ms
Creating data frame took: 62.00003623962402 ms
Calculating mean took 3.999948501586914 ms
1000 : 50 iterations took  2965.0 ms

On Python:
Creating data took: 0.05245208740234375 ms
Creating data frame took: 0.6833076477050781 ms
Calculating mean took 0.34689903259277344 ms
1000 : 50 iterations took  60.477294921875 ms

While we are indeed slower on everything involving native code here, the largest hit for us is creating the dataframe. That's because all of the data is copied from Java into native memory, and then copied again by pandas. CPython only copies from the native CPython objects into pandas.
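The per-phase measurement described above can be sketched as follows (the print labels match the output shown; the choice of `time.perf_counter` for the millisecond clock is mine):

```python
import time
import pandas as pd

def ms():
    # Milliseconds from a monotonic, high-resolution clock.
    return time.perf_counter() * 1000

t0 = ms()
data_dict = [{"name": "User" + str(i), "score1": i, "score2": i}
             for i in range(100)]
t1 = ms()
df = pd.DataFrame(data_dict)
t2 = ms()
mean = df["score1"].mean()
t3 = ms()

print("Creating data took:", t1 - t0, "ms")
print("Creating data frame took:", t2 - t1, "ms")
print("Calculating mean took", t3 - t2, "ms")
```

Instrumenting each phase this way is what separates the dataframe-construction cost from the actual computation in the numbers quoted above.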