scala-interop / slinc

Scala <-> C interop
GNU Affero General Public License v3.0

question: slinc is about 3 times slower than JNI (when using OpenJDK 17). Is this expected performance? #81

Open i10416 opened 1 year ago

i10416 commented 1 year ago

Hello. I ran a small comparative benchmark between slinc and JNI, and the result shows slinc is about 3 times slower than JNI. Is this the expected performance? I guess the slinc (or Panama) abstraction is not free, and I have heard that struct allocation in Panama carries some performance overhead, so I assume this overhead is expected, but I would like to hear the author's opinion to be sure.

context:

src:

Benchmark               Mode  Cnt      Score      Error  Units
NativeBenchmarks.jni    avgt    5   5064.292 ±  593.829  ns/op
NativeBenchmarks.slinc  avgt    5  16882.792 ± 1172.054  ns/op

markehammons commented 1 year ago

I haven't had a good comparison with JNI, so I can't say for sure. However, one thing I note is that your JNI implementation doesn't seem to handle deallocation at all, while the Slinc code does because of the confined Scope. Using Scope.global would give an effect similar to what's going on in the JNI version.

That being said, it's possible there are more effective ways to implement the Slinc code to get closer to JNI performance. If you'd like to contribute some JNI benchmarks to the project, I'd appreciate it!

i10416 commented 1 year ago

Thank you for the feedback.

> your code in the JNI implementation doesn't seem to handle deallocation at all

Ah, that's a good point. I skipped deallocation 😰 I will look into it.

> you'd like to contribute some JNI benchmarks to the project I'd appreciate it!

I'm happy to contribute JNI benchmarks, but I'm concerned that the benchmark workflow will get messy, as JNI requires building a native lib. In addition, I usually use sbt for my builds, so it will take a while to translate the sbt build into a mill build and make a PR.

i10416 commented 1 year ago

Panama competes with JNI or even outperforms it in some situations, as shown in this talk (https://www.youtube.com/watch?v=4xFV-A7JToY), so I think (hope) it is possible to improve performance.

markehammons commented 1 year ago

> I'm happy to contribute JNI benchmarks, but I'm concerned that the benchmark workflow will get messy, as JNI requires building a native lib. In addition, I usually use sbt for my builds, so it will take a while to translate the sbt build into a mill build and make a PR.

I'm already doing this in some capacity for my tests, so it's not a huge issue. I'm not too worried about it overcomplicating things. If you want, we can meet on Google Meet and I can show you how we can extend mill to build the C++ part.

markehammons commented 1 year ago

> Panama competes with JNI or even outperforms it in some situations, as shown in this talk (https://www.youtube.com/watch?v=4xFV-A7JToY), so I think (hope) it is possible to improve performance.

It should be possible, and one way will be to drop the usage of MethodHandleFacade, a shim I put in place while Scala 3 didn't officially support MethodHandle.invoke. Now that Scala 3 does support these methods, I should be able to get better performance by using them directly.
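As a rough illustration of why going through a shim can cost more than calling a handle directly, here is a minimal Java sketch. It does not reproduce MethodHandleFacade, whose implementation is not shown in this thread; `DirectVsShim`, `add`, `viaShim`, and `direct` are hypothetical names, and `add` is only a stand-in for a native function. The assumption is that a generic shim ends up on an `Object...` style entry point, while a direct call site carries the exact primitive signature.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class DirectVsShim {
    // Stand-in for a native downcall target (hypothetical; a real binding would point at a C function).
    static int add(int a, int b) {
        return a + b;
    }

    static final MethodHandle ADD;
    static {
        try {
            ADD = MethodHandles.lookup().findStatic(
                DirectVsShim.class, "add",
                MethodType.methodType(int.class, int.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Hypothetical shim: uses the generic Object... entry point,
    // so arguments and the result are boxed on every call.
    static int viaShim(int a, int b) {
        try {
            return (Integer) ADD.invokeWithArguments(a, b);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    // Direct signature-polymorphic call: the call site has the exact type (int, int)int.
    static int direct(int a, int b) {
        try {
            return (int) ADD.invokeExact(a, b);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(viaShim(2, 3));
        System.out.println(direct(2, 3));
    }
}
```

Both paths return the same value; the difference is only in how the call is dispatched, which is what a benchmark would need to measure.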

There are other things to do too, but right now the current version of Slinc is probably going to be slower. I'm currently reworking it to be better designed, less complex, and more suitable for building libraries that can be loaded by users on Java 17, 18, 19, or whatever. Part of that process is me giving up on trying to do compile-time optimization. Where I'm hoping to gain performance back is JITC powered by runtime multi-stage compilation.

i10416 commented 1 year ago

> we can meet on google meet and I can show you how we can extend mill to do the build of the C++ part.

That's great. I live in Japan now, but I plan to travel to the EU next week, so meeting next week or later would be convenient in terms of time zones. (I guess you are in the EU from your GitHub profile and fr domain.) Thanks a lot.

By the way, https://github.com/scala-cli/libsodiumjni seems to be a good example of using JNI with mill, so I'll take a look at it for now to learn mill.

i10416 commented 1 year ago

With Java 19, SlinC is nearly as fast as JNI 😉!

Benchmark               Mode  Cnt     Score     Error  Units
NativeBenchmarks.jni    avgt    5  4872.056 ±  57.582  ns/op
NativeBenchmarks.slinc  avgt    5  5607.126 ± 115.210  ns/op

i10416 commented 1 year ago

I added a simpler benchmark, sorting 1,000,000 elements with qsort, that upcalls a JVM method from native code. It seems upcalls have a large overhead even with JNI. I couldn't find out why SlinC (or the foreign API) takes 5 times longer than JNI.

JVM: OpenJDK Runtime Environment Zulu19.30+11-CA (build 19.0.1+10)

Benchmark                                             Mode  Cnt        Score        Error  Units
SimpleNativeCallBenchmarks.jniNativeQSort             avgt    5     4113.280 ±    184.594  ns/op
SimpleNativeCallBenchmarks.jniQSort                   avgt    5   281968.369 ±   4070.398  ns/op
SimpleNativeCallBenchmarks.slincQSortWithCopyBack     avgt    5  1609949.152 ± 429499.499  ns/op
SimpleNativeCallBenchmarks.slincQSortWithoutCopyBack  avgt    5  1574451.526 ± 378398.468  ns/op

jniNativeQSort: using native comparator
jniQSort: using upcall comparator, destructively mutate original array
slincQSortWithCopyBack: using upcall comparator, copy and transfer array
slincQSortWithoutCopyBack: using upcall comparator, copy and transfer array, discarding result

https://github.com/i10416/bench#qsort-benchmark

markehammons commented 1 year ago

What we can try, and what I don't have available at the moment, is creating an upcall from a method rather than a lambda. The way the foreign API suggests creating an upcall is targeting a method, but I used lambdas instead for ease of use.
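To make the method-versus-lambda distinction concrete, here is a small Java sketch using only `java.lang.invoke`. It leaves out the foreign API itself, since that API differs between Java 17, 18, and 19. `UpcallTargets` and `compareInts` are hypothetical names, and a real qsort comparator would receive two native pointers rather than two ints. A method target yields a handle that points straight at the method, while a lambda target is reached through the functional interface method with the lambda object bound as receiver.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.function.IntBinaryOperator;

public class UpcallTargets {
    static final MethodType CMP = MethodType.methodType(int.class, int.class, int.class);

    // Hypothetical comparator used as the upcall target.
    static int compareInts(int a, int b) {
        return Integer.compare(a, b);
    }

    // Method target: the handle refers directly to the static method.
    static MethodHandle fromMethod() {
        try {
            return MethodHandles.lookup().findStatic(UpcallTargets.class, "compareInts", CMP);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Lambda target: look up the interface method, then bind the lambda object as its receiver.
    static MethodHandle fromLambda(IntBinaryOperator f) {
        try {
            return MethodHandles.lookup()
                .findVirtual(IntBinaryOperator.class, "applyAsInt", CMP)
                .bindTo(f);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Both handles end up with the same type, (int, int)int, so one call site serves both.
    static int call(MethodHandle h, int a, int b) {
        try {
            return (int) h.invokeExact(a, b);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(call(fromMethod(), 1, 2));
        System.out.println(call(fromLambda((a, b) -> Integer.compare(a, b)), 2, 1));
    }
}
```

Whether the extra indirection of the lambda path explains any of the measured gap is not established in this thread; it would have to be benchmarked.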

markehammons commented 1 year ago

Another thing is that I think your bench is doing a lot of extra work in Slinc. I notice that for each call you recreate the upcall, use it, then toss it away. Upcall creation is expensive, and I don't think the JNI version is recreating its upcall binding for each iteration.

Can you try allocating the upcall in a static location (not in the benchmark loop) using Scope.global?
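The shape of that change can be sketched in plain Java without any native code. In this sketch `createBinding` is a hypothetical stand-in for upcall creation (the actual Slinc call is not shown in this thread), and a counter records how often it runs under each strategy.

```java
import java.util.Arrays;
import java.util.Comparator;

public class HoistSetup {
    static int bindingsCreated = 0;

    // Stand-in for creating an upcall binding; assumed to be the expensive step.
    static Comparator<Integer> createBinding() {
        bindingsCreated++;
        return Integer::compare;
    }

    // Per-iteration setup: every measured iteration also pays for binding creation.
    static void sortRecreating(Integer[] data, int iterations) {
        for (int i = 0; i < iterations; i++) {
            Integer[] copy = data.clone();
            Arrays.sort(copy, createBinding());
        }
    }

    // Hoisted setup: the binding is created once and reused by every iteration.
    static void sortReusing(Integer[] data, int iterations) {
        Comparator<Integer> binding = createBinding();
        for (int i = 0; i < iterations; i++) {
            Integer[] copy = data.clone();
            Arrays.sort(copy, binding);
        }
    }

    public static void main(String[] args) {
        Integer[] data = {3, 1, 2};
        sortRecreating(data, 100);
        System.out.println("recreating: " + bindingsCreated + " bindings");
        bindingsCreated = 0;
        sortReusing(data, 100);
        System.out.println("reusing: " + bindingsCreated + " bindings");
    }
}
```

In a JMH benchmark the equivalent move would be to create the binding in benchmark state setup rather than inside the measured method.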

markehammons commented 1 year ago

Having cloned your bench and allocated the callback once (rather than per benchmark iteration), I see an improvement in the performance of Slinc's upcall code to just 2x slower than JNI, rather than 5x slower. I think there may be more performance improvements to be found, but first I should make it possible to generate an upcall from a method rather than a lambda and see what the performance of that looks like.

i10416 commented 1 year ago

Thank you for the feedback!

> I see an improvement in the performance of Slinc's upcall code to just 2x slower than JNI, rather than 5x slower.

Oh! That's significant!

i10416 commented 1 year ago

https://github.com/i10416/bench/commit/22323c91cfcd71066194b10f5d04b5d8cec05ea6

JFYI:

Hi, I can reproduce your performance improvement on my local machine by pre-allocating the upcall! Thanks.