question: slinc is about 3 times slower than jni (when using OpenJDK 17). Is this expected performance?

i10416 commented 1 year ago

Hello. I ran a small benchmark comparing slinc with JNI, and the result shows that slinc is about 3 times slower than JNI. Is this the expected performance? I guess the slinc (or Panama) abstraction is not free, and I have heard that struct allocation in Panama carries some performance overhead, so I assume this slowdown is expected, but I would like to hear the author's opinion to be sure.

context:

Scala 3.2.2
JVM: JDK 17.0.3, OpenJDK 64-Bit Server VM, 17.0.3+7-LTS
slinc: 0.1.1-110-7863cb
Apple clang version 13.1.6 (clang-1316.0.21.2.5)

src:

benchmark : https://github.com/i10416/bench/blob/main/bench/src/test/scala/ctimeSlincBenchmark.scala
ffi classes: https://github.com/i10416/bench/tree/main/core/src/main/scala

Benchmark	Mode	Cnt	Score	Error	Units
NativeBenchmarks.jni	avgt	5	5064.292	± 593.829	ns/op
NativeBenchmarks.slinc	avgt	5	16882.792	± 1172.054	ns/op

markehammons commented 1 year ago

I haven't had a good comparison with JNI, so I can't say for sure. However, one thing I notice is that the code in your JNI implementation doesn't seem to handle deallocation at all, while the Slinc code does because of the confined Scope. Using Scope.global would give an effect similar to what the JNI version is doing.

That being said, it's possible there are more effective ways to implement the Slinc code to get closer to JNI performance. If you'd like to contribute some JNI benchmarks to the project, I'd appreciate it!

i10416 commented 1 year ago

Thank you for the feedback.

> your code in the JNI implementation doesn't seem to handle deallocation at all

Ah, that's a good point. I skipped deallocation 😰 I will investigate it.

> you'd like to contribute some JNI benchmarks to the project I'd appreciate it!

I'm happy to contribute JNI benchmarks, but I'm concerned that the benchmark workflow will get messy because JNI requires building a native library. In addition, I usually use sbt for my builds, so it will take a while to translate the sbt build into a mill one and make a PR.

i10416 commented 1 year ago

Panama is competitive with JNI, or even outperforms it in some situations, as shown in this talk (https://www.youtube.com/watch?v=4xFV-A7JToY), so I think (hope) it is possible to improve performance.

markehammons commented 1 year ago

> I'm happy to contribute JNI benchmarks, but I'm concerned that the benchmark workflow will get messy because JNI requires building a native library. In addition, I usually use sbt for my builds, so it will take a while to translate the sbt build into a mill one and make a PR.

I'm already doing this in some capacity for my tests, so it's not a huge issue, and I'm not too worried about it overcomplicating things. If you want, we can meet on Google Meet and I can show you how to extend mill to build the C++ part.

markehammons commented 1 year ago

> Panama is competitive with JNI, or even outperforms it in some situations, as shown in this talk (https://www.youtube.com/watch?v=4xFV-A7JToY), so I think (hope) it is possible to improve performance.

It should be possible, and one way will be to drop the usage of MethodHandleFacade, a shim I put in place while Scala 3 didn't officially support MethodHandle.invoke. Now that Scala 3 does support these methods, I should be able to get better performance by using them directly.
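As a rough illustration of the difference described above, the following Java sketch contrasts calling a MethodHandle through a generic, boxing wrapper with calling it directly via invokeExact. The internals of MethodHandleFacade are not shown in this thread, so `viaShim` is only a hypothetical stand-in for a shim of that kind, and `add` is a placeholder target rather than a real native downcall.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

class DirectInvoke {
    // Placeholder target; in Slinc this would be a handle to a native function.
    static int add(int a, int b) {
        return a + b;
    }

    static final MethodHandle ADD;

    static {
        try {
            ADD = MethodHandles.lookup().findStatic(
                DirectInvoke.class, "add",
                MethodType.methodType(int.class, int.class, int.class));
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Hypothetical stand-in for a shim: arguments and result are boxed,
    // and the call goes through the generic invokeWithArguments path.
    static Object viaShim(Object... args) {
        try {
            return ADD.invokeWithArguments(args);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    // Direct call: the call-site signature (int, int) -> int matches the
    // handle's type exactly, so no boxing or argument adaptation is needed.
    static int direct(int a, int b) {
        try {
            return (int) ADD.invokeExact(a, b);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(viaShim(1, 2));
        System.out.println(direct(1, 2));
    }
}
```

Both paths return the same value. How much of Slinc's overhead actually comes from such a shim is not measured here.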

There are other things to do too, but for now the current version of Slinc is probably going to be slower. I'm currently reworking it to be better designed, less complex, and more suitable for building libraries that users can load on Java 17, 18, 19, or whatever comes next. Part of that process is giving up on trying to do compile-time optimization. Where I'm hoping to gain performance back is JIT compilation powered by runtime multi-stage compilation.

i10416 commented 1 year ago

> we can meet on Google Meet and I can show you how we can extend mill to do the build of the C++ part.

That's great. I live in Japan now, but I plan to travel to the EU next week, so meeting next week or later would be convenient in terms of time zones. (I guess you are in the EU, judging from your GitHub profile and .fr domain.) Thanks a lot.

By the way, https://github.com/scala-cli/libsodiumjni looks like a good example of using JNI with mill, so I'll take a look at it for now to learn mill.

i10416 commented 1 year ago

With Java 19, SlinC is nearly as fast as JNI 😉!

JVM: OpenJDK Runtime Environment Zulu19.30+11-CA (build 19.0.1+10)

Benchmark	Mode	Cnt	Score	Error	Units
NativeBenchmarks.jni	avgt	5	4872.056	± 57.582	ns/op
NativeBenchmarks.slinc	avgt	5	5607.126	± 115.210	ns/op

i10416 commented 1 year ago

I added a simpler benchmark, sorting 1,000,000 elements with qsort, that upcalls a JVM method from native code. It seems upcalls have a large overhead even with JNI. I couldn't figure out why SlinC (or the foreign API) takes 5 times longer than JNI.

JVM: OpenJDK Runtime Environment Zulu19.30+11-CA (build 19.0.1+10)

Benchmark	Description	Mode	Cnt	Score	Error	Units
SimpleNativeCallBenchmarks.jniNativeQSort	using native comparator	avgt	5	4113.280	± 184.594	ns/op
SimpleNativeCallBenchmarks.jniQSort	using upcall comparator, destructively mutate original array	avgt	5	281968.369	± 4070.398	ns/op
SimpleNativeCallBenchmarks.slincQSortWithCopyBack	using upcall comparator, copy and transfer array	avgt	5	1609949.152	± 429499.499	ns/op
SimpleNativeCallBenchmarks.slincQSortWithoutCopyBack	using upcall comparator, copy and transfer array, discarding result	avgt	5	1574451.526	± 378398.468	ns/op

https://github.com/i10416/bench#qsort-benchmark
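The slowdown factors quoted in this thread can be rechecked from the reported scores. The small Java sketch below simply divides the slinc score by the JNI score for each run. The class and method names are illustrative only, and the calculation ignores the error margins, which are large for the slinc upcall runs.

```java
import java.util.Locale;

class Slowdown {
    // Ratio of two average times per operation (ns/op); values above 1 mean the candidate is slower.
    static double factor(double candidateNsPerOp, double baselineNsPerOp) {
        return candidateNsPerOp / baselineNsPerOp;
    }

    public static void main(String[] args) {
        // Scores copied from the JMH tables posted in this thread.
        System.out.printf(Locale.ROOT, "JDK 17, slinc vs jni: %.2fx%n",
            factor(16882.792, 5064.292));
        System.out.printf(Locale.ROOT, "JDK 19, slinc vs jni: %.2fx%n",
            factor(5607.126, 4872.056));
        System.out.printf(Locale.ROOT, "JDK 19, qsort upcall with copy back: %.2fx%n",
            factor(1609949.152, 281968.369));
        System.out.printf(Locale.ROOT, "JDK 19, qsort upcall without copy back: %.2fx%n",
            factor(1574451.526, 281968.369));
    }
}
```

This gives roughly 3.3x on JDK 17, 1.15x on JDK 19, and between 5.6x and 5.7x for the upcall-based qsort runs compared with jniQSort.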

markehammons commented 1 year ago

What we can try, and what I don't have available at the moment, is creating an upcall from a method rather than a lambda. The foreign API suggests creating an upcall by targeting a method, but I used lambdas instead for ease of use.
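To make the method-versus-lambda distinction concrete, here is a minimal Java sketch that uses only `java.lang.invoke`. It shows just the first step, obtaining a MethodHandle for the callback, and does not create an actual upcall stub, since that part of the foreign API differs between JDK versions. The `compare` method and the use of `IntBinaryOperator` are assumptions for illustration; a real qsort comparator receives two pointers rather than two ints.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.function.IntBinaryOperator;

class ComparatorHandles {
    // Placeholder comparator; ints keep the sketch small.
    static int compare(int a, int b) {
        return Integer.compare(a, b);
    }

    static final MethodType CMP_TYPE =
        MethodType.methodType(int.class, int.class, int.class);

    // Handle that targets a static method directly.
    static MethodHandle fromMethod() {
        try {
            return MethodHandles.lookup()
                .findStatic(ComparatorHandles.class, "compare", CMP_TYPE);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Handle that targets a lambda: look up the interface method, then bind
    // the lambda instance as the receiver. Calls go through interface dispatch.
    static MethodHandle fromLambda(IntBinaryOperator op) {
        try {
            return MethodHandles.lookup()
                .findVirtual(IntBinaryOperator.class, "applyAsInt", CMP_TYPE)
                .bindTo(op);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    static int call(MethodHandle handle, int a, int b) {
        try {
            return (int) handle.invokeExact(a, b);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(call(fromMethod(), 1, 2));
        System.out.println(call(fromLambda((a, b) -> Integer.compare(a, b)), 2, 1));
    }
}
```

Both handles end up with the same type, `(int, int) -> int`, so either could be passed on to stub creation. Whether the extra indirection of the lambda-based handle explains any of the measured gap is not established in this thread.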

markehammons commented 1 year ago

Another thing is that I think your bench is doing a lot of extra work in Slinc. I notice that for each call you recreate the upcall, use it, then toss it away. Upcall creation is expensive, and I don't think the JNI version is recreating its upcall binding for each iteration.

Can you try allocating the upcall in a static location (not in the benchmark loop) using Scope.global?
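The structural change being suggested can be sketched in plain Java as follows. `FakeUpcall` is a hypothetical stand-in that only counts how many callback objects are created; it is not Slinc's or Panama's API, and it does not model the real cost of creating an upcall stub.

```java
import java.util.function.IntBinaryOperator;

class UpcallReuse {
    // Hypothetical stand-in for an upcall stub; counts constructions.
    static final class FakeUpcall {
        static int created = 0;
        final IntBinaryOperator target;

        FakeUpcall(IntBinaryOperator target) {
            this.target = target;
            created++;
        }
    }

    // Pattern seen in the original benchmark: a new callback per iteration.
    static int perIteration(int iterations) {
        FakeUpcall.created = 0;
        for (int i = 0; i < iterations; i++) {
            FakeUpcall callback = new FakeUpcall(Integer::compare);
            callback.target.applyAsInt(i, 0);
        }
        return FakeUpcall.created;
    }

    // Suggested pattern: create the callback once, outside the measured loop.
    static int hoisted(int iterations) {
        FakeUpcall.created = 0;
        FakeUpcall callback = new FakeUpcall(Integer::compare);
        for (int i = 0; i < iterations; i++) {
            callback.target.applyAsInt(i, 0);
        }
        return FakeUpcall.created;
    }

    public static void main(String[] args) {
        System.out.println("per iteration: " + perIteration(1000) + " callbacks created");
        System.out.println("hoisted: " + hoisted(1000) + " callback created");
    }
}
```

In a JMH benchmark the same effect would typically be achieved by creating the callback in a setup method or a static field rather than inside the benchmark method.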

markehammons commented 1 year ago

Having cloned your bench and changed it so the callback is allocated once (rather than per benchmark iteration), I see an improvement in the performance of Slinc's upcall code to just 2x slower than JNI, rather than 5x slower. I think there may be more performance improvements to be found, but first I should make it possible to generate an upcall from a method rather than a lambda and see what the performance from that looks like.

i10416 commented 1 year ago

Thank you for the feedback!

> I see an improvement in the performance of Slinc's upcall code to just 2x slower than JNI, rather than 5x slower.

Oh! That's significant!

i10416 commented 1 year ago

https://github.com/i10416/bench/commit/22323c91cfcd71066194b10f5d04b5d8cec05ea6

JFYI:

Hi, I can reproduce your performance improvement on my local machine by pre-allocating the upcall! Thanks.

scala-interop / slinc, issue #81