morazow / java-simd-benchmarks

Java Vector API Benchmarks
MIT License
1 stars 0 forks source link

Fix benchmarks & SWAR impl #1

Open franz1981 opened 4 months ago

franz1981 commented 4 months ago

Try fixing the SWAR implementations by re-using what is done at https://github.com/netty/netty/pull/10737

(including the micro-benchmarks)

franz1981 commented 4 months ago

For reference: you can use https://github.com/elastic/elasticsearch/blob/1b7f3a0eae0e1f936f3c23e9e61548e5594adcff/server/src/main/java/org/elasticsearch/common/util/ByteUtils.java#L22 to avoid using UNSAFE and perform reading in long batches, without relying on the unstable order field within the ByteBuffer.

morazow commented 4 months ago

Changed the ByteBuffer into local variable.

Results are kept as expected:

Benchmark                                         (filename)  Mode  Cnt    Score    Error  Units
FindingSemicolonBenchmark.linearScan   measurements-100K.txt  avgt   10  847,067 ± 14,452  us/op
FindingSemicolonBenchmark.swarLamport  measurements-100K.txt  avgt   10  833,276 ±  6,508  us/op
FindingSemicolonBenchmark.swarMycroft  measurements-100K.txt  avgt   10  751,737 ±  9,266  us/op
FindingSemicolonBenchmark.vectorAPI    measurements-100K.txt  avgt   10   40,798 ±  0,130  us/op

But using VarHandles added some overhead to the SWAR methods.

Results:

Benchmark                                         (filename)  Mode  Cnt    Score    Error  Units
FindingSemicolonBenchmark.linearScan   measurements-100K.txt  avgt   10  839,720 ±  4,026  us/op
FindingSemicolonBenchmark.swarLamport  measurements-100K.txt  avgt   10  955,048 ±  7,246  us/op
FindingSemicolonBenchmark.swarMycroft  measurements-100K.txt  avgt   10  859,489 ± 15,160  us/op
FindingSemicolonBenchmark.vectorAPI    measurements-100K.txt  avgt   10   41,702 ±  0,277  us/op

Diff:

diff --git a/src/main/java/com/morazow/benchmark/FindingSemicolonBenchmark.java b/src/main/java/com/morazow/benchmark/FindingSemicolonBenchmark.java
index fcf492a..142045c 100644
--- a/src/main/java/com/morazow/benchmark/FindingSemicolonBenchmark.java
+++ b/src/main/java/com/morazow/benchmark/FindingSemicolonBenchmark.java
@@ -72,6 +72,8 @@ public class FindingSemicolonBenchmark {
     }

     private static final long SEMICOLON_PATTERN = 0x3B3B3B3B3B3B3B3BL;
+    private static final VarHandle LITTLE_ENDIAN_LONG = MethodHandles.byteArrayViewVarHandle(long[].class,
+            ByteOrder.LITTLE_ENDIAN);

     private static int findFirstByteLamport(long word) {
         long input = word ^ SEMICOLON_PATTERN;
@@ -83,10 +85,9 @@ public class FindingSemicolonBenchmark {
     @Benchmark
     public void swarLamport(Measurements measurements) {
         int numOfSemicolons = 0;
-        final ByteBuffer byteBuffer = ByteBuffer.wrap(measurements.bytes).order(ByteOrder.LITTLE_ENDIAN);
         int i = 0;
         while (i + Long.BYTES <= measurements.bytes.length) {
-            long word = byteBuffer.getLong(i);
+            long word = (long) LITTLE_ENDIAN_LONG.get(measurements.bytes, i);
             int index = findFirstByteLamport(word);
             if (index < Long.BYTES) {
                 i += index + 1;
@@ -114,10 +115,9 @@ public class FindingSemicolonBenchmark {
     @Benchmark
     public void swarMycroft(Measurements measurements) {
         int numOfSemicolons = 0;
-        final ByteBuffer byteBuffer = ByteBuffer.wrap(measurements.bytes).order(ByteOrder.LITTLE_ENDIAN);
         int i = 0;
         while (i + Long.BYTES <= measurements.bytes.length) {
-            long word = byteBuffer.getLong(i);
+            long word = (long) LITTLE_ENDIAN_LONG.get(measurements.bytes, i);
             int index = findFirstByteMycroft(word);
             if (index < Long.BYTES) {
                 i += index + 1;
morazow commented 4 months ago

Thanks for the feedback @franz1981, I have added your suggestion in this PR https://github.com/morazow/java-simd-benchmarks/pull/2

franz1981 commented 4 months ago

measurements.bytes is accessing bytes from measurements: you should save var bytes[] = measurements.bytes out of such loop, instead.

Not only: ideally you want to use native endianess, if possible.

Another thing: make sure findfirstByteLamport to be inlined, by PrintInlining and searching if is inlined.

yet another: try using a "counter loop" ie use a for loop which is clearly using int variables and a finite number of iterations, in a very clear way.

Now, very last thing: try having random inputs, if possible, by:

morazow commented 4 months ago

Ohh great!

I think I misunderstood you, yes saving bytes to local variable makes sense.

try having random inputs, if possible, by:

I saw similar method in Richard Startin's blog, thought maybe too much for this benchmark. But I am going to check it again, using single predefined file skews the benchmark for sure.

franz1981 commented 4 months ago

I saw similar method in Richard Startin's blog,

yeah, that thing is the key reason to get branch misprediction and SWAR/Vector to play some role. The fact you don't have an advantage by using SWAR is sign that something is off there...