Open franz1981 opened 4 months ago
For reference: you can use https://github.com/elastic/elasticsearch/blob/1b7f3a0eae0e1f936f3c23e9e61548e5594adcff/server/src/main/java/org/elasticsearch/common/util/ByteUtils.java#L22 to avoid using UNSAFE and perform reading in long
batches, without relying on the unstable order
field within the ByteBuffer
.
Changed the ByteBuffer into local variable.
Results are kept as expected:
Benchmark (filename) Mode Cnt Score Error Units
FindingSemicolonBenchmark.linearScan measurements-100K.txt avgt 10 847,067 ± 14,452 us/op
FindingSemicolonBenchmark.swarLamport measurements-100K.txt avgt 10 833,276 ± 6,508 us/op
FindingSemicolonBenchmark.swarMycroft measurements-100K.txt avgt 10 751,737 ± 9,266 us/op
FindingSemicolonBenchmark.vectorAPI measurements-100K.txt avgt 10 40,798 ± 0,130 us/op
But using VarHandles added some overhead to the SWAR methods.
Results:
Benchmark (filename) Mode Cnt Score Error Units
FindingSemicolonBenchmark.linearScan measurements-100K.txt avgt 10 839,720 ± 4,026 us/op
FindingSemicolonBenchmark.swarLamport measurements-100K.txt avgt 10 955,048 ± 7,246 us/op
FindingSemicolonBenchmark.swarMycroft measurements-100K.txt avgt 10 859,489 ± 15,160 us/op
FindingSemicolonBenchmark.vectorAPI measurements-100K.txt avgt 10 41,702 ± 0,277 us/op
Diff:
diff --git a/src/main/java/com/morazow/benchmark/FindingSemicolonBenchmark.java b/src/main/java/com/morazow/benchmark/FindingSemicolonBenchmark.java
index fcf492a..142045c 100644
--- a/src/main/java/com/morazow/benchmark/FindingSemicolonBenchmark.java
+++ b/src/main/java/com/morazow/benchmark/FindingSemicolonBenchmark.java
@@ -72,6 +72,8 @@ public class FindingSemicolonBenchmark {
}
private static final long SEMICOLON_PATTERN = 0x3B3B3B3B3B3B3B3BL;
+ private static final VarHandle LITTLE_ENDIAN_LONG = MethodHandles.byteArrayViewVarHandle(long[].class,
+ ByteOrder.LITTLE_ENDIAN);
private static int findFirstByteLamport(long word) {
long input = word ^ SEMICOLON_PATTERN;
@@ -83,10 +85,9 @@ public class FindingSemicolonBenchmark {
@Benchmark
public void swarLamport(Measurements measurements) {
int numOfSemicolons = 0;
- final ByteBuffer byteBuffer = ByteBuffer.wrap(measurements.bytes).order(ByteOrder.LITTLE_ENDIAN);
int i = 0;
while (i + Long.BYTES <= measurements.bytes.length) {
- long word = byteBuffer.getLong(i);
+ long word = (long) LITTLE_ENDIAN_LONG.get(measurements.bytes, i);
int index = findFirstByteLamport(word);
if (index < Long.BYTES) {
i += index + 1;
@@ -114,10 +115,9 @@ public class FindingSemicolonBenchmark {
@Benchmark
public void swarMycroft(Measurements measurements) {
int numOfSemicolons = 0;
- final ByteBuffer byteBuffer = ByteBuffer.wrap(measurements.bytes).order(ByteOrder.LITTLE_ENDIAN);
int i = 0;
while (i + Long.BYTES <= measurements.bytes.length) {
- long word = byteBuffer.getLong(i);
+ long word = (long) LITTLE_ENDIAN_LONG.get(measurements.bytes, i);
int index = findFirstByteMycroft(word);
if (index < Long.BYTES) {
i += index + 1;
Thanks for the feedback @franz1981, I have added your suggestion in this PR https://github.com/morazow/java-simd-benchmarks/pull/2
measurements.bytes
is accessing bytes
from measurements
: you should save var bytes[] = measurements.bytes
out of such loop, instead.
Not only: ideally you want to use native endianess, if possible.
Another thing: make sure findfirstByteLamport
to be inlined, by PrintInlining
and searching if is inlined.
yet another: try using a "counter loop" ie use a for loop which is clearly using int
variables and a finite number of iterations, in a very clear way.
Now, very last thing: try having random inputs, if possible, by:
Ohh great!
I think I misunderstood you, yes saving bytes to local variable makes sense.
try having random inputs, if possible, by:
I saw similar method in Richard Startin's blog, thought maybe too much for this benchmark. But I am going to check it again, using single predefined file skews the benchmark for sure.
I saw similar method in Richard Startin's blog,
yeah, that thing is the key reason to get branch misprediction and SWAR/Vector to play some role. The fact you don't have an advantage by using SWAR is sign that something is off there...
Try fixing the SWAR implementations by re-using what is done at https://github.com/netty/netty/pull/10737
(including the micro-benchmarks)