I think there may be a misunderstanding. ByteBuffer.duplicate() only creates a copy of the ByteBuffer object; it does not create a copy of the underlying data. Do you have any profiling to show that this change significantly impacts the number of allocations? Similarly, as duplicate() is such a light operation (creating a tiny object) compared to everything else that the reader does, I am a bit skeptical that this is much of a win for multithreaded applications.

The ByteBuffer get*() methods are not documented as thread safe, which is why the BufferHolder was added in the first place.
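To make the distinction concrete, here is a throwaway snippet (not code from the reader; the names are made up) showing what duplicate() does and does not copy:

import java.nio.ByteBuffer;

public class DuplicateDemo {
    public static void main(String[] args) {
        ByteBuffer original = ByteBuffer.wrap(new byte[] {1, 2, 3, 4});

        // duplicate() only allocates a small wrapper object; the byte content is shared.
        ByteBuffer dup = original.duplicate();

        // The two buffers have independent positions, limits, and marks...
        dup.position(2);
        System.out.println(original.position()); // prints 0
        System.out.println(dup.position());      // prints 2

        // ...but they read and write the same underlying bytes.
        original.put(0, (byte) 42);
        System.out.println(dup.get(0));          // prints 42
    }
}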
Edit: I re-ran the branch benchmark a few times and took the best one.
This also appears significantly slower than master for me using the benchmark in the repo.
Master:
No caching
Warming up
Requests per second: 133651
Requests per second: 155788
Requests per second: 146648
Benchmarking
Requests per second: 157624
Requests per second: 161356
Requests per second: 164758
Requests per second: 165471
Requests per second: 163246
With caching
Warming up
Requests per second: 672478
Requests per second: 820981
Requests per second: 819900
Benchmarking
Requests per second: 816492
Requests per second: 837111
Requests per second: 837186
Requests per second: 834656
Requests per second: 827388
This branch:
No caching
Warming up
Requests per second: 91969
Requests per second: 91193
Requests per second: 97935
Benchmarking
Requests per second: 103758
Requests per second: 103199
Requests per second: 102987
Requests per second: 102629
Requests per second: 103557
With caching
Warming up
Requests per second: 546258
Requests per second: 598943
Requests per second: 602783
Benchmarking
Requests per second: 602465
Requests per second: 608707
Requests per second: 606630
Requests per second: 608863
Requests per second: 604225
@oschwald well, I built my benchmark on OpenJDK's JMH.
I ran it as
java -jar benchmark/target/benchmarks.jar -t 1 -f 1 -tu us -bm avgt -prof gc -jvmArgs -DmaxmindFile=$(pwd)/GeoIP2-City.mmdb MaxmindBenchmark.city_16
on
JDK 9.0.4, Java HotSpot(TM) 64-Bit Server VM, 9.0.4+11
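The benchmark class itself isn't shown here, but the general shape is a plain JMH average-time benchmark along these lines (a simplified sketch only; the real MaxmindBenchmark has more setup, and the addresses, loop count, and the maxmindFile system property handling below are guesses):

import java.io.File;
import java.net.InetAddress;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

import com.maxmind.db.Reader;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class MaxmindBenchmark {

    // Matches the MEMORY / MEMORY_MAPPED columns in the results below.
    @Param({"MEMORY", "MEMORY_MAPPED"})
    public String mode;

    private Reader reader;
    private InetAddress[] addresses;

    @Setup
    public void setup() throws Exception {
        // -DmaxmindFile points at the GeoIP2-City.mmdb database.
        reader = new Reader(new File(System.getProperty("maxmindFile")),
                            Reader.FileMode.valueOf(mode));
        addresses = new InetAddress[16];
        for (int i = 0; i < addresses.length; i++) {
            addresses[i] = InetAddress.getByName("128.101.101." + i);
        }
    }

    @Benchmark
    public Object city_16() throws Exception {
        Object result = null;
        for (InetAddress address : addresses) {
            result = reader.get(address);
        }
        return result; // returning the result keeps the JIT from discarding the lookups
    }
}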
Short summary: my benchmark confirms that it is slower (30 us vs 25 us on master), but it allocates less memory (1.2 GB/sec on master vs 0.9 GB/sec on this branch) and incurs about half as many GC cycles.

Regarding the performance cost: I'm looking into it right now.

And you mentioned a good point about the thread safety of ByteBuffer.get(..). I expect that it is thread safe (because it is a read without any side effects), but I don't have any proof yet.
master
Benchmark (mode) Mode Cnt Score Error Units
MaxmindBenchmark.city_16 MEMORY avgt 5 25.248 ± 2.455 us/op
MaxmindBenchmark.city_16:·gc.alloc.rate MEMORY avgt 5 1139.718 ± 108.498 MB/sec
MaxmindBenchmark.city_16:·gc.alloc.rate.norm MEMORY avgt 5 31680.001 ± 0.001 B/op
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods' MEMORY avgt 5 0.015 ± 0.019 MB/sec
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods'.norm MEMORY avgt 5 0.416 ± 0.514 B/op
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen MEMORY avgt 5 1141.999 ± 112.903 MB/sec
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen.norm MEMORY avgt 5 31742.894 ± 338.623 B/op
MaxmindBenchmark.city_16:·gc.count MEMORY avgt 5 822.000 counts
MaxmindBenchmark.city_16:·gc.time MEMORY avgt 5 616.000 ms
MaxmindBenchmark.city_16 MEMORY_MAPPED avgt 5 23.931 ± 2.005 us/op
MaxmindBenchmark.city_16:·gc.alloc.rate MEMORY_MAPPED avgt 5 1203.032 ± 98.991 MB/sec
MaxmindBenchmark.city_16:·gc.alloc.rate.norm MEMORY_MAPPED avgt 5 31696.001 ± 0.001 B/op
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods' MEMORY_MAPPED avgt 5 0.015 ± 0.002 MB/sec
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods'.norm MEMORY_MAPPED avgt 5 0.384 ± 0.046 B/op
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen MEMORY_MAPPED avgt 5 1206.010 ± 100.458 MB/sec
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen.norm MEMORY_MAPPED avgt 5 31774.357 ± 272.890 B/op
MaxmindBenchmark.city_16:·gc.count MEMORY_MAPPED avgt 5 576.000 counts
MaxmindBenchmark.city_16:·gc.time MEMORY_MAPPED avgt 5 449.000 ms
this branch:
Benchmark (mode) Mode Cnt Score Error Units
MaxmindBenchmark.city_16 MEMORY avgt 5 29.200 ± 1.106 us/op
MaxmindBenchmark.city_16:·gc.alloc.rate MEMORY avgt 5 923.087 ± 35.062 MB/sec
MaxmindBenchmark.city_16:·gc.alloc.rate.norm MEMORY avgt 5 29688.001 ± 0.001 B/op
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods' MEMORY avgt 5 0.009 ± 0.004 MB/sec
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods'.norm MEMORY avgt 5 0.279 ± 0.144 B/op
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen MEMORY avgt 5 924.952 ± 39.864 MB/sec
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen.norm MEMORY avgt 5 29747.756 ± 336.204 B/op
MaxmindBenchmark.city_16:·gc.count MEMORY avgt 5 450.000 counts
MaxmindBenchmark.city_16:·gc.time MEMORY avgt 5 327.000 ms
MaxmindBenchmark.city_16 MEMORY_MAPPED avgt 5 30.094 ± 4.262 us/op
MaxmindBenchmark.city_16:·gc.alloc.rate MEMORY_MAPPED avgt 5 894.222 ± 123.418 MB/sec
MaxmindBenchmark.city_16:·gc.alloc.rate.norm MEMORY_MAPPED avgt 5 29608.001 ± 0.001 B/op
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods' MEMORY_MAPPED avgt 5 0.006 ± 0.004 MB/sec
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods'.norm MEMORY_MAPPED avgt 5 0.209 ± 0.147 B/op
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen MEMORY_MAPPED avgt 5 896.066 ± 122.754 MB/sec
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen.norm MEMORY_MAPPED avgt 5 29669.363 ± 257.249 B/op
MaxmindBenchmark.city_16:·gc.count MEMORY_MAPPED avgt 5 428.000 counts
MaxmindBenchmark.city_16:·gc.time MEMORY_MAPPED avgt 5 330.000 ms
@oschwald are we sure the benchmark is correct? Might be a better idea to use JMH or similar. Unless TRACE is on, nothing happens to the variable t. It can easily be eliminated via escape analysis, with the whole block disappearing. There could be other side effects. It also doesn't minimize what's getting tested - random and inet are part of the timing. These should be taken out, created up front, and stored in an array to bench this properly.
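Roughly what I mean, as a sketch (not the actual benchmark code; the class and field names are illustrative):

import java.io.File;
import java.net.InetAddress;
import java.util.Random;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import com.maxmind.db.Reader;

@State(Scope.Benchmark)
public class LookupBenchmark {

    private Reader reader;
    private InetAddress[] addresses;
    private int next;

    @Setup
    public void setup() throws Exception {
        reader = new Reader(new File(System.getProperty("maxmindFile")));
        // Generate the addresses up front so Random and InetAddress creation
        // are not part of the measured region.
        Random random = new Random(42);
        addresses = new InetAddress[1024];
        for (int i = 0; i < addresses.length; i++) {
            byte[] ip = new byte[4];
            random.nextBytes(ip);
            addresses[i] = InetAddress.getByAddress(ip);
        }
    }

    @Benchmark
    public void lookup(Blackhole bh) throws Exception {
        InetAddress address = addresses[next++ & (addresses.length - 1)];
        // Consume the result so dead-code elimination cannot remove the lookup
        // the way it can when the result is assigned to an unused local.
        bh.consume(reader.get(address));
    }
}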
About thread safety: ByteBuffer has two implementations, DirectByteBuffer and HeapByteBuffer. Let me show the get(int) implementation of each.

DirectByteBuffer:
public byte get(int i) {
    return ((unsafe.getByte(ix(checkIndex(i)))));
}

HeapByteBuffer:

public byte get(int i) {
    return hb[ix(checkIndex(i))];
}

where checkIndex is:

@HotSpotIntrinsicCandidate
final int checkIndex(int i) { // package-private
    if ((i < 0) || (i >= limit))
        throw new IndexOutOfBoundsException();
    return i;
}

and ix is:

protected int ix(int i) {
    return i + offset;
}
I guess this code is thread safe, isn't it?
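As a quick sanity check (just a throwaway stress test, not a proof of anything the specification guarantees), something like this hammers one shared buffer with absolute reads from several threads and verifies every value:

import java.nio.ByteBuffer;
import java.util.concurrent.*;

public class SharedBufferReadTest {
    public static void main(String[] args) throws Exception {
        byte[] data = new byte[1 << 16];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        // One shared buffer, never written to after this point.
        ByteBuffer shared = ByteBuffer.wrap(data).asReadOnlyBuffer();

        ExecutorService pool = Executors.newFixedThreadPool(8);
        Future<?>[] futures = new Future<?>[8];
        for (int t = 0; t < futures.length; t++) {
            futures[t] = pool.submit(() -> {
                for (int round = 0; round < 200; round++) {
                    for (int i = 0; i < data.length; i++) {
                        // Absolute get(int): no position/limit/mark updates, so this
                        // call does not touch any shared mutable state.
                        if (shared.get(i) != (byte) i) {
                            throw new AssertionError("bad read at " + i);
                        }
                    }
                }
            });
        }
        for (Future<?> f : futures) {
            f.get(); // propagates any AssertionError from the workers
        }
        pool.shutdown();
        System.out.println("ok");
    }
}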
I've made a force push with small changes that give some performance improvement.
Short summary:
MaxmindBenchmark.city_16 MEMORY avgt 5 23.339 ± 0.558 us/op
MaxmindBenchmark.city_16:·gc.alloc.rate MEMORY avgt 5 1026.642 ± 24.477 MB/sec
MaxmindBenchmark.city_16:·gc.alloc.rate.norm MEMORY avgt 5 26392.001 ± 0.001 B/op
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods' MEMORY avgt 5 0.020 ± 0.006 MB/sec
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods'.norm MEMORY avgt 5 0.523 ± 0.166 B/op
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen MEMORY avgt 5 1029.409 ± 39.679 MB/sec
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen.norm MEMORY avgt 5 26462.662 ± 417.828 B/op
MaxmindBenchmark.city_16:·gc.count MEMORY avgt 5 641.000 counts
MaxmindBenchmark.city_16:·gc.time MEMORY avgt 5 427.000 ms
MaxmindBenchmark.city_16 MEMORY_MAPPED avgt 5 23.734 ± 0.477 us/op
MaxmindBenchmark.city_16:·gc.alloc.rate MEMORY_MAPPED avgt 5 1135.790 ± 22.987 MB/sec
MaxmindBenchmark.city_16:·gc.alloc.rate.norm MEMORY_MAPPED avgt 5 29688.001 ± 0.001 B/op
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods' MEMORY_MAPPED avgt 5 0.038 ± 0.004 MB/sec
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods'.norm MEMORY_MAPPED avgt 5 0.987 ± 0.077 B/op
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen MEMORY_MAPPED avgt 5 1136.946 ± 22.487 MB/sec
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen.norm MEMORY_MAPPED avgt 5 29718.294 ± 235.881 B/op
MaxmindBenchmark.city_16:·gc.count MEMORY_MAPPED avgt 5 543.000 counts
MaxmindBenchmark.city_16:·gc.time MEMORY_MAPPED avgt 5 410.000 ms
Even if a particular implementation of get() is thread safe, we do not want to count on undocumented behavior. If we were willing to count on undocumented behavior, we could just assume that duplicate() is thread safe given the constraint that the original ByteBuffer is not written to. I don't think we can accept a PR that relies on undocumented thread safety assumptions.

This PR is likely slower as it causes an additional byte[] allocation when creating strings.
@hc-codersatlas, some of those are valid concerns but they do not explain the repeatable slowdown with this branch.
@oschwald,

This PR is likely slower as it causes an additional byte[] allocation when creating strings.

Nope, it doesn't, because UTF_8.newDecoder().decode(buffer) internally calls CharBuffer.allocate(n), which is very similar to new byte[n]. But the last force push (16e58a7) contains a hack to sidestep this behaviour.
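For comparison, the two decoding paths under discussion look roughly like this (a simplified sketch, not the reader's actual code; offset handling in the real decoder is more involved):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import static java.nio.charset.StandardCharsets.UTF_8;

public class DecodePaths {
    // byte[] path: copy the bytes out of the buffer, then let String decode them.
    static String viaByteArray(ByteBuffer buffer, int offset, int size) {
        byte[] bytes = new byte[size];
        for (int i = 0; i < size; i++) {
            bytes[i] = buffer.get(offset + i);
        }
        return new String(bytes, UTF_8);
    }

    // CharsetDecoder path: no byte[] copy, but decode() internally allocates a
    // CharBuffer (roughly CharBuffer.allocate(n)), so it is not allocation-free either.
    static String viaDecoder(ByteBuffer buffer, int offset, int size) throws CharacterCodingException {
        ByteBuffer slice = buffer.duplicate();
        slice.position(offset).limit(offset + size);
        return UTF_8.newDecoder().decode(slice).toString();
    }
}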
I see you changed decodeString() in the force push, but the original was:

private String decodeString(AtomicInteger offset, int size) throws CharacterCodingException {
    byte[] bytes = getByteArray(offset, size);
    return new String(bytes, UTF_8);
}
It is the bytes allocation I was talking about. The new version improves the situation when the buffer is an array, but it doesn't help for the default memory-mapped mode, which is still much slower.
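In other words, the new version is roughly of this shape (my paraphrase, not the exact code from 16e58a7; buffer here is the reader's shared ByteBuffer and UTF_8 is StandardCharsets.UTF_8):

private String decodeString(AtomicInteger offset, int size) throws CharacterCodingException {
    if (buffer.hasArray()) {
        // Array-backed buffer (MEMORY mode): decode straight from the backing
        // array, skipping the temporary byte[] copy entirely.
        int start = offset.getAndAdd(size);
        return new String(buffer.array(), buffer.arrayOffset() + start, size, UTF_8);
    }
    // Direct buffer (default MEMORY_MAPPED mode): still copies into a byte[] first,
    // which is why that mode sees no improvement.
    byte[] bytes = getByteArray(offset, size);
    return new String(bytes, UTF_8);
}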
@oschwald OpenJDK/Oracle JDK has an optimisation to copy an array directly from unsafe memory, but it only applies when the array size is below some threshold. The threshold depends on the platform, but it is around 4, 6, or 8 bytes.
In other cases (when the string/array is longer) it works as a simple for loop.
Anyway, with the default settings (a direct buffer) both versions of the code call get() on the ByteBuffer a lot.
So this code has some performance improvement for one of the settings, and I think that is good enough.
Given this has stalled and it still has the issue I raised, I am going to go ahead and close it.
Before this commit, a lookup to the MaxMind DB:
Here I rewrote the decoder a bit to make it thread safe, and as a result I removed the duplicating of the buffer.