I think there may be a misunderstanding. ByteBuffer.duplicate() only creates a copy of the ByteBuffer object; it does not create a copy of the underlying data. Do you have any profiling to show that this change significantly impacts the number of allocations? Similarly, as duplicate() is such a light operation (creating a tiny object) compared to everything else that the reader does, I am a bit skeptical that this is much of a win for multithreaded applications.

The ByteBuffer get*() methods are not documented as thread safe, which is why the BufferHolder was added in the first place.
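To make the distinction concrete, here is a throwaway snippet (not code from the reader; the names are made up) showing what duplicate() does and does not copy:

import java.nio.ByteBuffer;

public class DuplicateDemo {
    public static void main(String[] args) {
        ByteBuffer original = ByteBuffer.wrap(new byte[] {1, 2, 3, 4});

        // duplicate() only allocates a small wrapper object; the byte content is shared.
        ByteBuffer dup = original.duplicate();

        // The two buffers have independent positions, limits, and marks...
        dup.position(2);
        System.out.println(original.position()); // prints 0
        System.out.println(dup.position());      // prints 2

        // ...but they read and write the same underlying bytes.
        original.put(0, (byte) 42);
        System.out.println(dup.get(0));          // prints 42
    }
}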
Edit: I re-ran the branch benchmark a few times and took the best one.
This also appears significantly slower than master for me using the benchmark in the repo.
Master:
No caching
Warming up
Requests per second: 133651
Requests per second: 155788
Requests per second: 146648
Benchmarking
Requests per second: 157624
Requests per second: 161356
Requests per second: 164758
Requests per second: 165471
Requests per second: 163246
With caching
Warming up
Requests per second: 672478
Requests per second: 820981
Requests per second: 819900
Benchmarking
Requests per second: 816492
Requests per second: 837111
Requests per second: 837186
Requests per second: 834656
Requests per second: 827388
This branch:
No caching
Warming up
Requests per second: 91969
Requests per second: 91193
Requests per second: 97935
Benchmarking
Requests per second: 103758
Requests per second: 103199
Requests per second: 102987
Requests per second: 102629
Requests per second: 103557
With caching
Warming up
Requests per second: 546258
Requests per second: 598943
Requests per second: 602783
Benchmarking
Requests per second: 602465
Requests per second: 608707
Requests per second: 606630
Requests per second: 608863
Requests per second: 604225
@oschwald well, I built my benchmark on OpenJDK's JMH.
I ran it as
java -jar benchmark/target/benchmarks.jar -t 1 -f 1 -tu us -bm avgt -prof gc -jvmArgs -DmaxmindFile=$(pwd)/GeoIP2-City.mmdb MaxmindBenchmark.city_16
on
JDK 9.0.4, Java HotSpot(TM) 64-Bit Server VM, 9.0.4+11
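The benchmark class itself isn't shown here, but the general shape is a plain JMH average-time benchmark along these lines (a simplified sketch only; the real MaxmindBenchmark has more setup, and the addresses, loop count, and the maxmindFile system property handling below are guesses):

import java.io.File;
import java.net.InetAddress;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

import com.maxmind.db.Reader;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class MaxmindBenchmark {

    // Matches the MEMORY / MEMORY_MAPPED columns in the results below.
    @Param({"MEMORY", "MEMORY_MAPPED"})
    public String mode;

    private Reader reader;
    private InetAddress[] addresses;

    @Setup
    public void setup() throws Exception {
        // -DmaxmindFile points at the GeoIP2-City.mmdb database.
        reader = new Reader(new File(System.getProperty("maxmindFile")),
                            Reader.FileMode.valueOf(mode));
        addresses = new InetAddress[16];
        for (int i = 0; i < addresses.length; i++) {
            addresses[i] = InetAddress.getByName("128.101.101." + i);
        }
    }

    @Benchmark
    public Object city_16() throws Exception {
        Object result = null;
        for (InetAddress address : addresses) {
            result = reader.get(address);
        }
        return result; // returning the result keeps the JIT from discarding the lookups
    }
}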
Short summary: my benchmark confirms that it is slower (30 us vs 25 us on master), but it allocates less memory (1.2 GB/sec on master vs 0.9 GB/sec on this branch) and incurs about half as many GC cycles.

Regarding the performance cost: I'm looking into it right now.

And you mentioned a good point about the thread safety of ByteBuffer.get(..). I expect that it is thread safe (because it is a read without any side effects), but I don't have any proof yet.
master
Benchmark (mode) Mode Cnt Score Error Units
MaxmindBenchmark.city_16 MEMORY avgt 5 25.248 ± 2.455 us/op
MaxmindBenchmark.city_16:·gc.alloc.rate MEMORY avgt 5 1139.718 ± 108.498 MB/sec
MaxmindBenchmark.city_16:·gc.alloc.rate.norm MEMORY avgt 5 31680.001 ± 0.001 B/op
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods' MEMORY avgt 5 0.015 ± 0.019 MB/sec
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods'.norm MEMORY avgt 5 0.416 ± 0.514 B/op
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen MEMORY avgt 5 1141.999 ± 112.903 MB/sec
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen.norm MEMORY avgt 5 31742.894 ± 338.623 B/op
MaxmindBenchmark.city_16:·gc.count MEMORY avgt 5 822.000 counts
MaxmindBenchmark.city_16:·gc.time MEMORY avgt 5 616.000 ms
MaxmindBenchmark.city_16 MEMORY_MAPPED avgt 5 23.931 ± 2.005 us/op
MaxmindBenchmark.city_16:·gc.alloc.rate MEMORY_MAPPED avgt 5 1203.032 ± 98.991 MB/sec
MaxmindBenchmark.city_16:·gc.alloc.rate.norm MEMORY_MAPPED avgt 5 31696.001 ± 0.001 B/op
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods' MEMORY_MAPPED avgt 5 0.015 ± 0.002 MB/sec
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods'.norm MEMORY_MAPPED avgt 5 0.384 ± 0.046 B/op
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen MEMORY_MAPPED avgt 5 1206.010 ± 100.458 MB/sec
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen.norm MEMORY_MAPPED avgt 5 31774.357 ± 272.890 B/op
MaxmindBenchmark.city_16:·gc.count MEMORY_MAPPED avgt 5 576.000 counts
MaxmindBenchmark.city_16:·gc.time MEMORY_MAPPED avgt 5 449.000 ms
this branch:
Benchmark (mode) Mode Cnt Score Error Units
MaxmindBenchmark.city_16 MEMORY avgt 5 29.200 ± 1.106 us/op
MaxmindBenchmark.city_16:·gc.alloc.rate MEMORY avgt 5 923.087 ± 35.062 MB/sec
MaxmindBenchmark.city_16:·gc.alloc.rate.norm MEMORY avgt 5 29688.001 ± 0.001 B/op
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods' MEMORY avgt 5 0.009 ± 0.004 MB/sec
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods'.norm MEMORY avgt 5 0.279 ± 0.144 B/op
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen MEMORY avgt 5 924.952 ± 39.864 MB/sec
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen.norm MEMORY avgt 5 29747.756 ± 336.204 B/op
MaxmindBenchmark.city_16:·gc.count MEMORY avgt 5 450.000 counts
MaxmindBenchmark.city_16:·gc.time MEMORY avgt 5 327.000 ms
MaxmindBenchmark.city_16 MEMORY_MAPPED avgt 5 30.094 ± 4.262 us/op
MaxmindBenchmark.city_16:·gc.alloc.rate MEMORY_MAPPED avgt 5 894.222 ± 123.418 MB/sec
MaxmindBenchmark.city_16:·gc.alloc.rate.norm MEMORY_MAPPED avgt 5 29608.001 ± 0.001 B/op
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods' MEMORY_MAPPED avgt 5 0.006 ± 0.004 MB/sec
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods'.norm MEMORY_MAPPED avgt 5 0.209 ± 0.147 B/op
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen MEMORY_MAPPED avgt 5 896.066 ± 122.754 MB/sec
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen.norm MEMORY_MAPPED avgt 5 29669.363 ± 257.249 B/op
MaxmindBenchmark.city_16:·gc.count MEMORY_MAPPED avgt 5 428.000 counts
MaxmindBenchmark.city_16:·gc.time MEMORY_MAPPED avgt 5 330.000 ms
@oschwald are we sure the benchmark is correct? Might be a better idea to use JMH or similar. Unless TRACE is on, nothing happens to the variable t. It can easily be eliminated via escape analysis, with the whole block disappearing. There could be other side effects. It also doesn't minimize what's getting tested - random and inet are part of the timing. These should be taken out, created up front, and stored in an array to bench this properly.
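Roughly what I mean, as a sketch (not the actual benchmark code; the class and field names are illustrative):

import java.io.File;
import java.net.InetAddress;
import java.util.Random;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import com.maxmind.db.Reader;

@State(Scope.Benchmark)
public class LookupBenchmark {

    private Reader reader;
    private InetAddress[] addresses;
    private int next;

    @Setup
    public void setup() throws Exception {
        reader = new Reader(new File(System.getProperty("maxmindFile")));
        // Generate the addresses up front so Random and InetAddress creation
        // are not part of the measured region.
        Random random = new Random(42);
        addresses = new InetAddress[1024];
        for (int i = 0; i < addresses.length; i++) {
            byte[] ip = new byte[4];
            random.nextBytes(ip);
            addresses[i] = InetAddress.getByAddress(ip);
        }
    }

    @Benchmark
    public void lookup(Blackhole bh) throws Exception {
        InetAddress address = addresses[next++ & (addresses.length - 1)];
        // Consume the result so dead-code elimination cannot remove the lookup
        // the way it can when the result is assigned to an unused local.
        bh.consume(reader.get(address));
    }
}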
About thread safety: ByteBuffer has two implementations, DirectByteBuffer and HeapByteBuffer. Let me show the get(int) implementation of each.

DirectByteBuffer:
public byte get(int i) {
    return ((unsafe.getByte(ix(checkIndex(i)))));
}

HeapByteBuffer:

public byte get(int i) {
    return hb[ix(checkIndex(i))];
}

where checkIndex is:

@HotSpotIntrinsicCandidate
final int checkIndex(int i) { // package-private
    if ((i < 0) || (i >= limit))
        throw new IndexOutOfBoundsException();
    return i;
}

and ix is:

protected int ix(int i) {
    return i + offset;
}
I guess this code is thread safe, isn't it?
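As a quick sanity check (just a throwaway stress test, not a proof of anything the specification guarantees), something like this hammers one shared buffer with absolute reads from several threads and verifies every value:

import java.nio.ByteBuffer;
import java.util.concurrent.*;

public class SharedBufferReadTest {
    public static void main(String[] args) throws Exception {
        byte[] data = new byte[1 << 16];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        // One shared buffer, never written to after this point.
        ByteBuffer shared = ByteBuffer.wrap(data).asReadOnlyBuffer();

        ExecutorService pool = Executors.newFixedThreadPool(8);
        Future<?>[] futures = new Future<?>[8];
        for (int t = 0; t < futures.length; t++) {
            futures[t] = pool.submit(() -> {
                for (int round = 0; round < 200; round++) {
                    for (int i = 0; i < data.length; i++) {
                        // Absolute get(int): no position/limit/mark updates, so this
                        // call does not touch any shared mutable state.
                        if (shared.get(i) != (byte) i) {
                            throw new AssertionError("bad read at " + i);
                        }
                    }
                }
            });
        }
        for (Future<?> f : futures) {
            f.get(); // propagates any AssertionError from the workers
        }
        pool.shutdown();
        System.out.println("ok");
    }
}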
I've made a force push with small changes that give some performance improvement.
Short summary:
MaxmindBenchmark.city_16 MEMORY avgt 5 23.339 ± 0.558 us/op
MaxmindBenchmark.city_16:·gc.alloc.rate MEMORY avgt 5 1026.642 ± 24.477 MB/sec
MaxmindBenchmark.city_16:·gc.alloc.rate.norm MEMORY avgt 5 26392.001 ± 0.001 B/op
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods' MEMORY avgt 5 0.020 ± 0.006 MB/sec
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods'.norm MEMORY avgt 5 0.523 ± 0.166 B/op
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen MEMORY avgt 5 1029.409 ± 39.679 MB/sec
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen.norm MEMORY avgt 5 26462.662 ± 417.828 B/op
MaxmindBenchmark.city_16:·gc.count MEMORY avgt 5 641.000 counts
MaxmindBenchmark.city_16:·gc.time MEMORY avgt 5 427.000 ms
MaxmindBenchmark.city_16 MEMORY_MAPPED avgt 5 23.734 ± 0.477 us/op
MaxmindBenchmark.city_16:·gc.alloc.rate MEMORY_MAPPED avgt 5 1135.790 ± 22.987 MB/sec
MaxmindBenchmark.city_16:·gc.alloc.rate.norm MEMORY_MAPPED avgt 5 29688.001 ± 0.001 B/op
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods' MEMORY_MAPPED avgt 5 0.038 ± 0.004 MB/sec
MaxmindBenchmark.city_16:·gc.churn.CodeHeap_'non-profiled_nmethods'.norm MEMORY_MAPPED avgt 5 0.987 ± 0.077 B/op
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen MEMORY_MAPPED avgt 5 1136.946 ± 22.487 MB/sec
MaxmindBenchmark.city_16:·gc.churn.G1_Old_Gen.norm MEMORY_MAPPED avgt 5 29718.294 ± 235.881 B/op
MaxmindBenchmark.city_16:·gc.count MEMORY_MAPPED avgt 5 543.000 counts
MaxmindBenchmark.city_16:·gc.time MEMORY_MAPPED avgt 5 410.000 ms
Even if a particular implementation of get() is thread safe, we do not want to count on undocumented behavior. If we were willing to count on undocumented behavior, we could just assume that duplicate() is thread safe given the constraint that the original ByteBuffer is not written to. I don't think we can accept a PR that relies on undocumented thread safety assumptions.

This PR is likely slower as it causes an additional byte[] allocation when creating strings.
@hc-codersatlas, some of those are valid concerns but they do not explain the repeatable slowdown with this branch.
@oschwald,

This PR is likely slower as it causes an additional byte[] allocation when creating strings.

Nope, it doesn't, because UTF_8.newDecoder().decode(buffer) internally calls CharBuffer.allocate(n), which is very similar to new byte[n]. But the last force push (16e58a7) contains a hack to sidestep this behaviour.
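For comparison, the two decoding paths under discussion look roughly like this (a simplified sketch, not the reader's actual code; offset handling in the real decoder is more involved):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import static java.nio.charset.StandardCharsets.UTF_8;

public class DecodePaths {
    // byte[] path: copy the bytes out of the buffer, then let String decode them.
    static String viaByteArray(ByteBuffer buffer, int offset, int size) {
        byte[] bytes = new byte[size];
        for (int i = 0; i < size; i++) {
            bytes[i] = buffer.get(offset + i);
        }
        return new String(bytes, UTF_8);
    }

    // CharsetDecoder path: no byte[] copy, but decode() internally allocates a
    // CharBuffer (roughly CharBuffer.allocate(n)), so it is not allocation-free either.
    static String viaDecoder(ByteBuffer buffer, int offset, int size) throws CharacterCodingException {
        ByteBuffer slice = buffer.duplicate();
        slice.position(offset).limit(offset + size);
        return UTF_8.newDecoder().decode(slice).toString();
    }
}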
I see you changed decodeString() in the force push, but the original was:

private String decodeString(AtomicInteger offset, int size) throws CharacterCodingException {
    byte[] bytes = getByteArray(offset, size);
    return new String(bytes, UTF_8);
}
It is the bytes allocation I was talking about. The new version improves the situation when the buffer is an array, but it doesn't help for the default memory-mapped mode, which is still much slower.
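In other words, the new version is roughly of this shape (my paraphrase, not the exact code from 16e58a7; buffer here is the reader's shared ByteBuffer and UTF_8 is StandardCharsets.UTF_8):

private String decodeString(AtomicInteger offset, int size) throws CharacterCodingException {
    if (buffer.hasArray()) {
        // Array-backed buffer (MEMORY mode): decode straight from the backing
        // array, skipping the temporary byte[] copy entirely.
        int start = offset.getAndAdd(size);
        return new String(buffer.array(), buffer.arrayOffset() + start, size, UTF_8);
    }
    // Direct buffer (default MEMORY_MAPPED mode): still copies into a byte[] first,
    // which is why that mode sees no improvement.
    byte[] bytes = getByteArray(offset, size);
    return new String(bytes, UTF_8);
}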
@oschwald OpenJDK/Oracle JDK has an optimisation to copy an array directly from unsafe memory, but it only applies when the array size is below some threshold. The threshold depends on the platform, but it is around 4, 6, or 8 bytes.
In other cases (when the string/array is longer) it works as a simple for loop.
Anyway, with the default settings (a direct buffer) both versions of the code call get() on the ByteBuffer a lot.
So this code has some performance improvement for one of the settings, and I think that is good enough.
Given this has stalled and it still has the issue I raised, I am going to go ahead and close it.
Before this commit, a lookup to the MaxMind DB:
Here I rewrote the decoder a bit to make it thread safe, and as a result I removed the duplicating of the buffer.