intel / buxton

Hashmap cleaning should be more thorough #124

Open ikeydoherty opened 10 years ago

ikeydoherty commented 10 years ago

Due to our use of hashmap, we still show a large amount of memory reachable upon program exit. While some may argue it's not the biggest issue, since it's not directly lost memory, it would seem to indicate that the existing hashmap implementation is too fast and loose with allocation.

==3701== 28,672 bytes in 1 blocks are still reachable in loss record 1 of 2
==3701==    at 0x4C2B6A6: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==3701==    by 0x40722F: allocate_tile.part.1 (hashmap.c:94)
==3701==    by 0x4075C6: hashmap_put (hashmap.c:446)
==3701==    by 0x404100: main (main.c:120)
==3701== 
==3701== 159,744 bytes in 1 blocks are still reachable in loss record 2 of 2
==3701==    at 0x4C2B6A6: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==3701==    by 0x40722F: allocate_tile.part.1 (hashmap.c:94)
==3701==    by 0x407380: hashmap_new (hashmap.c:189)
==3701==    by 0x404099: main (main.c:112)
==3701== 
==3701== LEAK SUMMARY:
==3701==    definitely lost: 0 bytes in 0 blocks
==3701==    indirectly lost: 0 bytes in 0 blocks
==3701==      possibly lost: 0 bytes in 0 blocks
==3701==    still reachable: 188,416 bytes in 2 blocks
==3701==         suppressed: 0 bytes in 0 blocks

I'll follow up with a solution should I find one; simply tracking this here.

ikeydoherty commented 10 years ago

Note from a create-group operation with buxtonctl (the output above was just from running a no-op):

4-linux.so)
==3746==    by 0x40722F: allocate_tile.part.1 (hashmap.c:94)
==3746==    by 0x407380: hashmap_new (hashmap.c:189)
==3746==    by 0x404099: main (main.c:112)
==3746== 
==3746== 159,744 bytes in 1 blocks are still reachable in loss record 4 of 4
==3746==    at 0x4C2B6A6: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==3746==    by 0x4E3800F: allocate_tile.part.1 (hashmap.c:94)
==3746==    by 0x4E38160: hashmap_new (hashmap.c:189)
==3746==    by 0x4E39150: setup_callbacks (protocol.c:66)
==3746==    by 0x4E351B4: buxton_open (lbuxton.c:97)
==3746==    by 0x404A4E: main (main.c:291)
==3746== 
==3746== LEAK SUMMARY:
==3746==    definitely lost: 0 bytes in 0 blocks
==3746==    indirectly lost: 0 bytes in 0 blocks
==3746==      possibly lost: 0 bytes in 0 blocks
==3746==    still reachable: 376,832 bytes in 4 blocks
==3746==         suppressed: 0 bytes in 0 blocks
==3746== 
bryteise commented 10 years ago

It allocates quite large blocks to improve speed of use, avoiding frequent allocations early on (remember it comes from systemd's implementation, so they won't want reallocs hurting startup speed, given they use hashmap everywhere).

bryteise commented 10 years ago

Switching over to your hashmap would be the goal though I imagine. Is that in the TODO?

ikeydoherty commented 10 years ago

Either an eventual switch-over, or ensuring correct alignment throughout the lifetime of this hashmap. Tbh I'd prefer a new one :) It's not in the TODO; want me to open up a quick change?

ikeydoherty commented 10 years ago

Added just in case ^^

bryteise commented 10 years ago

You are welcome to push the TODO =).

ikeydoherty commented 10 years ago

Hah, you'd think I would remember I had push privs xD

ikeydoherty commented 9 years ago

So, currently working on my resize code, but I've gone with an aggressive-cache hashmap implementation. Now obviously there are going to be doubling resizes in between, but I've checked with 5 million keys (ints) on an initial 75%-capacity hashmap. As it's aggressive with cache in mind it's slightly heavier in terms of memory, but very very fast (this will slow down a small bit when I introduce resizes though :P)

Loop-int tests
==28153== 
==28153== HEAP SUMMARY:
==28153==     in use at exit: 0 bytes in 0 blocks
==28153==   total heap usage: 1,250,002 allocs, 1,250,002 frees, 160,000,024 bytes allocated
==28153== 
==28153== All heap blocks were freed -- no leaks are possible
==28153== 
==28153== For counts of detected and suppressed errors, rerun with: -v
==28153== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)
ikey@evolveos ~l/hashmap
$ time ./run.sh 
Test #1: 99/110
Returned: 110
Test #2: 98/136
Returned: 136
Test #3: 99/110
Returned: 110

Loop-int tests

real    0m0.497s
user    0m0.337s
sys 0m0.151s
bryteise commented 9 years ago

Lovely. How does the time compare to systemd's time?

ikeydoherty commented 9 years ago

Putting appropriate tests together :) To be fair I need to do one where systemd's hashmap is initially resized the same, and then one with my implementation doing resizes, which is why there aren't more numbers yet :D

ikeydoherty commented 9 years ago

Simple test, systemd hashmap vs ikey-hashmap

Store 5 million integers in one loop; in the next loop, assert that each restored value equals i. Highly trivial.
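
For reference, a minimal sketch of what that loop test looks like, assuming a hypothetical buxton_hashmap_new/put/get API plus the HASH_KEY()/UNHASH_VALUE() helpers seen later in this thread; the actual test code isn't shown here.

#include <assert.h>
#include <stdint.h>

#define NO_RUNS 5000000

static void loop_int_test(void)
{
    /* Assumed constructor taking the hash/compare function pointers. */
    BuxtonHashmap *map = buxton_hashmap_new(simple_hash, simple_compare);

    /* Store the integers 1..NO_RUNS-1... */
    for (uint32_t i = 1; i < NO_RUNS; i++)
        buxton_hashmap_put(map, HASH_KEY(i), HASH_VALUE(i));

    /* ...then assert each value comes back equal to its key. */
    for (uint32_t i = 1; i < NO_RUNS; i++)
        assert(UNHASH_VALUE(buxton_hashmap_get(map, HASH_KEY(i))) == i);

    buxton_hashmap_free(map);
}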

systemd valgrind:

==1864== Command: ./a.out
==1864== 
==1864== 
==1864== HEAP SUMMARY:
==1864==     in use at exit: 469,893,120 bytes in 15 blocks
==1864==   total heap usage: 27 allocs, 12 frees, 686,720,352 bytes allocated
==1864== 
==1864== 159,744 bytes in 1 blocks are still reachable in loss record 1 of 2
==1864==    at 0x4C2B6A6: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==1864==    by 0x4E34AE7: allocate_tile (hashmap.c:94)
==1864==    by 0x4E34CBD: hashmap_new (hashmap.c:184)
==1864==    by 0x4008FF: main (main.c:5)
==1864== 
==1864== 469,733,376 bytes in 14 blocks are still reachable in loss record 2 of 2
==1864==    at 0x4C2B6A6: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==1864==    by 0x4E34AE7: allocate_tile (hashmap.c:94)
==1864==    by 0x4E3574C: hashmap_put (hashmap.c:442)
==1864==    by 0x400927: main (main.c:9)
==1864== 
==1864== LEAK SUMMARY:
==1864==    definitely lost: 0 bytes in 0 blocks
==1864==    indirectly lost: 0 bytes in 0 blocks
==1864==      possibly lost: 0 bytes in 0 blocks
==1864==    still reachable: 469,893,120 bytes in 15 blocks
==1864==         suppressed: 0 bytes in 0 blocks
==1864== 
==1864== For counts of detected and suppressed errors, rerun with: -v
==1864== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)

ikeymap valgrind:

==1958== HEAP SUMMARY:
==1958==     in use at exit: 0 bytes in 0 blocks
==1958==   total heap usage: 22 allocs, 22 frees, 346,730,056 bytes allocated
==1958== 
==1958== All heap blocks were freed -- no leaks are possible
==1958== 
==1958== For counts of detected and suppressed errors, rerun with: -v
==1958== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)

systemd hashmap time (put/get)

ikey@evolveos ~/Intel/hashmap/systemd
$ time ./run.sh 

real    0m1.158s
user    0m0.878s
sys 0m0.275s

ikeymap time (put/get):

ikey@evolveos ~/Intel/hashmap
$ time ./run.sh 
Test #1: 99/110
Returned: 110
Test #2: 98/136
Returned: 136
Test #3: 99/110
Returned: 110

Loop-int tests

real    0m0.570s
user    0m0.343s
sys 0m0.225s

Basically mine wins :D I set them both to an initial capacity of 31 as found in systemd, though I prefer 61. Mine auto-resizes at 75%, and instead of memmove/memcpy voodoo we just reinsert into a fresh bigger map.
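
As an illustration of that resize-by-reinsertion idea, here's a deliberately simplified, self-contained sketch; the Map/Node types and the doubling factor here are stand-ins, not the actual BuxtonHashmap code.

#include <stdbool.h>
#include <stdlib.h>

typedef struct Node {
    unsigned int key;
    unsigned int value;
    struct Node *next;
} Node;

typedef struct Map {
    Node **buckets;          /* chain heads, one per bucket */
    unsigned int n_buckets;
    unsigned int size;       /* number of stored pairs */
} Map;

/* Once the 75% load factor is hit, build a bigger table and relink every
 * node into it, instead of doing memmove/memcpy surgery on the old one. */
static bool map_maybe_resize(Map *m)
{
    if (m->size < (m->n_buckets * 3) / 4)
        return true;

    unsigned int new_buckets = m->n_buckets * 2;
    Node **grown = calloc(new_buckets, sizeof(Node *));
    if (!grown)
        return false;

    for (unsigned int i = 0; i < m->n_buckets; i++) {
        Node *n = m->buckets[i];
        while (n) {
            Node *next = n->next;
            unsigned int b = n->key % new_buckets;  /* trivial integer hash */
            n->next = grown[b];
            grown[b] = n;
            n = next;
        }
    }

    free(m->buckets);
    m->buckets = grown;
    m->n_buckets = new_buckets;
    return true;
}

Reinsertion touches every entry once on a resize, but it keeps the insert path itself free of any copy tricks.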

Tomorrow I'll do tests for strings, which will be slower due to the different hash and strncmp. Though to be fair, I'll likely just use a simple DJB hash like systemd, meaning by nature I should win this one too :)
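
For reference, the classic DJB2 string hash is just h = h * 33 + c seeded with 5381; whether buxton ends up using exactly this variant is an assumption on my part.

#include <stdint.h>

/* Classic DJB2 string hash: h = h * 33 + c, seeded with 5381. */
static uint64_t djb2_hash(const char *str)
{
    uint64_t hash = 5381;
    int c;

    while ((c = (unsigned char)*str++) != 0)
        hash = (hash << 5) + hash + (uint64_t)c;

    return hash;
}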

I also have some check-style asserts at the beginning of my own test, so mine is technically doing more get/puts for assurances, and we free that map and create a new one after =)

bryteise commented 9 years ago

Looks good =).

ikeydoherty commented 9 years ago

Forgot to note: I switched to 4x, not 2x, as systemd does, to be fair. I've also now added hash/compare functions for both string and "simple" operations. To make it dead-even fair, I restricted the tests to the 5m int comparisons on both (identical now), using the function pointers for comparison and hashing...

ikey@evolveos ~/Intel/hashmap
$ time ./run.sh 

real    0m0.608s
user    0m0.410s
sys 0m0.197s

So, still not bad =)
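
A rough sketch of what that function-pointer indirection could look like; the typedef names follow the struct posted further down, but these exact signatures and the simple_* helpers are my assumptions. The string variants would pair a DJB-style hash (as above) with a strcmp-based compare.

#include <stdbool.h>
#include <stdint.h>

/* Assumed signatures; the real buxton typedefs may differ. */
typedef uint32_t (*hash_create_func)(const void *key);
typedef bool (*hash_compare_func)(const void *a, const void *b);

/* "Simple" (integer/pointer) keys: hash the value itself, compare by identity. */
static uint32_t simple_hash(const void *key)
{
    return (uint32_t)(uintptr_t)key;
}

static bool simple_compare(const void *a, const void *b)
{
    return a == b;
}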

bryteise commented 9 years ago

Woo! Now just need to run it under perf and find out where these spend most of their time =).

ikeydoherty commented 9 years ago

Yeah :D So as I suspected function pointers did slow it down. Kinda unavoidable, really.

Internally this is what my hashmap looks like..

typedef struct HKVEntry {
    HashKey key;
    HashValue value;
    struct HKVEntry *next;   /* next entry in this bucket's collision chain */
    struct HKVEntry *tail;   /* last entry in the chain, tracked on the head */
} HKVEntry;

struct BuxtonHashmap {
    unsigned int n_buckets;

    HKVEntry *entries;       /* flat array of n_buckets embedded entries */
    unsigned int size;
    bool resizing;

    hash_compare_func compare;
    hash_create_func create;
};

(typedefs for clarity) - the biggest change is obviously that it's more cache-aggressive: we allocate a whole bunch of buckets straight off, rather than an array of pointers we later populate on demand. I went with that method initially; too slow. Tracking the tail in instances of hash collisions (as we do in BuxtonList) proves quite helpful here too =)
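
To make the tail trick concrete, here's a sketch of how insertion could look against the structs above. The function name, the "key == 0 means empty embedded bucket" convention, and treating map->create as the hash function are assumptions, not the actual buxton code; duplicate-key handling and the 75% resize check are omitted.

#include <stdbool.h>
#include <stdlib.h>

static bool hashmap_put_sketch(BuxtonHashmap *map, HashKey key, HashValue value)
{
    unsigned int bucket = map->create(key) % map->n_buckets;
    HKVEntry *head = &map->entries[bucket];

    if (!head->key) {
        /* First hit on this bucket: the entry already lives inside the big
         * cache-friendly array, so no allocation happens at all. */
        head->key = key;
        head->value = value;
        head->tail = head;
        map->size++;
        return true;
    }

    /* Collision: append a heap entry directly after the recorded tail,
     * so inserts never walk the chain. */
    HKVEntry *entry = calloc(1, sizeof(HKVEntry));
    if (!entry)
        return false;

    entry->key = key;
    entry->value = value;
    head->tail->next = entry;
    head->tail = entry;
    map->size++;
    return true;
}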

I suspect our slowdowns come from traversal of shallow linked lists (the 75% "full rate" limits this quite a lot), but it's actually something we'll hit mostly on freeing the list. Leaky example (no end call to buxton_hashmap_free):

ikey@evolveos ~/Intel/hashmap
$ time ./run.sh 

real    0m0.565s
user    0m0.368s
sys 0m0.196s

No frees at all (resizing):

$ time ./run.sh 

real    0m0.546s
user    0m0.341s
sys 0m0.204s

The rest will come from get traversals

So I'd say that's probably the biggest bottleneck :)

bryteise commented 9 years ago

Rightyo, well that isn't too surprising which is good =).

ikeydoherty commented 9 years ago

Just need to put a replacement iterator function in and this can be pushed to replace the old hashmap (with new checks, etc.)

ikeydoherty commented 9 years ago

OK: 5m set/get, then an iterator pass.

systemd:

$ time ./run.sh 
Size: 5000000
Count: 5000000

real    0m1.961s
user    0m1.615s
sys 0m0.340s

ikeymap

$ time ./run.sh 

Loop-int tests

real    0m0.780s
user    0m0.574s
sys 0m0.205s

Where the validation is such:

        int count = 0;
        buxton_hashmap_foreach(map, iter, tkey, val) {
            assert(UNHASH_KEY(tkey) == UNHASH_VALUE(val));
            count += 1;
        }
        assert(count == NO_RUNS-1);

(Also pretty much identical for systemd.)
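
My guess at what the (UN)HASH helpers in that snippet do, assuming keys and values are stored as generic pointers; these are not the confirmed buxton macros.

#include <stdint.h>

/* Assumed: pack small integers into pointer-sized keys/values and back. */
#define HASH_KEY(x)      ((void *)(uintptr_t)(x))
#define HASH_VALUE(x)    ((void *)(uintptr_t)(x))
#define UNHASH_KEY(k)    ((uint32_t)(uintptr_t)(k))
#define UNHASH_VALUE(v)  ((uint32_t)(uintptr_t)(v))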

ikeydoherty commented 9 years ago

666 additions that commit :P I'll get the tests together and then send up a pull request, but you can take a nose now if you want while I get things together.

ikeydoherty commented 9 years ago

Just converted buxtonctl to BuxtonHashmap, and test results:

systemd hashmap:

==17500== HEAP SUMMARY:
==17500==     in use at exit: 188,416 bytes in 2 blocks
==17500==   total heap usage: 2 allocs, 0 frees, 188,416 bytes allocated
==17500== 
==17500== LEAK SUMMARY:
==17500==    definitely lost: 0 bytes in 0 blocks
==17500==    indirectly lost: 0 bytes in 0 blocks
==17500==      possibly lost: 0 bytes in 0 blocks
==17500==    still reachable: 188,416 bytes in 2 blocks
==17500==         suppressed: 0 bytes in 0 blocks
==17500== Rerun with --leak-check=full to see details of leaked memory
==17500== 
==17500== For counts of detected and suppressed errors, rerun with: -v
==17500== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)

BuxtonHashmap:

==2253== HEAP SUMMARY:
==2253==     in use at exit: 0 bytes in 0 blocks
==2253==   total heap usage: 5 allocs, 5 frees, 1,416 bytes allocated
==2253== 
==2253== All heap blocks were freed -- no leaks are possible
==2253== 
==2253== For counts of detected and suppressed errors, rerun with: -v
==2253== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)

I'll go ahead and start converting the rest in my repo.

sofar commented 9 years ago

Time to push this then?

bryteise commented 9 years ago

@sofar There is an open issue, afaik, with the current implementation not passing the full test suite. This change is a nice-to-have though, as it is to be used for search and not absolutely needed.

ikeydoherty commented 9 years ago

I shot my old implementation in the face. New one pushed to my repo as 7f5eae83e34057fffce73ac11db484b573a468c0 - working on extensive unit tests right now (currently my local non-make units are ~300 lines)

ikeydoherty commented 9 years ago

gdi but this will bum my OCD out no end.

==27818== HEAP SUMMARY:
==27818==     in use at exit: 32 bytes in 1 blocks
==27818==   total heap usage: 308 allocs, 307 frees, 446,459 bytes allocated
==27818== 
==27818== 32 bytes in 1 blocks are still reachable in loss record 1 of 1
==27818==    at 0x4C2B6A6: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==27818==    by 0x575067F: _dlerror_run (dlerror.c:141)
==27818==    by 0x57500C0: dlopen@@GLIBC_2.2.5 (dlopen.c:87)
==27818==    by 0x40AC50: backend_for_layer (backend.c:171)
==27818==    by 0x407A24: buxton_direct_get_value_for_layer (direct.c:152)
==27818==    by 0x408221: buxton_direct_create_group (direct.c:431)
==27818==    by 0x40552F: buxtond_handle_message (daemon.c:702)
==27818==    by 0x405FEC: handle_client (daemon.c:1299)
==27818==    by 0x403E89: main (main.c:363)
==27818== 
==27818== LEAK SUMMARY:
==27818==    definitely lost: 0 bytes in 0 blocks
==27818==    indirectly lost: 0 bytes in 0 blocks
==27818==      possibly lost: 0 bytes in 0 blocks
==27818==    still reachable: 32 bytes in 1 blocks
==27818==         suppressed: 0 bytes in 0 blocks
==27818== 
==27818== For counts of detected and suppressed errors, rerun with: -v
==27818== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 2 from 2)

(Also, my local gdbm version has issues internally, not related to buxton.)

dlopen leaks on my glibc :(

bryteise commented 9 years ago

Well time to submit a glibc patch =P.

ikeydoherty commented 9 years ago

Buh, running bxt_timing and getting worse results than master. :'(

get_string                    25.613us      22.264us           0

get_string                    24.914us       2.809us           0

The bottom result is the old (master) one. Sigh. Time to investigate.

ikeydoherty commented 9 years ago

The bottleneck is the hashmap resizing, like I suspected. Will optimize it.