ikeydoherty opened this issue 10 years ago
Note from a create-group operation with buxtonctl (the above was from just running it as a no-op):
==3746== at 0x4C2B6A6: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==3746== by 0x40722F: allocate_tile.part.1 (hashmap.c:94)
==3746== by 0x407380: hashmap_new (hashmap.c:189)
==3746== by 0x404099: main (main.c:112)
==3746==
==3746== 159,744 bytes in 1 blocks are still reachable in loss record 4 of 4
==3746== at 0x4C2B6A6: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==3746== by 0x4E3800F: allocate_tile.part.1 (hashmap.c:94)
==3746== by 0x4E38160: hashmap_new (hashmap.c:189)
==3746== by 0x4E39150: setup_callbacks (protocol.c:66)
==3746== by 0x4E351B4: buxton_open (lbuxton.c:97)
==3746== by 0x404A4E: main (main.c:291)
==3746==
==3746== LEAK SUMMARY:
==3746== definitely lost: 0 bytes in 0 blocks
==3746== indirectly lost: 0 bytes in 0 blocks
==3746== possibly lost: 0 bytes in 0 blocks
==3746== still reachable: 376,832 bytes in 4 blocks
==3746== suppressed: 0 bytes in 0 blocks
==3746==
It allocates quite large blocks to improve speed of use and avoid hitting allocations early (remember it is systemd's implementation, so they won't want to impact startup speed with reallocs, because they use the hashmap everywhere).
Switching over to your hashmap would be the goal though, I imagine. Is that in the TODO?
Either eventually switch over, or ensure correct alignment throughout the lifetime of this hashmap. Tbh I'd prefer a new one :) It's not in the TODO; want me to open up a quick change?
Added just in case ^^
You are welcome to push the TODO =).
Hah, you'd think I would remember I had push privs xD
So I'm currently working on my resize code, but I've gone with an aggressive cache hashmap implementation. Now obv. there are going to be doubling resizes in between, but I've checked with 5 million keys (ints) on an initial 75% capacity hashmap. As it's aggressive with cache in mind, it's slightly heavier in terms of memory but very, very fast (this will slow down a small bit when I introduce resizes though :P)
Loop-int tests
==28153==
==28153== HEAP SUMMARY:
==28153== in use at exit: 0 bytes in 0 blocks
==28153== total heap usage: 1,250,002 allocs, 1,250,002 frees, 160,000,024 bytes allocated
==28153==
==28153== All heap blocks were freed -- no leaks are possible
==28153==
==28153== For counts of detected and suppressed errors, rerun with: -v
==28153== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)
ikey@evolveos ~l/hashmap
$ time ./run.sh
Test #1: 99/110
Returned: 110
Test #2: 98/136
Returned: 136
Test #3: 99/110
Returned: 110
Loop-int tests
real 0m0.497s
user 0m0.337s
sys 0m0.151s
Lovely. How does the time compare to systemd's time?
Putting appropriate tests together :) To be fair I need to do one where systemd's hashmap is initially resized the same, and then one with my implementation doing resizes, which is why there aren't more numbers yet :D
Simple test: systemd hashmap vs ikey-hashmap.
Store 5 million integers in one loop; in the next loop, assert i == the value retrieved. Highly trivial.
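For context, a minimal sketch of that test's shape. This is not the exact main.c behind the traces below: the hashmap_new/hashmap_put/hashmap_get signatures follow the vendored systemd hashmap of that era and are assumptions, the direct_hash/direct_compare helpers are invented for the sketch, and keys start at 1 because a 0/NULL key can't be told apart from a miss.

#include <assert.h>
#include <stdint.h>

#include "hashmap.h" /* the vendored systemd hashmap */

#define NO_RUNS 5000000

/* Hash the integer key (smuggled through the void * key) directly. */
static unsigned direct_hash(const void *p)
{
        return (unsigned) (uintptr_t) p;
}

static int direct_compare(const void *a, const void *b)
{
        uintptr_t x = (uintptr_t) a, y = (uintptr_t) b;
        return x < y ? -1 : x > y ? 1 : 0;
}

int main(void)
{
        Hashmap *map = hashmap_new(direct_hash, direct_compare);

        /* Store i -> i for the test keys... */
        for (uintptr_t i = 1; i < NO_RUNS; i++)
                hashmap_put(map, (void *) i, (void *) i);

        /* ...then assert every value round-trips. */
        for (uintptr_t i = 1; i < NO_RUNS; i++)
                assert((uintptr_t) hashmap_get(map, (void *) i) == i);

        /* No hashmap_free() here, which is why the systemd run below shows
         * the blocks as "still reachable" at exit rather than lost. */
        return 0;
}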
systemd valgrind:
==1864== Command: ./a.out
==1864==
==1864==
==1864== HEAP SUMMARY:
==1864== in use at exit: 469,893,120 bytes in 15 blocks
==1864== total heap usage: 27 allocs, 12 frees, 686,720,352 bytes allocated
==1864==
==1864== 159,744 bytes in 1 blocks are still reachable in loss record 1 of 2
==1864== at 0x4C2B6A6: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==1864== by 0x4E34AE7: allocate_tile (hashmap.c:94)
==1864== by 0x4E34CBD: hashmap_new (hashmap.c:184)
==1864== by 0x4008FF: main (main.c:5)
==1864==
==1864== 469,733,376 bytes in 14 blocks are still reachable in loss record 2 of 2
==1864== at 0x4C2B6A6: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==1864== by 0x4E34AE7: allocate_tile (hashmap.c:94)
==1864== by 0x4E3574C: hashmap_put (hashmap.c:442)
==1864== by 0x400927: main (main.c:9)
==1864==
==1864== LEAK SUMMARY:
==1864== definitely lost: 0 bytes in 0 blocks
==1864== indirectly lost: 0 bytes in 0 blocks
==1864== possibly lost: 0 bytes in 0 blocks
==1864== still reachable: 469,893,120 bytes in 15 blocks
==1864== suppressed: 0 bytes in 0 blocks
==1864==
==1864== For counts of detected and suppressed errors, rerun with: -v
==1864== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)
ikeymap valgrind:
==1958== HEAP SUMMARY:
==1958== in use at exit: 0 bytes in 0 blocks
==1958== total heap usage: 22 allocs, 22 frees, 346,730,056 bytes allocated
==1958==
==1958== All heap blocks were freed -- no leaks are possible
==1958==
==1958== For counts of detected and suppressed errors, rerun with: -v
==1958== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)
systemd hashmap time (put/get):
ikey@evolveos ~/Intel/hashmap/systemd
$ time ./run.sh
real 0m1.158s
user 0m0.878s
sys 0m0.275s
ikeymap time (put/get):
ikey@evolveos ~/Intel/hashmap
$ time ./run.sh
Test #1: 99/110
Returned: 110
Test #2: 98/136
Returned: 136
Test #3: 99/110
Returned: 110
Loop-int tests
real 0m0.570s
user 0m0.343s
sys 0m0.225s
Basically mine wins :D I set them both to an initial capacity of 31 as found in systemd, though I prefer 61. Mine auto-resizes at 75%, and instead of memmove/memcpy voodoo we just reinsert into a fresh bigger map.
Tomorrow I'll do tests for strings (which will be slower due to different hash and strncmp, though to be fair I'll likely just use a simple DJB hash like systemd, meaning by nature I should win this one too :)
I also have some check-style asserts at the beginning of my own test, so mine is technically doing more get/puts for assurances, and we free that map and create a new one after =)
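For reference, the "simple DJB hash" mentioned here is the classic Bernstein string hash. A minimal sketch of the common djb2 variant, which may differ in detail from what systemd or BuxtonHashmap actually use:

/* djb2: start at 5381, then hash = hash * 33 + c for each byte. */
static unsigned long djb2_hash(const char *str)
{
        unsigned long hash = 5381;
        unsigned char c;

        while ((c = (unsigned char) *str++) != 0)
                hash = (hash << 5) + hash + c; /* hash * 33 + c */

        return hash;
}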
Looks good =).
Forgot to note: I switched to 4x, not 2x, as systemd does, to be fair. I've also now added hash/compare functions for both string and "simple" operations. To make it dead-even fair, I restricted the tests to the same 5m comparisons of ints for both (identical now), using the function pointers for comparison and hashing...
ikey@evolveos ~/Intel/hashmap
$ time ./run.sh
real 0m0.608s
user 0m0.410s
sys 0m0.197s
So, still not bad =)
Woo! Now just need to run it under perf and find out where these spend most of their time =).
Yeah :D So, as I suspected, function pointers did slow it down. Kinda unavoidable, really.
Internally, this is what my hashmap looks like:
typedef struct HKVEntry {
        HashKey key;                  /* stored key */
        HashValue value;              /* stored value */
        struct HKVEntry *next;        /* next entry in this bucket's collision chain */
        struct HKVEntry *tail;        /* tail of the chain, for O(1) appends on collision */
} HKVEntry;

struct BuxtonHashmap {
        unsigned int n_buckets;       /* number of buckets allocated up front */
        HKVEntry *entries;            /* flat bucket array, not an array of pointers */
        unsigned int size;            /* number of stored key/value pairs */
        bool resizing;                /* resize-in-progress flag */
        hash_compare_func compare;    /* key comparison callback */
        hash_create_func create;      /* hash creation callback */
};
(typedefs for clarity) - the biggest change is obviously that it's more cache aggressive: we allocate a whole bunch of buckets straight off, not an array of pointers which we later populate on demand. I went with that method initially; too slow. Tracking the tail in instances of hash collisions (as we do in BuxtonList) proves quite helpful here too =)
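To make that concrete, here is a minimal self-contained sketch of the insert path just described. It is not the actual BuxtonHashmap code: the HashKey/HashValue and callback typedefs, the occupied flag and the map_put/map_resize names are assumptions made so the snippet stands alone, while the flat bucket array, tail-tracked chains, 75% load trigger, 4x growth and reinsert-into-a-fresh-bigger-map resize mirror what is described in this thread.

#include <stdbool.h>
#include <stdlib.h>

typedef void *HashKey;    /* assumption: pointer-sized keys and values */
typedef void *HashValue;

typedef unsigned int (*hash_create_func)(HashKey key);
typedef bool (*hash_compare_func)(HashKey a, HashKey b);

typedef struct HKVEntry {
        HashKey key;
        HashValue value;
        bool occupied;            /* assumption: marks a used bucket head */
        struct HKVEntry *next;    /* collision chain */
        struct HKVEntry *tail;    /* chain tail, for O(1) appends */
} HKVEntry;

typedef struct BuxtonHashmap {
        unsigned int n_buckets;
        HKVEntry *entries;        /* flat bucket array, allocated up front */
        unsigned int size;
        bool resizing;
        hash_compare_func compare;
        hash_create_func create;
} BuxtonHashmap;

static bool map_put(BuxtonHashmap *map, HashKey key, HashValue value);

/* Grow 4x and simply reinsert every entry into a fresh, bigger array. */
static bool map_resize(BuxtonHashmap *map)
{
        BuxtonHashmap bigger = *map;

        bigger.n_buckets = map->n_buckets * 4;
        bigger.size = 0;
        bigger.resizing = true;
        bigger.entries = calloc(bigger.n_buckets, sizeof(HKVEntry));
        if (!bigger.entries)
                return false;

        for (unsigned int i = 0; i < map->n_buckets; i++) {
                HKVEntry *e = &map->entries[i];
                if (!e->occupied)
                        continue;
                for (; e; e = e->next)
                        map_put(&bigger, e->key, e->value);
        }

        /* Drop the old chain nodes (bucket heads live inside the array). */
        for (unsigned int i = 0; i < map->n_buckets; i++) {
                HKVEntry *e = map->entries[i].next;
                while (e) {
                        HKVEntry *next = e->next;
                        free(e);
                        e = next;
                }
        }
        free(map->entries);

        bigger.resizing = false;
        *map = bigger;
        return true;
}

static bool map_put(BuxtonHashmap *map, HashKey key, HashValue value)
{
        /* Resize once we cross 75% load, unless we're already resizing. */
        if (!map->resizing && map->size >= (map->n_buckets * 3) / 4)
                if (!map_resize(map))
                        return false;

        HKVEntry *head = &map->entries[map->create(key) % map->n_buckets];

        /* (Finding and updating an existing equal key via map->compare is
         * omitted for brevity.) */
        if (!head->occupied) {
                head->key = key;
                head->value = value;
                head->occupied = true;
                head->tail = head;
        } else {
                HKVEntry *e = calloc(1, sizeof(HKVEntry));
                if (!e)
                        return false;
                e->key = key;
                e->value = value;
                /* Append in O(1) using the tracked tail. */
                head->tail->next = e;
                head->tail = e;
        }
        map->size++;
        return true;
}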
I suspect our slowdowns come from traversal of shallow linked lists (the 75% "full rate" limits this quite a lot), but it's actually something we'll hit mostly on freeing the list. Leaky example (no end call to buxton_hashmap_free):
ikey@evolveos ~/Intel/hashmap
$ time ./run.sh
real 0m0.565s
user 0m0.368s
sys 0m0.196s
No frees at all (resizing):
$ time ./run.sh
real 0m0.546s
user 0m0.341s
sys 0m0.204s
The rest will come from get traversals
So I'd say that's probably the biggest bottleneck :)
Rightyo, well that isn't too surprising which is good =).
Just need to put a replacement iterator function in and this can be pushed to replace the old hashmap (with new checks, etc.)
OK 5m set/get then an iterator
systemd:
$ time ./run.sh
Size: 5000000
Count: 5000000
real 0m1.961s
user 0m1.615s
sys 0m0.340s
ikeymap:
$ time ./run.sh
Loop-int tests
real 0m0.780s
user 0m0.574s
sys 0m0.205s
Where the validation is as follows:
int count = 0;
buxton_hashmap_foreach(map, iter, tkey, val) {
        assert(UNHASH_KEY(tkey) == UNHASH_VALUE(val));
        count += 1;
}
assert(count == NO_RUNS-1);
(Also pretty much identical for systemd.)
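Continuing the sketch types from earlier, this is roughly what a foreach over that layout boils down to; it is an illustration, not the real buxton_hashmap_foreach macro, and the occupied flag is still an assumption.

/* Visit every stored pair: each occupied bucket head, then its chain. */
static void map_visit(BuxtonHashmap *map,
                      void (*visit)(HashKey key, HashValue value, void *userdata),
                      void *userdata)
{
        for (unsigned int i = 0; i < map->n_buckets; i++) {
                HKVEntry *e = &map->entries[i];
                if (!e->occupied)
                        continue;
                for (; e; e = e->next)
                        visit(e->key, e->value, userdata);
        }
}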
666 additions in that commit :P I'll get the tests together and then send up a pull request, but you can take a nose now if you want while I get things together.
Just converted buxtonctl to BuxtonHashmap; test results:
systemd hashmap:
==17500== HEAP SUMMARY:
==17500== in use at exit: 188,416 bytes in 2 blocks
==17500== total heap usage: 2 allocs, 0 frees, 188,416 bytes allocated
==17500==
==17500== LEAK SUMMARY:
==17500== definitely lost: 0 bytes in 0 blocks
==17500== indirectly lost: 0 bytes in 0 blocks
==17500== possibly lost: 0 bytes in 0 blocks
==17500== still reachable: 188,416 bytes in 2 blocks
==17500== suppressed: 0 bytes in 0 blocks
==17500== Rerun with --leak-check=full to see details of leaked memory
==17500==
==17500== For counts of detected and suppressed errors, rerun with: -v
==17500== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)
BuxtonHashmap:
==2253== HEAP SUMMARY:
==2253== in use at exit: 0 bytes in 0 blocks
==2253== total heap usage: 5 allocs, 5 frees, 1,416 bytes allocated
==2253==
==2253== All heap blocks were freed -- no leaks are possible
==2253==
==2253== For counts of detected and suppressed errors, rerun with: -v
==2253== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 2 from 2)
I'll go ahead and start converting the rest in my repo.
Time to push this then?
@sofar There is an open issue afaik with the current implementation not passing the full test suite. This change is a nice-to-have though, as it is to be used for search and isn't absolutely needed.
I shot my old implementation in the face. New one pushed to my repo as 7f5eae83e34057fffce73ac11db484b573a468c0 - working on extensive unit tests right now (currently my local non-make units are ~300 lines)
gdi but this will bum my OCD out no end.
==27818== HEAP SUMMARY:
==27818== in use at exit: 32 bytes in 1 blocks
==27818== total heap usage: 308 allocs, 307 frees, 446,459 bytes allocated
==27818==
==27818== 32 bytes in 1 blocks are still reachable in loss record 1 of 1
==27818== at 0x4C2B6A6: calloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==27818== by 0x575067F: _dlerror_run (dlerror.c:141)
==27818== by 0x57500C0: dlopen@@GLIBC_2.2.5 (dlopen.c:87)
==27818== by 0x40AC50: backend_for_layer (backend.c:171)
==27818== by 0x407A24: buxton_direct_get_value_for_layer (direct.c:152)
==27818== by 0x408221: buxton_direct_create_group (direct.c:431)
==27818== by 0x40552F: buxtond_handle_message (daemon.c:702)
==27818== by 0x405FEC: handle_client (daemon.c:1299)
==27818== by 0x403E89: main (main.c:363)
==27818==
==27818== LEAK SUMMARY:
==27818== definitely lost: 0 bytes in 0 blocks
==27818== indirectly lost: 0 bytes in 0 blocks
==27818== possibly lost: 0 bytes in 0 blocks
==27818== still reachable: 32 bytes in 1 blocks
==27818== suppressed: 0 bytes in 0 blocks
==27818==
==27818== For counts of detected and suppressed errors, rerun with: -v
==27818== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 2 from 2)
(Also my local gdbm version has issues internally, unrelated to buxton)
dlopen leaks on my glibc :(
Well time to submit a glibc patch =P.
Buh, running bxt_timing and getting worse results than master. :'(
get_string 25.613us 22.264us 0
get_string 24.914us 2.809us 0
The bottom result is the old (master) result. Sigh. Time to investigate.
Bottleneck is the hashmap resizing, like I suspected. Will optimize it..
Due to our use of hashmap, we still show a large amount of memory reachable upon program exit. Whilst some may argue it's not the biggest issue, as it's indeed not directly lost memory, it would seem to indicate that the existing hashmap implementation is too fast and loose with allocation.
I'll follow up with a solution should I find one; simply tracking this here.