jonhoo / bustle

A benchmarking harness for concurrent key-value collections
Apache License 2.0

Map feature comparison #2

Open xacrimon opened 4 years ago

xacrimon commented 4 years ago

Worth noting is that some implementations provide different features and guarantees, such as being able to share reference guards across threads, which is useful for async. I think all the maps included in the benchmark should be compared on features and guarantees. This can be a short table and doesn't have to be a sizable writeup; just something to help a user choose an implementation that provides the features and guarantees they need.
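For illustration, something with the following shape would already be enough (columns and entries here are only placeholders; the actual values would have to be filled in per crate):

implementation | guards shareable across threads (async-friendly) | iteration guarantees | other notes
flurry         | ?                                                 | ?                    | ?
dashmap        | ?                                                 | ?                    | ?
contrie        | ?                                                 | ?                    | ?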

domenicquirl commented 4 years ago

I'm in a similar situation at the moment, I fear. I've had some forced free time due to Covid-19, but now everything is heading back online, except more chaotic. I've been following the discussion through mail, and this does look wrong. If this is an issue with us and not crossbeam, it's very likely the reason for a lot of the performance discrepancy. Depending on how much progress I make with other things during the week, I'll take a look on the weekend; I'm definitely interested in finding out where this comes from.

domenicquirl commented 4 years ago

Finally found some time to look at this. I was interested most in the "gigabytes of garbage" for a start, so I ran valgrind's DHAT on a small test with two concurrent threads reading from a pre-initialized map.

To check, I first had the test use only one guard per thread, and I got only the initial creation of the map as significant allocations (plus some allocations from the test binary), split into the fast and the slow path in HashMap::put. Reading itself (HashMap::get) did not re-allocate anything.

Then I switched to calling HashMap::guard for every get. The same allocations exist for map initialization, but a whole lot of allocations are now caused by creating and dropping guards. Raw data for the two DHAT nodes:

AP 1.1.1.1.1.1/2 (3 children) {
  Total:     4,148,144 bytes (48.26%, 30,566.74/Minstr) in 2,002 blocks (29.02%, 14.75/Minstr), avg size 2,072 bytes, avg lifetime 267,613.74 instrs (0.2% of program duration)
  At t-gmax: 4,144 bytes (2.29%) in 2 blocks (0.08%), avg size 2,072 bytes
  At t-end:  10,360 bytes (59.84%) in 5 blocks (55.56%), avg size 2,072 bytes
  Reads:     4,351,992 bytes (39.87%, 32,068.85/Minstr), 1.05/byte
  Writes:    4,276,264 bytes (31.25%, 31,510.82/Minstr), 1.03/byte
  Allocated at {
    ^1: 0x48397B3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_dhat-amd64-linux.so)
    ^2: 0x28C80B: alloc::alloc::alloc (alloc.rs:80)
    ^3: 0x28CD53: <alloc::alloc::Global as core::alloc::AllocRef>::alloc (alloc.rs:174)
    ^4: 0x28C76C: alloc::alloc::exchange_malloc (alloc.rs:268)
    ^5: 0x2864B9: new<crossbeam_epoch::sync::queue::Node<crossbeam_epoch::internal::SealedBag>> (boxed.rs:174)
    ^6: 0x2864B9: crossbeam_epoch::atomic::Owned<T>::new (atomic.rs:664)
    ^7: 0x28DB25: crossbeam_epoch::sync::queue::Queue<T>::push (queue.rs:91)
    ^8: 0x28372D: crossbeam_epoch::internal::Global::push_bag (internal.rs:269)
    #9: 0x284CE0: crossbeam_epoch::internal::Local::finalize (internal.rs:576)
    #10: 0x284762: crossbeam_epoch::internal::Local::unpin (internal.rs:514)
    #11: 0x28C6C3: <crossbeam_epoch::guard::Guard as core::ops::drop::Drop>::drop (guard.rs:423)
    #12: 0x287F79: core::ptr::drop_in_place (mod.rs:178)
  }
}

AP 1.1.1.2/2 (3 children) {
  Total:     4,212,208 bytes (49%, 31,038.81/Minstr) in 2,002 blocks (29.02%, 14.75/Minstr), avg size 2,104 bytes, avg lifetime 202,202.7 instrs (0.15% of program duration)
  At t-gmax: 6,312 bytes (3.49%) in 3 blocks (0.12%), avg size 2,104 bytes
  At t-end:  6,312 bytes (36.46%) in 3 blocks (33.33%), avg size 2,104 bytes
  Reads:     5,028,320 bytes (46.06%, 37,052.56/Minstr), 1.19/byte
  Writes:    8,968,904 bytes (65.54%, 66,089.83/Minstr), 2.13/byte
  Allocated at {
    ^1: 0x48397B3: malloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_dhat-amd64-linux.so)
    ^2: 0x28C80B: alloc::alloc::alloc (alloc.rs:80)
    ^3: 0x28CD53: <alloc::alloc::Global as core::alloc::AllocRef>::alloc (alloc.rs:174)
    ^4: 0x28C76C: alloc::alloc::exchange_malloc (alloc.rs:268)
    #5: 0x286439: new<crossbeam_epoch::internal::Local> (boxed.rs:174)
    #6: 0x286439: crossbeam_epoch::atomic::Owned<T>::new (atomic.rs:664)
    #7: 0x283D9C: crossbeam_epoch::internal::Local::register (internal.rs:388)
    #8: 0x28C1CD: crossbeam_epoch::collector::Collector::register (collector.rs:39)
    #9: 0x1F8455: flurry::map::HashMap<K,V,S>::guard (map.rs:369)
  }
} 

Note that absolute numbers are in relation to 1000 map elements, so 2000 calls to get across both threads.
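For reference, the two test variants being compared look roughly like this (a minimal sketch against flurry's public API; the key type and element count are placeholders):

use flurry::HashMap;

// Variant 1: one guard per thread, reused for every lookup. Only the
// initial creation of the map shows up as a significant allocation.
fn reads_with_shared_guard(map: &HashMap<usize, usize>) {
    let guard = map.guard();
    for key in 0..1000 {
        let _ = map.get(&key, &guard);
    }
}

// Variant 2: a fresh guard for every get. Creating and dropping these
// guards is what produces the allocation profile shown above.
fn reads_with_guard_per_get(map: &HashMap<usize, usize>) {
    for key in 0..1000 {
        let guard = map.guard();
        let _ = map.get(&key, &guard);
    }
}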

xacrimon commented 4 years ago

I might actually know why this happens. Guards may, on drop, trigger a flush of the local garbage list, and it looks like that path doesn't check whether the list is empty and swaps it regardless.

xacrimon commented 4 years ago

Sometimes it's also moved onto the global queue, which allocates a new thread-local bag of garbage.

jonhoo commented 4 years ago

Wow, that would be really bad. This smells like a bug if it allocates even if there is no garbage...

domenicquirl commented 4 years ago

I'm also a bit confused that we spend so much time/memory allocating epoch LocalHandles. I know that we have a collector tied to each map, so we can't just use epoch::pin() against the global collector's local handle (as Contrie seems to do, from a quick look at their repo). But do we need to register a new handle every time we pin? You added most of the collector safety measures, I think, Jon, so you probably have a better idea about that than I do, but it seems like a lot. At least the default handle is re-used all the time, so I don't understand what these handles would be useful for if you cannot do the same in other contexts as well.

xacrimon commented 4 years ago

Pretty sure you don't need to grab new handles.

jonhoo commented 4 years ago

At least the default handle is re-used all the time

Hmm, that doesn't seem right. LocalHandle isn't Send/Sync, so you can't really re-use it except in a thread-local. It's true that we could cache the handle in a thread local ourselves, though we'd have to be very careful to make sure it works correctly even if the user is accessing multiple maps on the same thread. The idea behind HashMap::pin was that the application can choose how they wish to cache the handles, rather than us doing it for them. That does give a Guard, not a LocalHandle though, so maybe we're missing an intermediate there?
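In other words, the intended usage looks roughly like this (a minimal sketch; how aggressively to cache the pinned reference is up to the application):

use flurry::HashMap;

fn example() {
    let map: HashMap<u32, u32> = HashMap::new();

    // Pinning on every operation pays the guard-acquisition cost each time.
    map.pin().insert(1, 1);
    map.pin().insert(2, 2);

    // Keeping the pinned reference around amortizes that cost across
    // many operations on the same thread.
    let pinned = map.pin();
    pinned.insert(3, 3);
    let _ = pinned.get(&3);
}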

xacrimon commented 4 years ago

Why are we using a custom collector anyway? I don't see a benefit compared to the global static one.

domenicquirl commented 4 years ago

It is in TLS, yeah, but that's where the epoch::pin() calls go. So we can't have just one handle per map. But accessing different maps should be covered by checking the guard passed to all the public interface methods, shouldn't it? Since locals are tied to their collector, you couldn't have guards from one local accessing a different map (@xacrimon this is also why we have different collectors, see https://github.com/jonhoo/flurry/blob/master/src/map.rs#L3541). This would mean you'd need a local per map per thread, however; that is, if we continue to hand out guards. We could hand out the local instead, but the performance impact of then using guards is so big that I still think we should aspire to have the "regular" case with guards perform reasonably well and not rely on the user to choose the correct API to be efficient.

jonhoo commented 4 years ago

Why are we using a custom collector anyway? I don't see a benefit compared to the global static one.

The biggest reason is memory reclamation. By using a collector, you can guarantee that all memory is dropped when the data structure is dropped. This also, in theory, can allow the removal of the 'static bound on keys and values, which only needs to be there when using the global collector.
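For context, the per-map collector setup amounts to roughly this crossbeam-epoch pattern (a simplified sketch, not flurry's actual internals):

use crossbeam_epoch::Collector;

fn example() {
    // Each map owns its own collector instead of sharing the global one.
    let collector = Collector::new();

    // Every thread that touches the map registers a handle with that
    // collector and pins it to get a guard for each operation.
    let handle = collector.register();
    let guard = handle.pin();
    drop(guard);

    // Because all garbage is tied to this collector, dropping the map
    // (and with it the collector and its handles) lets the remaining
    // garbage be reclaimed instead of lingering in a process-global
    // collector.
    drop(handle);
    drop(collector);
}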

But accessing different maps should be covered by checking the guard passed to all the public interface methods, shouldn't it?

Are you thinking that if the guard doesn't match, then we update the thread local? That could work... Cleanup will get complicated though, because you need to also clear that thread local, or we'll potentially hold on to remaining garbage forever. It could also lead to really weird performance profiles, where if you do lookups into two maps, one after the other, then you'd get the slow path every time.

the performance impact is so big if you then use guards, I still think we should aspire to have the "regular" case with guards perform reasonably well and not rely on the user to choose the correct API to be efficient.

I'm not sure I fully follow what you're saying here. I agree that ideally performance should be good "by default", without the user having to do anything weird. I don't quite know how we do this without either moving back to just using the global collector, or doing the aforementioned "optimistic caching" of the LocalHandle.

domenicquirl commented 4 years ago

I was only thinking as far as the check covering the case where you try to access with a guard from the wrong local. I don't think I like the idea of caching only for one map; a pattern like

map1.get(key1, guard1);
map2.get(key2, guard2);
map2.get(key3, guard3);

is common enough that this would be a bad idea for the reasons you outlined.

Is there a good way to store multiple handles in TLS dynamically, so we could store locals for each map separately for each thread? Independently, a third option would be to make HashMapRef the default (/only) way to interact with a map, which has the advantage that we could get rid of the checks altogether since we always know the guard comes from our map.

Ideally, probably crossbeam would store the thread for which a local gets registered together with the local itself, and then check if it needs to create a new one or if an existing one can be reused?

jonhoo commented 4 years ago

I was only thinking so far as that the check would cover the case where you try to access with a guard from a wrong local.

Yup, the checks would indeed catch this, but the user would probably be quite confused if their use sometimes ended up with an error due to caching.

I don't think I like the idea of caching only for one map, a pattern like ... is common enough that this would be a bad idea for the reasons you outlined.

Completely agree.

Is there a good way to store multiple handles in TLS dynamically

We'd have to store a HashMap in the thread local. Which is doable, but also a little awkward.

Independently, a third option would be to make HashMapRef the default (/only) way to interact with a map

This would get rid of the check, true, but it would not solve the problem — users would still then either have to manually keep the HashMapRef around, or they would be calling HashMap::pin over and over, resulting in the same slowdown as today.

Ideally, probably crossbeam would store the thread for which a local gets registered together with the local itself, and then check if it needs to create a new one or if an existing one can be reused?

Hmm, this would just make LocalHandle Send, but I'm not sure it would otherwise solve the issue?

I wonder what the picture looks like if a LocalHandle didn't sync empty changes. I think that could have a huge impact. A simple allocation usually doesn't matter that much.

If it turns out that it really is creating a LocalHandle that causes the issue, then my take here is that we either need to commit fully to the global collector, or do a thread-local cache, or simply tell the user that they should re-use HashMapRef if they want performance. None of these are ideal.

domenicquirl commented 4 years ago

We'd have to store a HashMap in the thread local. Which is doable, but also a little awkward.

Why would we want to have the entire map be local to each thread? Then where would the common data be?

Hmm, this would just make LocalHandle Send, but I'm not sure it would otherwise solve the issue?

I'm not thinking about moving handles or safety checks, but about registering local handles and obtaining guards. If we had a list of locals with their associated thread, then when a thread wants to obtain a new guard we could check that list to see if we already registered a local for this thread. If so, we use that local to pin and get a guard; only otherwise do we actually register a new local with the collector.

I wonder what the picture looks like if a LocalHandle didn't sync empty changes. I think that could have a huge impact. A simple allocation usually doesn't matter that much.

I agree that this likely has more performance impact. However, given that having to do this "simple allocation" for every read accounted for almost half of the program's memory usage, it's still something I would like to avoid. Especially if having a new handle for each guard() is not necessary.

jonhoo commented 4 years ago

Why would we want to have the entire map be local to each thread?

Ah, sorry, that wasn't what I meant. What I meant was that you'd store a std::collections::HashMap<SomeFlurryMapIdentifier, LocalHandle> in TLS.
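A hypothetical sketch of that TLS cache, keyed here by the map's collector address purely for illustration (the real identifier and API would need to be worked out):

use std::cell::RefCell;
use std::collections::HashMap as StdHashMap;

use crossbeam_epoch::{Collector, Guard, LocalHandle};

thread_local! {
    // One cached LocalHandle per (thread, map). LocalHandle isn't Send,
    // but it never leaves this thread, so a thread-local map is fine.
    static HANDLES: RefCell<StdHashMap<usize, LocalHandle>> =
        RefCell::new(StdHashMap::new());
}

// Hypothetical helper: pin against a cached handle, registering one with
// the map's collector only on the first access from this thread.
fn cached_pin(collector: &Collector) -> Guard {
    let id = collector as *const Collector as usize;
    HANDLES.with(|handles| {
        handles
            .borrow_mut()
            .entry(id)
            .or_insert_with(|| collector.register())
            .pin()
    })
}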

domenicquirl commented 4 years ago

Oh, ok. Then it's just the other way round from what I was thinking of, which was having something like a std::collections::HashMap<ThreadId, LocalHandle> in our map, or in the crossbeam Global.

jonhoo commented 4 years ago

Oh, no, that won't work, since LocalHandle isn't Send.

domenicquirl commented 4 years ago

Not directly. There is a list of locals in the Global; they are added when you register handles (accessed through an Arc). These are internal Local objects, not LocalHandles, and they live behind a Shared, but it should be possible to store a thread ID there. Given that, maybe this could then be used in such a way, but I'm also not familiar enough with crossbeam's implementation to actually know.

jonhoo commented 4 years ago

Oh, I see, you're proposing changing the implementation of crossbeam-epoch! I was thinking of ways to make changes in flurry specifically. May be worth raising this on the crossbeam issue tracker actually, for a "how do we do this". And specifically call out the issue around synchronizing on empty.

xacrimon commented 4 years ago

I've updated deps for the benchmark now and changed flurry to use HashMapRef handles. Since I've released a v4 dashmap release candidate, I believe the implementations should be pretty stable. So whenever you can, you've got the green light from me to run them.

domenicquirl commented 4 years ago

Did either of you actually get back to crossbeam on this? I did a bit more digging with different variations of global/local collector and guard source checks. (I even tried to put together a version using flize over crossbeam, but couldn't find a good solution for the epoch::unprotected() cases.) https://github.com/jonhoo/flurry/blob/9a443cbd393c1605eb6df302c0d7531d70eb7be6/src/map.rs#L369 (obtaining guards) is by far the biggest difference-maker in overall benchmarking results. Basically no matter the surrounding implementation, if at this spot we do epoch::pin() we are very much competitive in everything but single-thread in-/upserts. In all cases where this goes through some_collector.register(), we end up with the absurd performance difference we have seen already.
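Concretely, the two guard-acquisition paths being compared are roughly the following (sketch):

use crossbeam_epoch as epoch;

fn example() {
    // epoch::pin(): reuse the calling thread's existing Local of the
    // global collector; nothing is allocated on the hot path.
    let fast = epoch::pin();
    drop(fast);

    // some_collector.register().pin(): register a brand-new LocalHandle
    // with the map's own collector on every call, then pin it. The
    // registration is what allocates a fresh Local each time.
    let collector = epoch::Collector::new();
    let slow = collector.register().pin();
    drop(slow);
}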

domenicquirl commented 4 years ago

Also @xacrimon, I tried running the updated benchmark as well, but some dependency has a cmake build script that wouldn't run under Windows. It didn't matter for my evaluations, and I have a Linux machine I can switch to when I want to look at this more, but I wanted to mention it in case you are not yet aware.

jonhoo commented 4 years ago

@domenicquirl Interesting, flize doesn't have a way to (unsafely) get a reference when you know no-one else has one? That seems odd.