xacrimon opened 4 years ago
Also, I should report this: the read and write guards that CHashMap uses are UB to send across threads, but they are marked Send + Sync. Actually, I should file a RustSec advisory instead, since CHashMap is unmaintained.
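For reference, the usual way to prevent that class of bug in Rust is to opt the guard type out of the auto traits. A hypothetical guard type (not CHashMap's actual code) might look like this:

```rust
use std::marker::PhantomData;

// A read guard tied to the thread that acquired the underlying lock.
// The raw-pointer marker suppresses the automatic Send/Sync impls, so
// the guard cannot accidentally be sent to or shared across threads.
pub struct ReadGuard<'a, T> {
    value: &'a T,
    _not_send_sync: PhantomData<*mut ()>,
}

impl<'a, T> ReadGuard<'a, T> {
    pub fn new(value: &'a T) -> Self {
        ReadGuard { value, _not_send_sync: PhantomData }
    }

    pub fn get(&self) -> &T {
        self.value
    }
}
```

Marking such a type `Send + Sync` by hand (as CHashMap apparently does) re-enables exactly the cross-thread moves the lock implementation cannot support.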
Ah, you mean if we start listing concurrent maps in the README on this repo? Yes, I agree that that should probably be noted!
Exactly.
I'm going to start putting together some benchmarks for different concurrent map implementations here: https://git.nebulanet.cc/Acrimon/conc-map-bench and list the differences.
Alright. I've written map adapters for a bunch of concurrent hashmaps and locked maps from std, available at the git repository above. I have only run this on my i7-7700HQ laptop, and it doesn't look good for flurry. At 8 CPUs it's about equal to a RwLock'ed std HashMap. Something feels off with it.
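A locked-map baseline adapter can be as thin as a wrapper around `RwLock<HashMap>`. This is a sketch in that spirit, not the actual adapter code from the repository:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Minimal locked-map table: one global reader-writer lock around a
// plain std HashMap, the kind of baseline the concurrent maps are
// being compared against.
struct RwLockTable(RwLock<HashMap<u64, u64>>);

impl RwLockTable {
    fn with_capacity(cap: usize) -> Self {
        RwLockTable(RwLock::new(HashMap::with_capacity(cap)))
    }

    fn get(&self, key: u64) -> Option<u64> {
        self.0.read().unwrap().get(&key).copied()
    }

    // Returns true if the key was newly inserted.
    fn insert(&self, key: u64, value: u64) -> bool {
        self.0.write().unwrap().insert(key, value).is_none()
    }

    // Returns true if the key was present and removed.
    fn remove(&self, key: u64) -> bool {
        self.0.write().unwrap().remove(&key).is_some()
    }
}
```

Reads share the lock, but every write serializes on it, which is why a well-behaved concurrent map should pull ahead as the thread count rises.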
That's really interesting — do you have plots anywhere (throughput on y, #cpus on x)? Are you using a git dependency? And crucially, are you re-using guards/`HashMapRef`s, or are you getting a new guard/ref for every operation?
@domenicquirl ^
I haven't made plots yet since I haven't made any automatic system for it. Going to make some plots soon. You can trivially replicate this, though, by cloning the repository and running `cargo bench`. I am using the released version from crates.io. I am not reusing guards, since that probably isn't what your average user will do. I do not reuse them for the contrie concurrent map crate either, and it manages much, much better than flurry. So reusing guards is clearly not the issue.
I haven't run our benchmarks recently since I'm currently in the middle of the tree bins, but if I recall correctly, the last time I checked (after the `Moved` optimization) there was a significant difference when re-using guards (on my machine).
Now, we probably should have reasonable performance when guards are not re-used, which already seemed not to be the case from our own benchmarks. Running the mentioned bustle benchmarks locally, flurry performed the worst of all maps by far, up to an order of magnitude worse. I'm surprised you saw performance comparable to `RwLock<HashMap>` if you didn't re-use guards.
Out of interest, I added an adapter loosely based on our bustle branch which holds one guard for the entire run. With this, local performance becomes very competitive with the other maps. While we can certainly debate what the default behaviour of our users will be (or, for that matter, what API we should mainly expose), I conjecture that guards are indeed an issue. I do not know how or why exactly this differs from contrie.
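The two access patterns can be illustrated with a std-only analogue, using an `RwLock` read guard to stand in for an epoch guard (flurry's actual guard API differs, so this is just the shape of the pattern):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Per-operation "pinning": a fresh guard for every lookup, mirroring
// benchmark adapters that pin a new epoch guard on each map call.
fn lookup_per_op(map: &RwLock<HashMap<u32, u32>>, keys: &[u32]) -> u32 {
    keys.iter()
        .filter_map(|k| map.read().unwrap().get(k).copied())
        .sum()
}

// Guard re-use: one acquisition amortized over the whole batch,
// mirroring the adapter that holds a single guard for the entire run.
fn lookup_reused(map: &RwLock<HashMap<u32, u32>>, keys: &[u32]) -> u32 {
    let guard = map.read().unwrap();
    keys.iter().filter_map(|k| guard.get(k).copied()).sum()
}
```

Both return the same result; the difference is purely how often the guard acquisition cost (for epoch-based reclamation: pinning and possibly triggering collection) is paid.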
Hmm, alright. Contrie seems to pin on every operation and is very competitive. I modeled these benchmarks on how the average user is likely to use the map, which is what matters and what should be optimized for. So yeah, flurry probably needs a good amount of love to improve its speed when guards aren't reused.
My guess is that we (for some reason) generate significantly more garbage than contrie. That shouldn't be the case, and I don't think there's a fundamental reason why we would, but it would explain that difference. Guards essentially become more expensive proportionally to how frequently you generate garbage.
@xacrimon I just released flurry 0.2 to crates.io, which should include the `Moved` optimization; that makes a significant difference. Probably still not enough to "fix" everything, but at least it makes headway.
That seems like a weird EBR implementation if that happens. Going to update the benchmark and run it on my desktop today, then. I am not super familiar with crossbeam-epoch, though.
Worth noting that bustle initializes the maps with a reasonably high initial capacity by default, if I understand correctly, so there are likely not that many moves occurring that would create `Moved` entries to be optimized.
My benchmark initializes with 35M keys and only about 24M entries will be inside at the peak.
Updating flurry to 0.2 helps a bit, but it is nowhere near as fast as the others.
When you say "as fast", what are you measuring? I am more interested in the scaling as the number of threads goes up than in the performance at lower core counts. The latter is just a sort of "constant" overhead that can be fixed, whereas the former indicates a contention problem that may point to an algorithmic issue with the data structure.
Contrie and the DashMap variants scale better per thread. They see close to ideal scaling, while flurry's scaling seems flatter. At least on my machine.
Interesting... I mean, it could be that Java's `ConcurrentHashMap` just doesn't scale that well :p In some sense, that was part of what I wanted to find out by porting it. It seems more likely that the garbage collection is at fault, though, and that we need to dig into why it doesn't scale for flurry.
I agree. Just to have some exemplary numbers, here's the insert-heavy workload after the version bump locally:
```
-- Contrie
25165824 operations across 1 thread(s) in 17.4047412s; time/op = 691ns
25165824 operations across 2 thread(s) in 9.1185288s; time/op = 361ns
25165824 operations across 3 thread(s) in 5.0007643s; time/op = 198ns
25165824 operations across 4 thread(s) in 3.9588848s; time/op = 157ns
25165824 operations across 5 thread(s) in 3.675621s; time/op = 145ns
25165824 operations across 6 thread(s) in 3.5683745s; time/op = 141ns
25165824 operations across 7 thread(s) in 3.5062784s; time/op = 139ns
25165824 operations across 8 thread(s) in 3.5366413s; time/op = 140ns

-- Flurry
25165824 operations across 1 thread(s) in 48.9622421s; time/op = 1.945µs
25165824 operations across 2 thread(s) in 33.5371731s; time/op = 1.332µs
25165824 operations across 3 thread(s) in 21.5174031s; time/op = 854ns
25165824 operations across 4 thread(s) in 17.6734076s; time/op = 701ns
25165824 operations across 5 thread(s) in 15.4645651s; time/op = 614ns
25165824 operations across 6 thread(s) in 14.0445763s; time/op = 557ns
25165824 operations across 7 thread(s) in 13.3347464s; time/op = 529ns
25165824 operations across 8 thread(s) in 12.8026602s; time/op = 507ns

-- DashMapV3
25165824 operations across 1 thread(s) in 3.4451529s; time/op = 136ns
25165824 operations across 2 thread(s) in 2.0076125s; time/op = 79ns
25165824 operations across 3 thread(s) in 1.5233589s; time/op = 59ns
25165824 operations across 4 thread(s) in 1.2849448s; time/op = 50ns
25165824 operations across 5 thread(s) in 1.204213s; time/op = 47ns
25165824 operations across 6 thread(s) in 1.0978293s; time/op = 42ns
25165824 operations across 7 thread(s) in 1.0250471s; time/op = 39ns
25165824 operations across 8 thread(s) in 959.9066ms; time/op = 38ns
```
And flurry with only one guard:
```
25165824 operations across 1 thread(s) in 8.1154286s; time/op = 321ns
25165824 operations across 2 thread(s) in 7.1846062s; time/op = 285ns
25165824 operations across 3 thread(s) in 8.3408941s; time/op = 330ns
25165824 operations across 4 thread(s) in 2.8672222s; time/op = 113ns
25165824 operations across 5 thread(s) in 2.152728s; time/op = 85ns
25165824 operations across 6 thread(s) in 1.9390374s; time/op = 76ns
25165824 operations across 7 thread(s) in 1.7498945s; time/op = 68ns
25165824 operations across 8 thread(s) in 1.5876595s; time/op = 62ns
```
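To put the scaling difference in numbers, the 1-thread-to-8-thread speedup implied by the time/op figures above can be read off directly:

```rust
// Speedup = single-thread time/op divided by 8-thread time/op,
// using the nanosecond figures from the tables above.
fn speedup(t1_ns: f64, t8_ns: f64) -> f64 {
    t1_ns / t8_ns
}

fn main() {
    for (name, t1, t8) in [
        ("contrie", 691.0, 140.0),
        ("flurry", 1945.0, 507.0),
        ("dashmap v3", 136.0, 38.0),
        ("flurry, one guard", 321.0, 62.0),
    ] {
        println!("{name}: {:.1}x speedup at 8 threads", speedup(t1, t8));
    }
}
```

Note that speedup relative to one thread is only one lens: dashmap shows the smallest ratio here simply because its single-thread baseline is already very fast, while flurry's per-op-guard variant has both the worst absolute times and the weakest scaling.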
Yeah, that certainly looks a lot like we're generating some unnecessary garbage that ends up hampering scalability. It also suggests that crossbeam-epoch does not handle large amounts of garbage produced across many cores very well, giving us even more of an incentive to avoid generating it in the first place. I'm curious whether the flurry model fundamentally requires more allocations (and thus garbage) than contrie, or whether it is just a matter of the code not being sufficiently carefully written.
Contrie looks like it's doing about one allocation per insert in the average case, and it uses no caching object pool either. I haven't studied the flurry code closely, but it is definitely allocating more. On another note, I discovered an issue in bustle: it instantly panics on prefill due to an erroneous assert. I also need to add prefill to the read and update workloads in my suite, since that changes performance significantly due to probing schemes and a deeper tree in contrie.
Interesting, what assert is erroneous? The code is pretty closely modeled after the libcuckoo benchmark, including the asserts.
I'm not sure I follow why you would need to prefill differently in the read/update workloads. The prefill should always be happening regardless of the underlying workload.
The assert at line 309, I believe. The default prefill is 0. If I raise it to 0.5, I get dramatically different performance. After some metrics gathering, dashmap is doing a lot more probing than without explicit prefill.
That assert should definitely be correct as long as the keys are distinct, as the documentation specifies. The whole benchmark relies on that assumption.
The default prefill should be 75% of initial capacity, not 0..?
```rust
pub fn new(threads: usize, mix: Mix) -> Self {
    Self {
        mix,
        initial_cap_log2: 25,
        prefill_f: 0.0,
        ops_f: 0.75,
        threads,
        seed: None,
    }
}
```
The benchmark runs fine if prefill is off.
Oh, you're right, I was thinking of `ops_f`, not `prefill_f`. And yes, it looks like the assertion is inverted. Will fix that shortly!
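For illustration, an inverted prefill assertion fails on the very first distinct key. This is a hypothetical reconstruction of the bug pattern, not bustle's actual code:

```rust
use std::collections::HashMap;

// Prefill with distinct keys: every insert must land in an empty slot,
// so `insert` must always report that the key was absent.
fn prefill(map: &mut HashMap<u64, u64>, keys: &[u64]) {
    for &k in keys {
        let newly_inserted = map.insert(k, 0).is_none();
        // The buggy, inverted form `assert!(!newly_inserted)` panics
        // immediately on the first key; the correct direction is:
        assert!(newly_inserted, "prefill keys must be distinct");
    }
}
```

Since the default `prefill_f` is 0.0, the prefill loop never runs and the inverted assert goes unnoticed until someone enables prefill, which matches the reported behaviour.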
Fix published in 0.4.
Tests don't lie :P Thanks for fixing it on short notice, though. Much appreciated. I will update the benchmarks. Prefill more accurately models real-world scenarios, so it only makes sense to use it.
:p
It's not clear to me that "real world" scenarios will always be prefilled. I think it makes sense to test both. A common use-case, for example, is to have a bunch of threads collectively fill up some shared map with data as they process, in which case there really is no prefill phase. That's when things like cooperative resizing (which `ConcurrentHashMap` has) become important.
One of these days when the benchmarks and implementations settle a bit, I'll also run the whole suite on a bunch of many-core machines I have hardware access to so we get some more data to go on.
Sounds good. I have lots of work left on them. I'll notify you.
I'm curious whether the flurry model fundamentally requires more allocations (and thus garbage) than contrie, or whether it is just a matter of the code not being sufficiently carefully written.
It does feel like the Java code in general doesn't care too much about garbage. There is a piece of code in `transfer` where a "list" of tree nodes is constructed; then the number of elements is checked, and if it's below the threshold this list is immediately untreeified again (which discards the tree nodes and creates regular ones). The `Moved` case would be another example.
Very true! It may be that we have to be smarter about garbage in the Rust case. Or maybe use some kind of arena/bump allocator for often-reused heap values.
Or we could fix crossbeam-epoch.
That too!
Also keep in mind that you can free memory normally, without going through crossbeam-epoch, as long as the pointer hasn't been published to a structure other threads have access to. I've seen this mistake a few times.
The only place I can imagine that being relevant is this allocation, which may be dropped (without being shared) if it turns out we are aborting the operation early (like here or here), or replacing an existing value. In all those cases, though, I believe we just drop an `Owned` directly, which doesn't go through the garbage collection.
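The same principle can be sketched with std atomics alone: an allocation that loses the publish race was never visible to other threads and can be reclaimed directly, with no deferred destruction needed. This is a hypothetical illustration, not flurry's code:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Try to publish a freshly allocated node into an empty slot. On failure,
// the node was never visible to any other thread, so it is freed
// immediately instead of being handed to an epoch-based collector
// (analogous to dropping an un-shared Owned rather than deferring).
fn try_publish(slot: &AtomicPtr<u64>, value: u64) -> bool {
    let node = Box::into_raw(Box::new(value));
    match slot.compare_exchange(
        ptr::null_mut(),
        node,
        Ordering::AcqRel,
        Ordering::Acquire,
    ) {
        Ok(_) => true,
        Err(_) => {
            // SAFETY: `node` was never published, so we hold the only
            // pointer to it and can reclaim it directly.
            unsafe { drop(Box::from_raw(node)) };
            false
        }
    }
}
```

Deferring such unpublished allocations to the collector anyway would be safe but wasteful, which is exactly the class of unnecessary garbage being discussed.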
Small status update. I'm still working on getting DashMap v4 out and also working on the benchmarks. The current pandemic has halted this work somewhat but I am still making progress.
Updated the bench and ran a test of the read-heavy preset on my i7-7700HQ, with dashmap at the unreleased experimental version. Here's a graph.
That y-axis had me pretty confused for a second. Lower-is-better is odd, but I guess it makes sense here. Is this with the newly released flurry 0.3? And I'm guessing this is still pinning a new guard for every read?
Ah, I hadn't seen flurry 0.3. Yup, a pin on every op, just like all the other libs in the bench. I will update to 0.3 and rerun today.
What mix are you running with above? Are there any writes, or is it read-only following the population of the map?
This mix is 95% read, 3% insert, 1% update and 1% remove.
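That mix can be expressed as a tiny operation picker. The sketch below uses an inline xorshift PRNG so it needs no external crates; this is not how bustle's `Mix` is implemented, just the idea:

```rust
// Minimal xorshift64 PRNG; state must be non-zero.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

#[derive(PartialEq)]
enum Op {
    Read,
    Insert,
    Update,
    Remove,
}

// Map a uniform draw in [0, 100) onto the 95/3/1/1 mix described above.
fn pick_op(state: &mut u64) -> Op {
    match xorshift(state) % 100 {
        0..=94 => Op::Read,
        95..=97 => Op::Insert,
        98 => Op::Update,
        _ => Op::Remove,
    }
}
```

Even at 95% reads, the remaining 5% of writes still generate garbage, which is part of why a pure read-only run (suggested below in the thread) is a useful control.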
I'd personally also be curious about a read-only workload. Mostly because I want to figure out exactly what it is that's causing flurry pain in these benchmarks. My guess is that it is the epoch garbage collection (and not the pinning itself), but it's hard to say without doing more digging.
Ah, I'll do a read only benchmark too and I will also collect some profiling data and send it so we can figure out what is eating time.
I think you were right. Here is a flamegraph of the entire benchmark. The mix was the same as before. The flurry part is the mix -> FlurryTable slab in the graph. All dependencies are updated.
Yeah, that certainly doesn't look great. It seems like almost all the time is spent on garbage collection, which is surprising to me since reads shouldn't generate garbage. This is why I think a read-only mix might be instructive: to see whether `crossbeam-epoch` also has this overhead when no garbage is generated (i.e., whether it's just from tracking the epochs). Also cc @domenicquirl. I wonder if it might be worth raising this as an issue on `crossbeam-epoch` with this image:
Something is up. I put together a read-only benchmark for flurry, and it seems to generate gigabytes of garbage in seconds on my machine. So you are definitely generating a lot of garbage. Tracking epochs is pretty fast; contrie also uses `crossbeam-epoch` and is within 3x of DashMap. I get pretty much the same graph as above in my read-only benchmark.
That's crazy, and should definitely not be happening. In a read-only benchmark, we shouldn't be generating any garbage. I'm currently kind of swamped, but maybe @domenicquirl can take a look?
Worth noting is that some implementations provide different features and guarantees, such as being able to share reference guards across threads, which is useful for async. I think all maps included in the benchmark should be compared on features and guarantees. This could be a short table and doesn't have to be a sizable writeup; just something to help a user choose an implementation that provides the features and guarantees they need.