jonhoo / bustle

A benchmarking harness for concurrent key-value collections
Apache License 2.0

Map feature comparison #2

Open xacrimon opened 4 years ago

xacrimon commented 4 years ago

Worth noting is that some implementations provide different features and guarantees, such as being able to share reference guards across threads, which is useful for async. I think all the maps included in the benchmark should be compared in terms of features and guarantees. This can be a short table and doesn't have to be a sizable writeup, just something to help a user choose an implementation that provides the features and guarantees they need.
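
Something along these lines, as a rough sketch (the columns and cells below are placeholders based only on what has come up so far, and would need to be verified before publishing):

| Map | Concurrency approach | Guards usable across threads? | Maintained? |
| --- | --- | --- | --- |
| flurry | epoch-based reclamation (crossbeam-epoch) | ? | yes |
| contrie | epoch-based reclamation (crossbeam-epoch) | ? | yes |
| dashmap | sharded locking | ? | yes |
| chashmap | locking | guards marked `Send + Sync`, but sending them is UB | no |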

xacrimon commented 4 years ago

Also, I should report this: the read and write guards that CHashMap uses are UB to send across threads, but they are marked Send + Sync. Well, better to file a RustSec advisory instead, since CHashMap is unmaintained.
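
The pattern is roughly this (an illustration of the kind of bug, not CHashMap's actual code):

```rust
use std::sync::{Mutex, MutexGuard};

// A wrapper around a lock guard, like the read guards a locking map
// hands out to callers.
struct ReadGuard<'a, T> {
    inner: MutexGuard<'a, T>,
}

// This is the unsound part: `MutexGuard` is `!Send` because many lock
// implementations (e.g. pthread mutexes) require the lock to be
// released on the thread that acquired it. Claiming `Send` here lets
// safe code move the guard to another thread and unlock it there,
// which is undefined behaviour.
unsafe impl<'a, T: Send> Send for ReadGuard<'a, T> {}

fn main() {
    let lock = Mutex::new(0u32);
    let guard = ReadGuard { inner: lock.lock().unwrap() };
    println!("{}", *guard.inner);
}
```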

jonhoo commented 4 years ago

Ah, you mean if we start listing concurrent maps in the README on this repo? Yes, I agree that that should probably be noted!

xacrimon commented 4 years ago

Exactly.

xacrimon commented 4 years ago

I'm going to start putting together some benchmarks for different concurrent map implementations here: https://git.nebulanet.cc/Acrimon/conc-map-bench, and list the differences.

xacrimon commented 4 years ago

Alright, I've written map adapters for a bunch of concurrent hash maps and locked maps from std, available in the git repository above. I have only run this on my i7-7700HQ laptop, and it doesn't look good for flurry. At 8 CPUs it's about equal to a RwLock'ed std HashMap. Something feels off with it.

jonhoo commented 4 years ago

That's really interesting — do you have plots anywhere (throughput on y, #cpus on x)? Are you using a git dependency? And crucially, are you re-using guards/HashMapRefs, or are you getting a new guard/ref for every operation?
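
For reference, the two usage patterns look roughly like this (assuming the flurry 0.2-era API, where guards come from flurry::epoch::pin() and every operation takes a &Guard; exact signatures may differ):

```rust
use flurry::HashMap;

fn main() {
    let map: HashMap<u64, u64> = HashMap::new();

    // Pattern A: pin a fresh guard for every operation. Each pin/unpin
    // touches the global epoch machinery.
    for i in 0..1_000u64 {
        let guard = flurry::epoch::pin();
        map.insert(i, i, &guard);
    }

    // Pattern B: pin once and reuse the guard (or a HashMapRef from
    // map.pin()) across a batch of operations, amortizing that cost.
    let guard = flurry::epoch::pin();
    for i in 0..1_000u64 {
        let _ = map.get(&i, &guard);
    }
}
```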

jonhoo commented 4 years ago

@domenicquirl ^

xacrimon commented 4 years ago

I haven't made plots yet since I don't have an automated system for that; going to make some soon. You can trivially replicate this, though, by cloning the repository and running cargo bench. I am using the released version from crates.io. I am not reusing guards, since that probably isn't what your average user will do. I don't reuse them for the contrie concurrent map crate either, and it manages much, much better than flurry, so reusing guards is clearly not the issue.

domenicquirl commented 4 years ago

I haven't run our benchmarks recently since I'm currently in the middle of the tree bins, but if I recall correctly, the last time I checked (after the Moved optimization) there was a significant difference when re-using guards (on my machine).

Now, we probably should have reasonable performance even when guards are not re-used, which already seemed not to be the case from our own benchmarks. Running the mentioned bustle benchmarks locally, flurry performed the worst of all maps by far, up to an order of magnitude worse. I'm surprised you saw performance comparable to RwLock<HashMap> if you didn't re-use guards.

Out of interest, I added an adapter loosely based on our bustle branch which holds one guard for the entire run. With this, local performance becomes very much competitive with the other maps. While we can certainly debate what the default behaviour of our users will be (or, for that matter, which API we should mainly expose), I conjecture that guards are indeed an issue. I don't know how or why exactly this differs from Contrie.
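
The shape of that adapter is roughly the following (ReusedGuardHandle is just an illustrative name, and this assumes the same guard-based flurry API as above; the real adapter implements bustle's CollectionHandle trait on top of this idea):

```rust
use std::sync::Arc;

use flurry::epoch::{self, Guard};
use flurry::HashMap;

// One of these is created per benchmark thread; it pins a single guard
// up front and reuses it for every operation in the run, instead of
// pinning per operation.
struct ReusedGuardHandle {
    map: Arc<HashMap<u64, u64>>,
    guard: Guard,
}

impl ReusedGuardHandle {
    fn new(map: Arc<HashMap<u64, u64>>) -> Self {
        Self { map, guard: epoch::pin() }
    }

    fn get(&self, key: &u64) -> bool {
        self.map.get(key, &self.guard).is_some()
    }

    fn insert(&self, key: u64) -> bool {
        self.map.insert(key, key, &self.guard).is_none()
    }
}

fn main() {
    let map = Arc::new(HashMap::new());
    let handle = ReusedGuardHandle::new(Arc::clone(&map));
    assert!(handle.insert(1));
    assert!(handle.get(&1));
}
```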

xacrimon commented 4 years ago

Hmm, alright. Contrie seems to pin on every operation and is very competitive. I modeled these benchmarks on how the average user is likely to use the map, which is what matters and what should be optimized for. So yeah, flurry probably needs a good amount of love to improve its speed when guards aren't reused.

jonhoo commented 4 years ago

My guess is that we (for some reason) generate significantly more garbage than contrie. That shouldn't be the case, and I don't think there's a fundamental reason why we would, but it would explain the difference. Guards essentially become more expensive in proportion to how frequently you generate garbage.

@xacrimon I just released flurry 0.2 to crates.io, which includes the Moved optimization that makes a significant difference. Probably still not enough to "fix" everything, but at least it makes headway.

xacrimon commented 4 years ago

That seems like a weird EBR implementation if that happens. I'm going to update the benchmark and run it on my desktop today, then. I'm not super familiar with crossbeam-epoch, though.

domenicquirl commented 4 years ago

Worth noting that bustle initializes the maps with a reasonably high initial capacity by default, if I understand correctly, so there are likely not that many moves occurring that would create Moved entries to be optimized.

xacrimon commented 4 years ago

My benchmark initializes with 35M keys and only about 24M entries will be inside at the peak.

xacrimon commented 4 years ago

Updating flurry to 0.2 helps a bit, but it is nowhere near as fast as the others.

jonhoo commented 4 years ago

When you say "as fast", what are you measuring? I am more interested in how things scale as the number of threads goes up than in the performance at lower core counts. The latter is just sort of "constant" overhead that can be fixed, whereas the former indicates a contention problem that may point to an algorithmic issue with the data structure.

xacrimon commented 4 years ago

Contrie and the DashMap variants scale better per thread. They see close to ideal scaling, while flurry's scaling seems flatter. At least on my machine.

jonhoo commented 4 years ago

Interesting... I mean, it could be that Java's ConcurrentHashMap just doesn't scale that well :p In some sense, that was part of what I wanted to find out by porting it. It seems more likely that the garbage collection is at fault, though, and that we need to dig into why it doesn't scale for flurry.

domenicquirl commented 4 years ago

I agree. Just to have some exemplary numbers, here's the insert-heavy workload after the version bump locally:

-- Contrie
25165824 operations across 1 thread(s) in 17.4047412s; time/op = 691ns
25165824 operations across 2 thread(s) in 9.1185288s; time/op = 361ns
25165824 operations across 3 thread(s) in 5.0007643s; time/op = 198ns
25165824 operations across 4 thread(s) in 3.9588848s; time/op = 157ns
25165824 operations across 5 thread(s) in 3.675621s; time/op = 145ns
25165824 operations across 6 thread(s) in 3.5683745s; time/op = 141ns
25165824 operations across 7 thread(s) in 3.5062784s; time/op = 139ns
25165824 operations across 8 thread(s) in 3.5366413s; time/op = 140ns

-- Flurry
25165824 operations across 1 thread(s) in 48.9622421s; time/op = 1.945µs
25165824 operations across 2 thread(s) in 33.5371731s; time/op = 1.332µs
25165824 operations across 3 thread(s) in 21.5174031s; time/op = 854ns
25165824 operations across 4 thread(s) in 17.6734076s; time/op = 701ns
25165824 operations across 5 thread(s) in 15.4645651s; time/op = 614ns
25165824 operations across 6 thread(s) in 14.0445763s; time/op = 557ns
25165824 operations across 7 thread(s) in 13.3347464s; time/op = 529ns
25165824 operations across 8 thread(s) in 12.8026602s; time/op = 507ns

-- DashMapV3
25165824 operations across 1 thread(s) in 3.4451529s; time/op = 136ns
25165824 operations across 2 thread(s) in 2.0076125s; time/op = 79ns
25165824 operations across 3 thread(s) in 1.5233589s; time/op = 59ns
25165824 operations across 4 thread(s) in 1.2849448s; time/op = 50ns
25165824 operations across 5 thread(s) in 1.204213s; time/op = 47ns
25165824 operations across 6 thread(s) in 1.0978293s; time/op = 42ns
25165824 operations across 7 thread(s) in 1.0250471s; time/op = 39ns
25165824 operations across 8 thread(s) in 959.9066ms; time/op = 38ns

And flurry with only one guard:

25165824 operations across 1 thread(s) in 8.1154286s; time/op = 321ns
25165824 operations across 2 thread(s) in 7.1846062s; time/op = 285ns
25165824 operations across 3 thread(s) in 8.3408941s; time/op = 330ns
25165824 operations across 4 thread(s) in 2.8672222s; time/op = 113ns
25165824 operations across 5 thread(s) in 2.152728s; time/op = 85ns
25165824 operations across 6 thread(s) in 1.9390374s; time/op = 76ns
25165824 operations across 7 thread(s) in 1.7498945s; time/op = 68ns
25165824 operations across 8 thread(s) in 1.5876595s; time/op = 62ns

jonhoo commented 4 years ago

Yeah, that certainly looks a lot like we're generating some unnecessary garbage that ends up hampering scalability. It also suggests that crossbeam-epoch does not handle large amounts of garbage produced across many cores very well, giving us even more of an incentive to avoid generating it in the first place. I'm curious whether the flurry model fundamentally requires more allocations (and thus garbage) than contrie, or whether it is just a matter of the code not being sufficiently carefully written.

xacrimon commented 4 years ago

Contrie looks like it's doing about one allocation per insert in the average case, and it uses no caching object pool either. I haven't studied the flurry code closely, but it is definitely allocating more. On another note, I discovered an issue in bustle: it instantly panics on prefill due to an erroneous assert. I also need to add prefill to the read and update workloads in my suite, since that changes performance significantly due to probing schemes and a deeper tree in contrie.

jonhoo commented 4 years ago

Interesting, what assert is erroneous? The code is pretty closely modeled after the libcuckoo benchmark, including the asserts.

I'm not sure I follow why you would need to prefill differently in the read/update workloads. The prefill should always be happening regardless of the underlying workload.

xacrimon commented 4 years ago

The assert at line 309, I believe. The default prefill is 0. If I raise it to 0.5, I get dramatically different performance. After some metrics gathering, dashmap is doing a lot more probing than without explicit prefill.

jonhoo commented 4 years ago

That assert should definitely be correct as long as the assumption that the keys are distinct (as the documentation specifies) holds. The whole benchmark relies on that assumption.

The default prefill should be 75% of initial capacity, not 0..?

xacrimon commented 4 years ago

pub fn new(threads: usize, mix: Mix) -> Self {
        Self {
            mix,
            initial_cap_log2: 25,
            // note: prefill_f defaults to 0.0 here; the 0.75 is ops_f
            prefill_f: 0.0,
            ops_f: 0.75,
            threads,
            seed: None,
        }
    }

xacrimon commented 4 years ago

The benchmark runs fine if prefill is off.

jonhoo commented 4 years ago

Oh, you're right, I was thinking of ops_f, not prefill_f. And yes, looks like the assertion is inverted. Will fix that shortly!

jonhoo commented 4 years ago

Fix published in 0.4.

xacrimon commented 4 years ago

Tests don't lie :P. Thanks for fixing it on short notice, though, much appreciated. I will update the benchmarks. Prefill more accurately models real-world scenarios, so it only makes sense to use it.

jonhoo commented 4 years ago

:p

It's not clear to me that "real world" scenarios will always be prefilled; I think it makes sense to test both. A common use case, for example, is to have a bunch of threads collectively fill up some shared map with data as they process it, in which case there really is no prefill phase. That's when things like cooperative resizing (which ConcurrentHashMap has) become important.

jonhoo commented 4 years ago

One of these days when the benchmarks and implementations settle a bit, I'll also run the whole suite on a bunch of many-core machines I have hardware access to so we get some more data to go on.

xacrimon commented 4 years ago

Sounds good. I have lots of work left on them. I'll notify you.

domenicquirl commented 4 years ago

> I'm curious whether the flurry model fundamentally requires more allocations (and thus garbage) than contrie, or whether it is just a matter of the code not being sufficiently carefully written.

It does feel like the Java code in general doesn't care too much about garbage. There is a piece of code in transfer where a "list" of tree nodes is constructed, then the number of elements is checked and, if it's below the threshold, this list is immediately untreeified again (which discards the tree nodes and creates regular ones). The Moved case would be another example.

jonhoo commented 4 years ago

Very true! It may be that we have to be smarter about garbage in the Rust case. Or maybe use some kind of arena/bump allocator for often-reused heap values.

xacrimon commented 4 years ago

Or we could fix crossbeam-epoch.

jonhoo commented 4 years ago

That too!

xacrimon commented 4 years ago

Also keep in mind that you can free memory normally, without going through crossbeam-epoch, as long as the pointer hasn't been published to a structure other threads have access to. I've seen this mistake a few times.

jonhoo commented 4 years ago

The only place I can imagine that being relevant is this allocation, which may be dropped (without being shared) if it turns out we are aborting the operation early (like here or here) or replacing an existing value. In all those cases, though, I believe we just drop an Owned directly, which doesn't go through the garbage collection.
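
For anyone following along, the distinction looks roughly like this (a minimal crossbeam-epoch sketch, not flurry's actual code):

```rust
use std::sync::atomic::Ordering;

use crossbeam_epoch::{self as epoch, Atomic, Owned};

fn example(slot: &Atomic<u64>) {
    let guard = epoch::pin();

    // An allocation that was never published to other threads can just
    // be dropped as an `Owned`; no epoch garbage is created.
    let unpublished = Owned::new(42u64);
    drop(unpublished);

    // Once a pointer has been stored somewhere other threads can see,
    // the displaced value must be retired through the guard instead of
    // being freed directly.
    let published = Owned::new(7u64).into_shared(&guard);
    let old = slot.swap(published, Ordering::AcqRel, &guard);
    if !old.is_null() {
        // Deferred: freed only once no pinned thread can still see it.
        unsafe { guard.defer_destroy(old) };
    }
}

fn main() {
    let slot = Atomic::new(1u64);
    example(&slot);
}
```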

xacrimon commented 4 years ago

Small status update. I'm still working on getting DashMap v4 out and also working on the benchmarks. The current pandemic has halted this work somewhat but I am still making progress.

xacrimon commented 4 years ago

Updated the bench and ran a test of the read-heavy preset on my i7-7700HQ, with dashmap being the unreleased experimental version. Here's a graph.

jonhoo commented 4 years ago

That y-axis had me pretty confused for a second. "Lower is better" is odd, but I guess it makes sense here. Is this with the newly released flurry 0.3? And I'm guessing this is still pinning a new guard for every read?

xacrimon commented 4 years ago

Ah, I hadn't seen flurry 0.3. Yup, a pin on every op, just like all the other libs in the bench. I will update to 0.3 and rerun today.

jonhoo commented 4 years ago

What mix are you running with above? Are there any writes, or is it read-only following the population of the map?

xacrimon commented 4 years ago

This mix is 95% read, 3% insert, 1% update and 1% remove.
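
In bustle terms that is roughly the following, assuming Mix still exposes the per-operation percentages as public fields (worth double-checking against the current API):

```rust
use bustle::Mix;

fn main() {
    // 95% reads, 3% inserts, 1% updates, 1% removes; the fields are
    // percentages and should sum to 100.
    let mix = Mix {
        read: 95,
        insert: 3,
        update: 1,
        remove: 1,
        upsert: 0,
    };
    let _ = mix;
}
```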

jonhoo commented 4 years ago

I'd personally also be curious about a read-only workload, mostly because I want to figure out exactly what it is that's causing flurry pain in these benchmarks. My guess is that it's the epoch garbage collection (and not the pinning itself), but it's hard to say without doing more digging.

xacrimon commented 4 years ago

Ah, I'll do a read-only benchmark too, and I'll also collect some profiling data and send it over so we can figure out what is eating the time.

xacrimon commented 4 years ago

I think you were right. Here is a flamegraph of the entire benchmark; the mix was the same as before. The flurry part is the mix -> FlurryTable slab in the graph. All dependencies are updated.

jonhoo commented 4 years ago

Yeah, that certainly doesn't look great. It seems like almost all the time is spent on garbage collection, which is surprising to me since reads shouldn't generate garbage. This is why I think a read-only mix might be instructive: to see whether crossbeam-epoch also has this overhead when no garbage is generated (i.e., whether it's just from tracking the epochs). Also cc @domenicquirl. I wonder if it might be worth raising this as an issue on crossbeam-epoch with this image: (screenshot: 2020-04-20-092109_1920x1080_scrot)

xacrimon commented 4 years ago

Something is up. I put together a read-only benchmark for flurry, and it seems to generate gigabytes of garbage in seconds on my machine, so you are definitely generating a lot of garbage. Tracking epochs is pretty fast; contrie also uses crossbeam-epoch and is within 3x of DashMap. I get pretty much the same graph as above in my read-only benchmark.

jonhoo commented 4 years ago

That's crazy, and should definitely not be happening. In a read-only benchmark, we shouldn't be generating any garbage. I'm currently kind of swamped, but maybe @domenicquirl can take a look?