kanidm / concread

Concurrently Readable Data Structures for Rust
Mozilla Public License 2.0

Consider adding `Arc`/atomic reference counting instead of eagerly cloning #101

Closed GregoryConrad closed 9 months ago

GregoryConrad commented 9 months ago

Hi! 👋

I'm coming here from https://github.com/GregoryConrad/rearch-rs/issues/11. I noticed a 6.4% increase in my project's write performance simply by wrapping the hash map's values in an `Arc` (the data in my situation is immutable). Thus, I came up with a few ideas that may be worth investigating:

  1. Always use `Arc` for keys (no exposed methods to mutate keys, iirc), and drop the `Clone` requirement on keys. This one should be pretty simple?
  2. Always use `Arc` for values. The issue here is methods like `get_mut` that need to mutate the inner value... A solution here would be to gate `get_mut` and any other `mut` methods behind an `impl<K, V: Clone> HashMap<K, V> {}` so you can internally cow the `Arc`'s inner value and then provide that copy to the user.

This would provide a couple benefits:

* No longer _require_ `Clone` for general usage

* Provide speed improvements for free

* Access all current methods without any breaking changes (`get_mut` will require `Clone` in this proposal, but the entire data structure currently requires `Clone`, so I don't think it would break anything?)

Please let me know your thoughts!

Edit: I suppose it is worth mentioning that the 6.4% speed increase clearly depends upon the `Clone` implementation in question, but I believe in almost all cases, if not all, `Arc` provides faster clones than deep copies?

Firstyear commented 9 months ago

> Hi! 👋
>
> I'm coming here from GregoryConrad/rearch-rs#11. I noticed a 6.4% increase in my project's write performance simply by wrapping the hash map's values in an `Arc` (the data in my situation is immutable). Thus, I came up with a few ideas that may be worth investigating:
>
> 1. Always use `Arc` for keys (no exposed methods to mutate keys, iirc), and drop the `Clone` requirement on keys. This one should be pretty simple?

The problem here is that if your key is something like a `u64`, then `Arc` is a lot of overhead. In your case, the reason we require `Clone` is that you can then wrap your key type in `Arc` yourself when you determine it to be needed :)

> 2. Always use `Arc` for values. The issue here is methods like `get_mut` that need to mutate the inner value... A solution here would be to gate `get_mut` and any other `mut` methods behind an `impl<K, V: Clone> HashMap<K, V> {}` so you can internally cow the `Arc`'s inner value and then provide that copy to the user.

This one is a bit more compelling, mostly because the values are more likely to be "complex" types. `HashMap<u64, u64>` is rare after all.

But it's also similar to the above: couldn't the user DIY this by wrapping their own value in `Arc` instead?

> This would provide a couple benefits:
>
> * No longer _require_ `Clone` for general usage
>
> * Provide speed improvements for free

The speed improvements here are subjective, because `Arc` has problems on higher core-count Intel CPUs. So I don't think this is as "free" as you may hope it to be.

> * Access all current methods without any breaking changes (`get_mut` will require `Clone` in this proposal, but the entire data structure currently requires `Clone`, so I don't think it would break anything?)

> Please let me know your thoughts!
>
> Edit: I suppose it is worth mentioning that the 6.4% speed increase clearly depends upon the `Clone` implementation in question, but I believe in almost all cases, if not all, `Arc` provides faster clones than deep copies?

Like anything complex - "It depends".

If you have a multi-socket or high thread-count system, then `Arc` can actually be slower than memory copies. On ARM Macs, this is inverted due to special hardware optimisations that make reference counting almost free.

A big factor here is the shape and size of your value. For example, a value like a `Uuid`, which is 16 bytes, is virtually free to copy/clone, and `Arc` would make that slower. Byte arrays also benefit from memcpy instructions.

I think the performance improvement really depends on the workload and the data, and that's why we had the interface require `Clone` like this: the user is then free to wrap their keys/values in `Arc` when it's situationally beneficial.

GregoryConrad commented 9 months ago

Thanks for the quick responses and the in-depth explanations; I learned quite a bit from you here and I really appreciate it!

I realized that in my situation, all values stored in the hash map have to be heap-allocated. Because of that, a clone of an `Arc` is almost certainly going to be faster than the fresh heap allocation for a new `Box` (even for something lightweight like a `u8`), which I confirmed through a quick benchmark on my M1 Mac. But this is actually a question I have for you: on a high core-count Intel CPU, with multiple threads trying to clone a certain heap-allocated value, would a `Box::clone` or an `Arc::clone` be faster? I don't have a compatible device to test on, so I would appreciate any insights here.

Closing this though, because in concread's case, we don't know if user keys/values are going to be boxed or similar. I think you are definitely right that users should just wrap an `Arc` around their data when the situation calls for it.

Firstyear commented 9 months ago

> Thanks for the quick responses and the in-depth explanations; I learned quite a bit from you here and I really appreciate it!

> I realized that in my situation, all values stored in the hash map have to be heap-allocated. Because of that, a clone of an `Arc` is almost certainly going to be faster than the fresh heap allocation for a new `Box` (even for something lightweight like a `u8`), which I confirmed through a quick benchmark on my M1 Mac. But this is actually a question I have for you: on a high core-count Intel CPU, with multiple threads trying to clone a certain heap-allocated value, would a `Box::clone` or an `Arc::clone` be faster? I don't have a compatible device to test on, so I would appreciate any insights here.

I'd need to sit down and test it again. It would also depend on the size of the data. `Arc` on Intel is slow because it triggers a lot of cache invalidation across the cores, whereas `Box::clone`, provided the data you are cloning from is in the Shared cache state, will be faster since you bypass cache-invalidation acknowledgements and can just write your own exclusive cache lines.

The reason M1 Macs have this performance boost is that Swift uses reference counts everywhere, so Apple added extra CPU optimisations to make atomic reference counting faster. That's why you probably see this behaviour on the M1.

> Closing this though, because in concread's case, we don't know if user keys/values are going to be boxed or similar. I think you are definitely right that users should just wrap an `Arc` around their data when the situation calls for it.

Yeah. That's the hardest part with structures like this: there are some decisions we can make that are "fast for everyone", but others that we can't make because "it varies".

So for example, the hashmap is a hash trie underneath that clones its nodes/values on the modification path, but it clones because we want to avoid invalidating cache lines that other CPUs may be using. This way we never trigger a performance loss for a reader either. We don't use `Arc`; instead we maintain free lists inside a transaction to track node liveness and the generations at which we can clean things up. This way we also sidestep the `Arc` problem in favour of some memory-capacity overhead.

Anyway, a ton of considerations went into these decisions over about 6 years :) it's never quite as simple as we hope. Lots of "it depends" and trade-offs.