perf: consolidate per-slot state into one AtomicUsize

This change moves all the per-slot shared state (generation, ref count, and removal state) into a single AtomicUsize. This has several advantages:

- It reduces the overall complexity of the Slot type, as it no longer depends on the complex interactions of multiple atomics. The loom tests are now much faster (which is also a nice sign of relative complexity, IMO), and the code is easier to reason about.
- All interactions with the generation will now involve a read-modify-write (RMW). Even when the generation is not being modified, we will always perform an RMW against the word containing that generation to update some other part of the state (such as the ref count or removal state). If this RMW fails because our view of the generation is stale, we'll re-acquire the state and see that the generation has changed. This ensures that the generation counter actually guards against reads with a stale generation.
- Generation ops no longer need to be sequentially consistent.
- Slots are a word smaller :)
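
To make the idea concrete, here is a minimal sketch of packing a generation, a ref count, and a removal state into one AtomicUsize and updating them with a compare-and-swap loop. This is an illustration only, not the code in this PR: the bit layout, the constants, and the names `Slot`, `lifecycle`, `update`, `try_acquire`, `mark_removed`, and `try_advance_generation` are all assumptions made for the example.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical bit layout, chosen only for this sketch:
// [ generation | ref count (14 bits) | state (2 bits) ]
const STATE_BITS: u32 = 2;
const REFS_BITS: u32 = 14;
const STATE_MASK: usize = (1 << STATE_BITS) - 1;
const REFS_SHIFT: u32 = STATE_BITS;
const REFS_MASK: usize = ((1 << REFS_BITS) - 1) << REFS_SHIFT;
const GEN_SHIFT: u32 = STATE_BITS + REFS_BITS;

const PRESENT: usize = 0;
const MARKED: usize = 1;

pub struct Slot {
    // All shared per-slot state lives in this single word.
    lifecycle: AtomicUsize,
}

impl Slot {
    pub fn new(generation: usize) -> Self {
        Slot { lifecycle: AtomicUsize::new(generation << GEN_SHIFT) }
    }

    // Every update is a CAS on the whole word, so a stale generation
    // is detected either up front or when the CAS fails and we re-check.
    fn update(&self, expected_gen: usize, f: impl Fn(usize) -> Option<usize>) -> bool {
        let mut current = self.lifecycle.load(Ordering::Acquire);
        loop {
            if current >> GEN_SHIFT != expected_gen {
                return false; // the slot was reused; our key is stale
            }
            let new = match f(current) {
                Some(new) => new,
                None => return false,
            };
            match self.lifecycle.compare_exchange_weak(
                current,
                new,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                Err(actual) => current = actual, // retry with the fresh word
            }
        }
    }

    /// Take a reference if the generation matches and the slot is present.
    pub fn try_acquire(&self, expected_gen: usize) -> bool {
        self.update(expected_gen, |word| {
            let present = (word & STATE_MASK) == PRESENT;
            let has_room = (word & REFS_MASK) != REFS_MASK;
            if present && has_room { Some(word + (1 << REFS_SHIFT)) } else { None }
        })
    }

    /// Drop a reference. The caller must hold one from `try_acquire`.
    pub fn release(&self) {
        self.lifecycle.fetch_sub(1 << REFS_SHIFT, Ordering::Release);
    }

    /// Mark the slot for removal; later `try_acquire` calls will fail.
    pub fn mark_removed(&self, expected_gen: usize) -> bool {
        self.update(expected_gen, |word| Some((word & !STATE_MASK) | MARKED))
    }

    /// Bump the generation once no references remain (wraparound is ignored here).
    pub fn try_advance_generation(&self, expected_gen: usize) -> bool {
        self.update(expected_gen, |word| {
            if (word & REFS_MASK) == 0 {
                Some(expected_gen.wrapping_add(1) << GEN_SHIFT)
            } else {
                None
            }
        })
    }

    pub fn refs(&self) -> usize {
        (self.lifecycle.load(Ordering::Acquire) & REFS_MASK) >> REFS_SHIFT
    }
}
```

In this sketch, a caller holding a key for generation 0 can no longer acquire the slot once `try_advance_generation(0)` has succeeded, because the generation check and the ref count increment happen in the same compare-and-swap. Acquire/release orderings are used instead of SeqCst, since there is only one atomic to synchronize on.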

There isn't really any noticeable performance difference between before and after. The "after" benchmarks are generally about 2-5% faster across the board, but I'm not sure whether this is really significant (even though Criterion claims it is).
