Closed aldanor closed 10 months ago
Patch coverage: 71.71% and project coverage change: -0.06% :warning:
Comparison is base (b2017d7) 83.08% compared to head (1d8afd9) 83.02%.
(Don't merge yet, I'll add a few very quick tests)
Nice! Thanks for the quick turnaround. We might also explore making the index type generic, so that we can support all unsigned indexes and get better cache coherence.
But that's definitely not a blocker for this PR.
We might also explore making the index type generic, so that we can support all unsigned indexes and get better cache coherence.
As in <K, K>? (It will fail otherwise anyway upon key conversion.) Or, hold on, even <K, ()> (i.e. technically, a HashSet), because keys and values are equal anyway...
You would need to convert it back and forth to usize though to get the actual array index; I'm not quite sure whether cache coherence offsets the potential extra branching (might quickly check though).
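To make the back-and-forth conversion concrete, here is a minimal sketch assuming a hypothetical `Key` trait (a stand-in for illustration, not arrow2's actual `DictionaryKey`). The `usize` to key direction is checked once on insertion and can fail on overflow; the key to `usize` direction is then infallible for any key that was produced that way.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for a dictionary key type (not arrow2's `DictionaryKey`).
trait Key: Copy {
    /// Checked conversion: returns `None` if `index` does not fit in the key type.
    fn try_from_usize(index: usize) -> Option<Self>;
    /// Conversion back to an array index; lossless for an unsigned key.
    fn as_usize(self) -> usize;
}

impl Key for u8 {
    fn try_from_usize(index: usize) -> Option<Self> {
        u8::try_from(index).ok()
    }

    fn as_usize(self) -> usize {
        usize::from(self)
    }
}

/// Appends `value` and returns its key, or `None` if the key type would overflow.
/// The overflow check happens before the push, so `values` is left unchanged on failure.
fn push_value<K: Key>(values: &mut Vec<String>, value: &str) -> Option<K> {
    let key = K::try_from_usize(values.len())?;
    values.push(value.to_owned());
    Some(key)
}
```

With a `u8` key, index 255 still converts while index 256 is rejected, so every key stored this way is known to map back to a valid `usize` index.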
@ritchie46 ok, using <K, ()> seems to work; there are two minor quirks though:
- In RawEntryMutBuilder::from_hash(), we cannot handle errors other than by unwrapping results, since it's a closure in which we don't control the return type. But I think we can use unsafe { key.as_usize() } there: we've already converted that key from usize previously, which implies it can be unsafely converted back?
- DictionaryKey has to derive from Hash just so that some HashMap bounds are satisfied (which is a bit dumb because we'll never actually hash it, but it doesn't matter that much since those are just ints). Tbh we might as well just derive NativeType: Hash since those are just primitives... (or just the DictionaryKey).

Benches seem to improve a bit further by ~10% (now 1.7x from original for utf8, 3.5x for u64):
dict_utf8 time: [141.11 µs 141.44 µs 141.82 µs]
change: [-11.502% -11.251% -10.984%] (p = 0.00 < 0.05)
Performance has improved.
dict_u64 time: [35.081 µs 35.169 µs 35.259 µs]
change: [-8.4965% -8.1730% -7.8369%] (p = 0.00 < 0.05)
Performance has improved.
Can clean it up a bit and push it too here if it's of interest.
(Pushed the <K, ()> commit as well to see if that makes sense)
I think it only makes sense because we fully manage MutableDictionaryArray internally, i.e. you can't directly load those internal indices from anywhere else.
Since #1559 has been merged, I'll have to rebase this on main again, ok then... 🤦
Ok, I've rebased on revert of the revert, should be ok now.
Tests were added.
Unit-value hashmap added; waiting on feedback from @ritchie46 (if I missed something and there are some gaps, we can always revert that part if needed).
Otherwise, nothing else left to do in this PR I think unless there's any particular comments, so should be good to go. (clippy error in CI is some unrelated spurious network error)
Since https://github.com/jorgecarleitao/arrow2/pull/1559 has been merged, I'll have to rebase this on main again, ok then... 🤦
@sundy-li is trigger happy. :see_no_evil: :stuck_out_tongue_winking_eye:
Thanks for doing the rewrite @aldanor. Good to go.
As suggested by @ritchie46, using hashbrown's raw-entry API (flipped <K, usize> to <usize, K> though).
Need to be very careful to not use any default map API though since it will break map consistency (like .insert(), .collect(), etc).
Benchmarks are roughly the same as in #1555, not slower, not faster; miri should be happy though.
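
The idea behind the index-keyed map can be sketched with the standard library alone. This is an illustrative assumption, not the PR's code: since `std` has no stable raw-entry API, a small hand-rolled open-addressing table stands in for hashbrown's, and `Interner`, `hash_of`, `get_or_insert` and `grow` are hypothetical names. The table stores only indices into `values`; the hash is always computed from the value, and equality is checked through the values array.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hashes the value itself; the stored index is never hashed.
fn hash_of(value: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical interner: `slots` holds only indices into `values`.
struct Interner {
    values: Vec<String>,
    /// Open addressing with linear probing; the length is a power of two.
    slots: Vec<Option<usize>>,
}

impl Interner {
    fn new() -> Self {
        Interner { values: Vec::new(), slots: vec![None; 8] }
    }

    /// Returns the index of `value`, inserting it first if it is new.
    fn get_or_insert(&mut self, value: &str) -> usize {
        // Keep the load factor at or below 1/2 so probing always finds an empty slot.
        if (self.values.len() + 1) * 2 > self.slots.len() {
            self.grow();
        }
        let mask = self.slots.len() - 1;
        let mut pos = (hash_of(value) as usize) & mask;
        loop {
            let slot = self.slots[pos];
            match slot {
                // Equality goes through the values array, not the stored index.
                Some(index) if self.values[index] == value => return index,
                Some(_) => pos = (pos + 1) & mask,
                None => {
                    let index = self.values.len();
                    self.values.push(value.to_owned());
                    self.slots[pos] = Some(index);
                    return index;
                }
            }
        }
    }

    /// Doubles the table and re-places every index by the hash of its value.
    fn grow(&mut self) {
        let new_len = self.slots.len() * 2;
        let mask = new_len - 1;
        let mut slots: Vec<Option<usize>> = vec![None; new_len];
        for (index, value) in self.values.iter().enumerate() {
            let mut pos = (hash_of(value) as usize) & mask;
            while slots[pos].is_some() {
                pos = (pos + 1) & mask;
            }
            slots[pos] = Some(index);
        }
        self.slots = slots;
    }
}
```

An ordinary keyed insert would place an entry by the hash of the index rather than the hash of the value, so later lookups by value would probe the wrong slot. That is the consistency hazard the description warns about for `.insert()` and `.collect()`.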