jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0
1.07k stars 221 forks source link

2nd (safe) rewrite of MutableDictionaryArray #1561

Closed aldanor closed 10 months ago

aldanor commented 10 months ago

As suggested by @ritchie46, using hashbrown's raw-entry API (flipped <K, usize> to <usize, K> though).

Need to be very careful to not use any default map API though since it will break map consistency (like .insert(), .collect(), etc).

Benchmarks are roughly the same as in #1555, not slower, not faster; miri should be happy though.

codecov[bot] commented 10 months ago

Codecov Report

Patch coverage: 71.71% and project coverage change: -0.06% :warning:

Comparison is base (b2017d7) 83.08% compared to head (1d8afd9) 83.02%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1561 +/- ## ========================================== - Coverage 83.08% 83.02% -0.06% ========================================== Files 389 391 +2 Lines 42640 42786 +146 ========================================== + Hits 35427 35525 +98 - Misses 7213 7261 +48 ``` | [Files Changed](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1561?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao) | Coverage Δ | | |---|---|---| | [src/array/dictionary/mod.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1561?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2FycmF5L2RpY3Rpb25hcnkvbW9kLnJz) | `95.32% <ø> (ø)` | | | [src/array/mod.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1561?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2FycmF5L21vZC5ycw==) | `79.72% <ø> (ø)` | | | [src/array/indexable.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1561?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2FycmF5L2luZGV4YWJsZS5ycw==) | `40.84% <40.84%> (ø)` | | | [src/array/dictionary/mutable.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1561?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2FycmF5L2RpY3Rpb25hcnkvbXV0YWJsZS5ycw==) | `73.28% <83.33%> (-5.86%)` | :arrow_down: | | [src/array/dictionary/value\_map.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1561?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2FycmF5L2RpY3Rpb25hcnkvdmFsdWVfbWFwLnJz) | `91.56% <91.56%> (ø)` | | | [src/compute/cast/primitive\_to.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1561?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2NvbXB1dGUvY2FzdC9wcmltaXRpdmVfdG8ucnM=) | `93.04% <100.00%> (ø)` | | ... and [5 files with indirect coverage changes](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1561/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

aldanor commented 10 months ago

(Don't merge yet, I'll add a few very quick tests)

ritchie46 commented 10 months ago

Nice! Thanks for the quick turnaround. We might also explore making the index type generic, so that we can support all unsigned indexes and get better cache coherence.

But that's definitely not a blocker for this PR.

aldanor commented 10 months ago

We might also explore making the index type generic, so that we can support all unsigned indexes and get better cache coherence.

As in <K, K>? (it will fail otherwise anyway upon key conversion). Or, hold on, even <K, ()> (i.e. technically, a HashSet), because keys and values are equal anyways...

You would need to convert is back and forth to usize though to get the actual array index, not quite sure if cache coherence offsets potential extra branching (might quickly check though).

aldanor commented 10 months ago

@ritchie46 ok, using <K, ()> seems to work, there's two minor quirks though:

Benches seem to improve a bit further by ~10%: (now 1.7x from original for utf8, 3.5x for u64)

dict_utf8               time:   [141.11 µs 141.44 µs 141.82 µs]
                        change: [-11.502% -11.251% -10.984%] (p = 0.00 < 0.05)
                        Performance has improved.

dict_u64                time:   [35.081 µs 35.169 µs 35.259 µs]
                        change: [-8.4965% -8.1730% -7.8369%] (p = 0.00 < 0.05)
                        Performance has improved.

Can clean it up a bit and push it too here if it's of interest.

aldanor commented 10 months ago

(Pushed the <K, ()> commit as well to see if that makes sense)

I think it only makes sense because we fully manage MutableDictionArray internally, i.e. you can't directly load those internal indices from anywhere else.

aldanor commented 10 months ago

Since #1559 has been merged, I'll have to rebase this on main again, ok then... 🤦

aldanor commented 10 months ago

Ok, I've rebased on revert of the revert, should be ok now.

Tests were added.

Unit-value hashmap added = waiting on feedback from @ritchie46 (if I missed something and there's some gaps we can always revert that part if needed).

Otherwise, nothing else left to do in this PR I think unless there's any particular comments, so should be good to go. (clippy error in CI is some unrelated spurious network error)

ritchie46 commented 10 months ago

Since https://github.com/jorgecarleitao/arrow2/pull/1559 has been merged, I'll have to rebase this on main again, ok then... 🤦

@sundy-li is trigger happy. :see_no_evil: :stuck_out_tongue_winking_eye:

ritchie46 commented 10 months ago

Thanks for doing the rewrite @aldanor. Good to go.