jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0
1.06k stars 221 forks source link

MutableDictionaryArray rewrite: use values stored in the array instead of the hash->hash map #1555

Closed aldanor closed 1 year ago

aldanor commented 1 year ago

@ritchie46 @jorgecarleitao Here's one way to fix the incorrect current behaviour of MutableDictionaryArray: only rely on values actually stored in the values array and don't rely on hash-hash maps (due to potential hash collisions).

This is almost a rewrite of the whole thing, outer API aside, and gets pretty evil (self-referential) at the lower level, but I believe it's pretty sound.

There's probably bits and pieces that may need to be cleaned, types and methods to be renamed, potentially some docstrings added etc (comments welcome), but I figured I'd push the current version as soon as it's working so as to figure whether something like this would be acceptable.

Bench-wise, current main (10k dict insertions):

dict_utf8               time:   [239.70 µs 241.27 µs 242.78 µs]
dict_u64                time:   [123.60 µs 123.78 µs 123.98 µs]

This branch:

dict_utf8               time:   [151.54 µs 151.75 µs 151.97 µs]
                        change: [-37.274% -36.837% -36.411%] (p = 0.00 < 0.05)
dict_u64                time:   [42.466 µs 42.522 µs 42.581 µs]
                        change: [-65.705% -65.627% -65.547%] (p = 0.00 < 0.05)

Fixes #1485 Fixes #1554

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 58.82% and project coverage change: -0.43% :warning:

Comparison is base (ba6a882) 83.34% compared to head (9291210) 82.92%. Report is 1 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1555 +/- ## ========================================== - Coverage 83.34% 82.92% -0.43% ========================================== Files 388 391 +3 Lines 42212 42750 +538 ========================================== + Hits 35183 35450 +267 - Misses 7029 7300 +271 ``` | [Files Changed](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1555?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao) | Coverage Δ | | |---|---|---| | [src/array/dictionary/mod.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1555?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2FycmF5L2RpY3Rpb25hcnkvbW9kLnJz) | `95.32% <ø> (ø)` | | | [src/array/mod.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1555?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2FycmF5L21vZC5ycw==) | `79.72% <ø> (ø)` | | | [src/array/indexable.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1555?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2FycmF5L2luZGV4YWJsZS5ycw==) | `30.98% <30.98%> (ø)` | | | [src/array/dictionary/value\_map.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1555?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2FycmF5L2RpY3Rpb25hcnkvdmFsdWVfbWFwLnJz) | `70.43% <70.43%> (ø)` | | | [src/array/dictionary/mutable.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1555?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2FycmF5L2RpY3Rpb25hcnkvbXV0YWJsZS5ycw==) | `70.49% <75.75%> (-8.65%)` | :arrow_down: | | [src/compute/cast/primitive\_to.rs](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1555?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao#diff-c3JjL2NvbXB1dGUvY2FzdC9wcmltaXRpdmVfdG8ucnM=) | `93.04% <100.00%> (ø)` | | ... and [8 files with indirect coverage changes](https://app.codecov.io/gh/jorgecarleitao/arrow2/pull/1555/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Jorge+Leitao)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

aldanor commented 1 year ago

Clippy errors: can fix in a later PR along with some deps updates (unrelated to this PR, some deprecated chrono stuff).

Miri errors: not sure can prove it to miri here that it's not UB.

sundy-li commented 1 year ago

Great PR. This could improve grouping performance by low cardinality column data in other dbms.

ritchie46 commented 1 year ago

@sundy-li you were too fast. I also wanted to review this one. Especially given that MIRI isn't happy. :/

ritchie46 commented 1 year ago

@aldanor I think we can achieve the same with a Hashmap<K, usize> where the usize gets the value in the MutableArray.

IAC This will save a lot of (undsafe) code, and make the values of the hashmap smaller as you don't have two pointers (the ref and the index), but only a single index.

We can use that something like this:

        self.inner_map
            .raw_entry_mut()
            .from_hash(hash, |hash_map_key| {
                  let index = hash_map_key.index;

                  self.get_value(index) == insert_value
                }
            })

I think we should go in this direction.

aldanor commented 1 year ago

@ritchie46 Sure, sounds like a plan. Tbh I didn't even think of the raw hashmap API since I got too used to the fact that it's nightly-only (but not for hashbrown!).

All the "Indexable" stuff is unfortunately still needed since we have no clean way of connecting "stuff we're pushing" to "stuff we're storing" elsewise.

I'll take a look at the raw hashmap approach now and report back.

aldanor commented 1 year ago

(I'll try to look into it asap so maybe don't merge the revert PR first, so we don't end up with the same code flipped back and forth 3 times...)

aldanor commented 1 year ago

@ritchie46 I think you meant HashMap<usize, K> (as opposed to HashMap<K, usize>) 🙂

But anyways, it does seem to work, I'll push PR shortly.