jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0
1.07k stars 223 forks source link

Mutable dictionary array is wrong #1485

Closed ritchie46 closed 1 year ago

ritchie46 commented 1 year ago

It stores the hash as a key, but doesn't do any key comparison on insert. Though the probability of collisions is low, it is not zero, so for certain inputs the keys will collide and produce invalid arrays.

aldanor commented 1 year ago

@ritchie46 Any suggestions on what's the proper way to handle hash collisions?

ritchie46 commented 1 year ago

We should store the keys in the hashmap, not the hash.

aldanor commented 1 year ago

You mean the "values" (in terms of MutableDictionaryArray<K, V>)? But then we'd need to change its signature to MutableDictionary<K, V, T> and it will become pretty weird re: linking T and V together, no? (i.e. via where V: Hash + TryExtend<Option<T>> or where V: Hash + TryPush<Option<T>> depending on the context...)

aldanor commented 1 year ago

E.g., we currently have

impl<O, T> TryPush<Option<T>> for MutableUtf8Array<O>
where
    O: Offset,
    T: AsRef<str>,
{}

impl<K, M, T> TryPush<Option<T>> for MutableDictionaryArray<K, V>
where
    K: DictionaryKey,
    M: MutableArray + TryPush<Option<T>>,
    T: Hash,
{}

If you change it to

MutableDictionaryArray<K, V, T>

then you'd need some way of converting values that are being pushed (e.g. &str) into values that are stored in the hashmap (e.g. String).

impl<K, M, T> TryPush<Option<X>> for MutableDictionaryArray<K, V, T>
where
    K: DictionaryKey,
    M: MutableArray + TryPush<Option<X>>,
    T: Hash,
    X: Into<T>, // <--- ?
{}
aldanor commented 1 year ago

I guess another way is adding a trait that would map V: MutableArray -> T and implement it for all basic types, so then the signature can be kept as <K, V>; not sure it's doable though, and how it would work with more complex types other than basic scalars and strings...

And another question is this – how do you implement it so as to avoid allocating a new String on every insert into Dictionary<K, MutableUtf8Array>? (and only allocating it when a new key needs to be inserted)

aldanor commented 1 year ago

Ok, here's one other weird idea... :)

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=133ed2ac046593787b290bbc4f1d2415

(replace PinnedVec<T> with a sub-trait of MutableArray that knows how to get an item reference at given index for comparison and hashing)

However, then you have to access/index that array every time before you can hash it which will inevitably end up being slower. Perhaps you can compute hash on the value itself and only use indirect array access for equality checks. Note that while it's faster it is possible to break though (which is also currently the case) if you do something like:

pub struct MyWeirdType { s: String }
impl AsRef<str> for MyWeirdType { ... }
impl Hash for MyWeirdType { /* return 0 */ }

dict.try_push(Some(my_weird_type)); // MutableUtf8Array accepts any AsRef<str>
// => you've broken it because it hashes your type and not the underlying str
aldanor commented 1 year ago

(Link to the playground was wrong, updated ^)