someguynamedjosh / ouroboros

Easy self-referential struct generation for Rust.
Apache License 2.0
523 stars 35 forks source link

Question: Using with complex types like HashMap #83

Open nyurik opened 1 year ago

nyurik commented 1 year ago

I would like to build this type of a struct, just like I outlined in stackoverflow question. It seems Ouroboros is the closest to this, but I am still not certain if it's possible:

#[self_referencing]
struct StringIndex {
    store: Vec<String>,

    #[borrows(store)]
    #[covariant]
    index: HashMap<&'this str, usize>,
}

and implement some magical method to perform "lookup or push":

impl StringIndex {
  pub fn get_or_insert(&mut self, key: &str) -> usize {
    match self.index.entry(key) {
      Occupied(v) => v.key(),
      Vacant(v) => {
        let idx = self.store.len();
        // Create a new clone, and a borrowed ref to it
        let new_value  = key.to_string();
        let borrowed_value = &key;  // This wouldn't work in Rust
        self.store.push(new_value);
        self.index.insert(borrowed_value, idx);
        idx
      }
    }
  }
}
kpreid commented 1 year ago

The thing that ouroboros allows you to do is to keep a borrow that points to "this same object", but it's still subject to the rule that mutable access is exclusive access. Since store is borrowed perpetually, you can never get an &mut to it; doing so would be unsound (more than not possible without help — not something a valid program may ever do).

You can use an interior-mutable type as store, though. In particular, typed_arena::Arena might suffice; this compiles:

use std::collections::HashMap;
use typed_arena::Arena;

#[ouroboros::self_referencing]
struct StringIndex {
    store: Arena<String>,

    #[borrows(store)]
    #[covariant]
    index: HashMap<&'this str, usize>,
}

impl StringIndex {
    pub fn get_or_insert(&mut self, key: &str) -> usize {
        self.with_mut(|fields| -> usize {
            if let Some(&value) = fields.index.get(key) {
                value
            } else {
                let idx = fields.store.len();
                let stored_str: &str = fields.store.alloc(key.to_owned());
                fields.index.insert(stored_str, idx);
                idx
            }
        })
    }
}

However, given what you've written I imagine you also want to be able to get from usize to &str. typed_arena::Arena doesn't do that because it intends to offer mutability of the elements; what you want is an append-only list (which probably exists, but I don't know of one to recommend). You could do it by adding a second vector, but perhaps there's a better way than this:

use std::collections::HashMap;
use typed_arena::Arena;

#[ouroboros::self_referencing]
struct StringIndex {
    store: Arena<String>,

    #[borrows(store)]
    #[covariant]
    index: HashMap<&'this str, usize>,

    #[borrows(store)]
    #[covariant]
    rev_index: Vec<&'this str>,
}

impl StringIndex {
    pub fn get_or_insert(&mut self, key: &str) -> usize {
        self.with_mut(|fields| -> usize {
            if let Some(&value) = fields.index.get(key) {
                value
            } else {
                let idx = fields.store.len();
                let stored_str: &str = fields.store.alloc(key.to_owned());
                fields.rev_index.push(stored_str);
                fields.index.insert(stored_str, idx);
                idx
            }
        })
    }

    pub fn lookup_idx(&self, idx: usize) -> Option<&str> {
        self.borrow_rev_index().get(idx).copied()
    }
}
nyurik commented 1 year ago

@kpreid thank you for such an awesome write up and explanation!!! TIL :) You are absolutely correct -- my use case is an ever-growing (accumulator style) vector -- the user needs to build a vector of values (using Vec::push() only, no modifications/deletions) and use the index of those values. Once done inserting, the entire vector will be consumed. So the ideal interface to this accumulator/interner is:

// Note that this interface should support `&'static str`, `String`, or any other `T`s,
// but not a mix of them in the same container.
fn append<T>(&mut self, value: T) -> usize;
fn into(self) -> Vec<T>;

From the sounds of it, I may have to actually implement this as a separate crate with unsafe functionality - if the Vec is only growing, the non-exposed references to the vector's values shouldn't violate any Rust memory rules. So internally it would be a Vec<T> and HashMap<&T, usize>, and as long as I don't allow any kind of vector access or modification beyond push, this should be sound.

P.S. Possible optimization of the above is if the T is a &'static V -- in this case it might make sense for the internal HashMap to store T itself, and not a ref to it (avoid double redirect): HashMap<T, usize>

kpreid commented 1 year ago

the user needs to build a vector of values (using Vec::push() only, no modifications/deletions) and use the index of those values. Once done inserting, the entire vector will be consumed.

In this case, typed_arena is all you need, because it has Arena::into_vec(), and in order to get the Arena out, you can use ouroboros's generated into_heads() method. Just add this method to my first code sample (not tested):

    pub fn into_vec(self) -> Vec<String> {
        self.into_heads().store.into_vec()
    }

if the Vec is only growing, the non-exposed references to the vector's values shouldn't violate any Rust memory rules.

No, that's not reliably true: when the Vec grows, it allocates new memory and moves the existing data into it, invalidating references.

nyurik commented 1 year ago

thanks, when i tried, i realized that the ref was into Vec, not into the memory allocated for the string itself. I came up with this (without Arena): I simply duplicate the String object as is, and then implement the Drop trait to ensure that I don't do the double-free. Note that I couldn't figure out which version of the insert_or_get to call, so I had to be explicit.

mod stores {
    use std::collections::HashMap;
    use std::mem;

    #[derive(Default)]
    struct LeakyKeyHashMap<K, V>(pub HashMap<K, V>);

    impl<K, V> Drop for LeakyKeyHashMap<K, V> {
        fn drop(&mut self) {
            for k in mem::take(&mut self.0).into_keys() {
                mem::forget(k);
            }
        }
    }

    #[derive(Default)]
    pub struct Store<T> {
        keys: Vec<T>,
        index: LeakyKeyHashMap<T, usize>,
    }

    impl<T> Store<T> {
        pub fn into_vec(self) -> Vec<T> {
            self.keys
        }
    }

    impl Store<String> {
        pub fn insert_or_get(&mut self, key: String) -> usize {
            if let Some(index) = self.index.0.get(&key) {
                *index
            } else {
                let index = self.keys.len();
                let (ptr, len, cap) = (key.as_ptr(), key.len(), key.capacity());
                self.keys.push(key);
                let key_dup = unsafe { String::from_raw_parts(ptr as *mut u8, len, cap) };
                self.index.0.insert(key_dup, index);
                index
            }
        }
    }

    impl Store<&'static str> {
        pub fn insert_or_get(&mut self, key: &'static str) -> usize {
            if let Some(index) = self.index.0.get(&key) {
                *index
            } else {
                let index = self.keys.len();
                self.keys.push(key);
                self.index.0.insert(key, index);
                index
            }
        }
    }
}

fn main() {
    use stores::Store;

    let mut store = Store::default();

    let foo_idx = Store::<String>::insert_or_get(&mut store, "foo".to_string());
    Store::<String>::insert_or_get(&mut store, "bar".to_string());
    let foo_idx2 = Store::<String>::insert_or_get(&mut store, "foo".to_string());
    assert_eq!(foo_idx, foo_idx2);

    let keys = store.into_vec();
    assert_eq!(keys, vec!["foo".to_string(), "bar".to_string()]);
}
nyurik commented 1 year ago

I made a new crate with the above idea, hope it is "sound" :) https://github.com/nyurik/dup-indexer