rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
96.87k stars 12.51k forks source link

Tracking Issue for BTreeMap cursors #107540

Open Amanieu opened 1 year ago

Amanieu commented 1 year ago

Feature gate: #![feature(btree_cursors)]

ACP: https://github.com/rust-lang/libs-team/issues/141

This is a tracking issue for the Cursor and CursorMut types for BTreeMap.

A Cursor is like an iterator, except that it can freely seek back-and-forth, and can safely mutate the tree during iteration. This is because the lifetime of its yielded references is tied to its own lifetime, instead of just the underlying tree. A cursor either points to an element in the tree, or to a "ghost" non-element that is logically located after the last element and before the first element.

Public API

impl<K, V> BTreeMap<K, V> {
    fn lower_bound<Q>(&self, bound: Bound<&Q>) -> Cursor<'_, K, V>
    where
        K: Borrow<Q> + Ord,
        Q: Ord;
    fn lower_bound_mut<Q>(&mut self, bound: Bound<&Q>) -> CursorMut<'_, K, V>
    where
        K: Borrow<Q> + Ord,
        Q: Ord;
    fn upper_bound<Q>(&self, bound: Bound<&Q>) -> Cursor<'_, K, V>
    where
        K: Borrow<Q> + Ord,
        Q: Ord;
    fn upper_bound_mut<Q>(&mut self, bound: Bound<&Q>) -> CursorMut<'_, K, V>
    where
        K: Borrow<Q> + Ord,
        Q: Ord;
}

struct Cursor<'a, K: 'a, V: 'a>;

impl<'a, K, V> Cursor<'a, K, V> {
    fn move_next(&mut self);
    fn move_prev(&mut self);

    fn key(&self) -> Option<&'a K>;
    fn value(&self) -> Option<&'a V>;
    fn key_value(&self) -> Option<(&'a K, &'a V)>;

    fn peek_next(&self) -> Option<(&'a K, &'a V)>;
    fn peek_prev(&self) -> Option<(&'a K, &'a V)>;
}

struct CursorMut<'a, K: 'a, V: 'a>;

impl<'a, K, V> CursorMut<'a, K, V> {
    fn move_next(&mut self);
    fn move_prev(&mut self);

    fn key(&self) -> Option<&K>;
    fn value(&self) -> Option<&V>;
    fn value_mut(&mut self) -> Option<&mut V>;
    fn key_value(&self) -> Option<(&K, &V)>;
    fn key_value_mut(&self) -> Option<(&K, &mut V)>;

    unsafe fn key_mut_unchecked(&mut self) -> Option<&mut K>;

    fn peek_next(&self) -> Option<(&K, &V)>;
    fn peek_prev(&self) -> Option<(&K, &V)>;

    fn as_cursor(&self) -> Cursor<'_, K, V>;

    fn insert_after(&mut self, key: K, value: V);
    fn insert_before(&mut self, key: K, value: V);

    unsafe fn insert_after_unchecked(&mut self, key: K, value: V);
    unsafe fn insert_before_unchecked(&mut self, key: K, value: V);

    fn remove_current(&mut self) -> Option<(K, V)>;
    fn remove_current_and_move_back(&mut self) -> Option<(K, V)>;
}

Steps / History

Unresolved Questions

Yurihaia commented 8 months ago

are you allowed to bulk modify keys with key_mut_unchecked or do other seeking/mutating as long as you don't call any of the functions on CursorMut that require the key to be Ord?

I have a use case where i need to update a bunch of entries (and the values somewhat depend on the keys), and it is incredibly nice to be able to be able to do this without a lot of auxiliary storage or searching and whatever.

I assume that as long as you can guarantee that the keys end up back in sorted order by the end it should be safe to do? (obviously guaranteeing that panics don't mess with things, etc).

And if this is something that is explicitly allowed, imo it should be added to the documentation of key_mut_unchecked or whatever other unsafe thing is in the newer version.

Amanieu commented 8 months ago

Yes, that would work with the way it's implemented right now, but I'm not sure how to properly word the guarantees in the documentation.

Yurihaia commented 8 months ago

At least with the new API, maybe putting a note on each of the methods that don't care about key order that says something like

This method does not depend on the Ord implementation of K.

Obviously its already technically there when you look through the docs and see that its implemented on cursors even when K: !Ord, but it being explicit should hopefully make it clearer.

I suppose also just putting something on with_mutable_key that says something along the lines of

While keys are generally required to be unique and correctly ordered, methods on these cursors that do not depend on the Ord implementation of K are guaranteed to work correctly even when the key ordering is invalid.

I think its probably a niche enough use case that it can be a bit buried and still be fine. Just having something that documents which methods can be called without caring about key ordering is good enough.

Yurihaia commented 8 months ago

Actually, now that I look at it, the majority of the Cursor API actually does work completely without an Ord implementation (and hence when the map is in a bad order). This might be a nice property to guarantee/document, because it could certainly improve performance quite a bit in certain use cases (for example, when using the btree more because it is ordered and less as an actual map).

zacknewman commented 7 months ago

The new commit causes an illegal instruction in debug mode but not release. I'm sure below can be reduced further, but here is an example.

[zack@laptop src]$ uname -a
Linux laptop 6.7.2-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 26 Jan 2024 19:10:20 +0000 x86_64 GNU/Linux
[zack@laptop src]$ cargo version
cargo 1.77.0-nightly (7bb7b5395 2024-01-20)
[zack@laptop src]$ cargo run
   Compiling foo v0.1.0 (/home/zack/projects/foo)
    Finished dev [unoptimized + debuginfo] target(s) in 0.17s
     Running `/home/zack/projects/foo/target/debug/foo`
thread 'main' panicked at /rustc/635124704849eeead4e3a7bb6e663c5351571d93/library/alloc/src/collections/btree/navigate.rs:601:48:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Illegal instruction (core dumped)
#![feature(btree_cursors)]
extern crate alloc;
use alloc::collections::BTreeMap;
use core::{cmp::Ordering, ops::Bound};
#[derive(Eq, PartialEq)]
struct ClosedInterval {
    min: usize,
    max: usize,
}
impl PartialOrd<Self> for ClosedInterval {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for ClosedInterval {
    /// The order is defined such that sets are equal to themselves.
    /// A set is greater than all proper subsets.
    /// Otherwise we simply order by `max`.
    ///
    /// For example [2, 3] = [2, 3] < [1, 3] < [1, 4] < [0, 5] < [1, 6].
    fn cmp(&self, other: &Self) -> Ordering {
        match self.min.cmp(&other.min) {
            Ordering::Less => {
                if self.max >= other.max {
                    Ordering::Greater
                } else {
                    Ordering::Less
                }
            }
            Ordering::Equal => self.max.cmp(&other.max),
            Ordering::Greater => {
                if self.max > other.max {
                    Ordering::Greater
                } else {
                    Ordering::Less
                }
            }
        }
    }
}
fn main() {
    let mut map = BTreeMap::new();
    insert(&mut map, ClosedInterval { min: 1, max: 4 });
    insert(&mut map, ClosedInterval { min: 4, max: 6 });
}
fn insert(map: &mut BTreeMap<ClosedInterval, ()>, key: ClosedInterval) {
    let mut cursor = map.lower_bound_mut(Bound::Included(&key));
    if let Some(ge) = cursor.peek_next() {
        if *ge.0 >= key {
            return;
        }
    }
    while let Some(lt) = cursor.prev() {
        if key > *lt.0 {
            cursor.remove_prev();
        } else {
            break;
        }
    }
    // Causes an illegal instruction.
    let _x = cursor.insert_after(key, ());
}
zacknewman commented 7 months ago

Chalk me up as one of the "weirdos" that thought the old API made sense, lol. For example the new description for lower_bound is "Returns a Cursor pointing to the first gap above the given bound". So if I have a map that includes key, then lower_bound(Bound::Included(&key)) should return a cursor "pointing to the first gap above key" (emphasis added). Wouldn't this imply that cursor.peek_prev() returns key since it's you know "above" key? The below example illustrates it's cursor.peek_next() that returns key.

#![feature(btree_cursors)]
extern crate alloc;
use alloc::collections::BTreeMap;
use core::ops::Bound;
fn main() {
    let mut map = BTreeMap::new();
    map.insert(1i32, 1i32);
    map.insert(2i32, 2i32);
    map.insert(3i32, 3i32);
    // Cursor should be "above" 2, shouldn't `peek_prev` return it?
    assert!(map
        .lower_bound(Bound::Included(&2i32))
        .peek_prev()
        .map_or(false, |tup| 1i32 == *tup.0));
    // Cursor should be "above" 2, shouldn't `peek_next` return 3?
    assert!(map
        .lower_bound(Bound::Included(&2i32))
        .peek_next()
        .map_or(false, |tup| 2i32 == *tup.0));
}

Similarly upper_bound states "Returns a Cursor pointing at the last gap below the given bound". So if I have a map that includes key, then upper_bound(Bound::Included(&key)) should return a cursor "pointing to the last gap below key" (emphasis added). Wouldn't this imply that cursor.peek_next() returns key since it's you know "below" key? The below example illustrates it's cursor.peek_prev() that returns key.

#![feature(btree_cursors)]
extern crate alloc;
use alloc::collections::BTreeMap;
use core::ops::Bound;
fn main() {
    let mut map = BTreeMap::new();
    map.insert(1i32, 1i32);
    map.insert(2i32, 2i32);
    map.insert(3i32, 3i32);
    // Cursor should be "below" 2, shouldn't `peek_prev` return 1?
    assert!(map
        .upper_bound(Bound::Included(&2i32))
        .peek_prev()
        .map_or(false, |tup| 2i32 == *tup.0));
    // Cursor should be "below" 2, shouldn't `peek_next` return it?
    assert!(map
        .upper_bound(Bound::Included(&2i32))
        .peek_next()
        .map_or(false, |tup| 3i32 == *tup.0));
}

Because of my confusion, I'm struggling refactoring some code that used the old API. I have a specialized wrapper type, SupersetMap around BTreeMap whose keys are sets. The order required for these sets is such that in addition to conforming to Ord, a set is strictly greater than all proper subsets. A consequence of this is that a superset of a set, A, exists iff the first value greater than or equal to A is a superset of A. The way insert was implemented before was the following:

/// `(key, value)` is inserted iff there doesn't already
/// exist a `K` that is a superset of `key`.
/// In the event `(key, value)` will be inserted, all `(K, V)`s
/// where the `K` is a subset of `key` are first removed before
/// inserting.
pub fn insert(&mut self, key: K, value: V) -> bool {
    // Since the only way to `insert` a `(K, V)` is via this function,
    // the following is true:
    // The smallest value greater than or equal to `key` will be a superset
    // of `key` iff there exists a superset of `key`.
    // This means that we simply have to check this one value to decide
    // if `(key, value)` needs to be inserted; furthermore, in the event `(key, value)`
    // does need to be inserted, we only have to traverse backwards until
    // the beginning of `map` or a `K` is not a subset of `key`
    // removing all `(K, V)`s where the `K` is a subset before we insert `(key, value)`.
    let mut cursor = self.map.lower_bound_mut(Bound::Included(&key));
    if let Some(ge) = cursor.key() {
        if ge.is_superset(&key) {
            return false;
        }
    }
    cursor.move_prev();
    while let Some(lt) = cursor.key() {
        if key.is_proper_superset(lt) {
            cursor.remove_current_and_move_back();
        } else {
            break;
        }
    }
    // If `key` already existed in the map, then cursor
    // would begin pointing at it. Since `Set::is_superset`
    // returns true when both values are the same, we would
    // immediately return from this function. This guarantees
    // that `key` is unique before inserting.
    // Since we only move to smaller values and each time we do
    // we remove said value (or stop), the insert of `key` will occur after
    // all smaller values and before larger values ensuring the sorted
    // order of `map` is retained.
    // Note that due to the requirements of `SetOrd`, a `K` that
    // `is_proper_superset` of another `K` is also strictly greater.
    cursor.insert_after(key, value).unwrap_or_else(
        |_| panic!("there is a bug in how keys implement Ord or Set or there is a bug in BTreeMap")
    );
    true
}

I tried re-writing this using the new API, but I don't get the correct result:

pub fn insert(&mut self, key: K, value: V) -> bool {
    let mut cursor = self.map.lower_bound_mut(Bound::Included(&key));
    // Based on the perceived behavior of `lower_bound_mut` with an included bound,
    // `peek_next` should give `key` iff it is exists; otherwise the first value strictly
    // greater than it will be returned.
    if let Some(ge) = cursor.peek_next() {
        if ge.0.is_superset(&key) {
            return false;
        }
    }
    while let Some(lt) = cursor.prev() {
        if key.is_proper_superset(lt.0) {
            cursor.remove_next();
        } else {
            break;
        }
    }
    cursor.insert_after(key, value).unwrap_or_else(
        |_| panic!("there is a bug in how keys implement Ord or Set or there is a bug in BTreeMap")
    );
    true
}

I'd appreciate any assistance.

jonhoo commented 7 months ago

I've been using this for some performance optimization stuff at $WORK, and have been able to reap some significant performance improvements from it that I'm excited to be able to land when this stabilizes. A few observations from that experimentation:

  1. I really missed having this on BTreeSet
  2. The naming of insert_before and insert_after threw me for a bit because I (incorrectly) assumed that it'd be correct to change a peek_next() + insert_after() combo into a next() + insert_before() combo. Whereas in reality the latter requires a call to .prev() and the s/after/before/ does not actually make a difference for where the item is inserted. I don't have a proposal for a better name that could have avoided that misunderstanding though. It was obvious after reading the docs, I just wish the method name had given me a clue.
  3. I am observing that peek_next() + remove_next()/next() is faster than next() + remove_prev(), which was quite surprising to me. I don't know if this relates to how the CursorMut traverses internally, but I've verified it now with multiple benchmarks (admittedly internally in my codebase).
  4. Implementing your own "insert at the right position" loop with this leads to a bit of awkwardness when you arrive at the last element. Specifically, the if r.is_some() inside this:
    loop {
        match cursor.next().map(|(next_key, _)| next_key.cmp(&new_key)) {
            Some(Ordering::Less) => {
                // keep walking
                continue;
            }
            Some(Ordering::Equal) => {
                // don't overwrite
                break;
            }
            r @ Some(_) | r @ None => {
                if r.is_some() {
                    cursor.prev();
                }
                cursor.insert_before(next_key, ()).unwrap();
                break;
            }
        }
    }

    (or I can match on None as a separate branch instead). Either way, the need for that prev() is a little odd, and also comes with a performance penalty based on my measurements at least. So I wonder whether it might make sense for .prev() to do nothing the first time it's called after .next() yields None? Or something? It's just a weird-corner case of the new by-gap cursor movement (which I otherwise like!).

Amanieu commented 7 months ago

The new commit causes an illegal instruction in debug mode but not release. I'm sure below can be reduced further, but here is an example.

Should be fixed by https://github.com/rust-lang/rust/pull/120505.

zacknewman commented 7 months ago

I've been using this for some performance optimization stuff at $WORK, and have been able to reap some significant performance improvements from it that I'm excited to be able to land when this stabilizes. A few observations from that experimentation:

  1. I really missed having this on BTreeSet

Same, but it is planned for this API to be added.

  1. The naming of insert_before and insert_after threw me for a bit because I (incorrectly) assumed that it'd be correct to change a peek_next() + insert_after() combo into a next() + insert_before() combo. Whereas in reality the latter requires a call to .prev() and the s/after/before/ does not actually make a difference for where the item is inserted. I don't have a proposal for a better name that could have avoided that misunderstanding though. It was obvious after reading the docs, I just wish the method name had given me a clue.

It was not obvious to me.

  1. Implementing your own "insert at the right position" loop with this leads to a bit of awkwardness when you arrive at the last element. Specifically, the if r.is_some() inside this:
    loop {
    match cursor.next().map(|(next_key, _)| next_key.cmp(&new_key)) {
        Some(Ordering::Less) => {
            // keep walking
            continue;
        }
        Some(Ordering::Equal) => {
            // don't overwrite
            break;
        }
        r @ Some(_) | r @ None => {
            if r.is_some() {
                cursor.prev();
            }
            cursor.insert_before(next_key, ()).unwrap();
            break;
        }
    }
    }

    (or I can match on None as a separate branch instead). Either way, the need for that prev() is a little odd, and also comes with a performance penalty based on my measurements at least. So I wonder whether it might make sense for .prev() to do nothing the first time it's called after .next() yields None? Or something? It's just a weird-corner case of the new by-gap cursor movement (which I otherwise like!).

Thank you for this. That answers my plea for help. Specifically insert needed to become:

pub fn insert(&mut self, key: K, value: V) -> bool {
    let mut cursor = self.map.lower_bound_mut(Bound::Included(&key));
    if let Some(ge) = cursor.peek_next() {
        if ge.0.is_superset(&key) {
            return false;
        }
    }
    loop {
        match cursor.prev() {
            None => break,
            Some(lt) => {
                if key.is_proper_superset(lt.0) {
                    cursor.remove_next();
                } else {
                    cursor.next();
                    break;
                }
            }
        }
    }
    cursor.insert_before(key, value).unwrap_or_else(|_| panic!("there is a bug in how SetOrd is implemented or a bug in BTreeMap"));
    true
}

which is not quite as nice as it was before:

pub fn insert(&mut self, key: K, value: V) -> bool {
    let mut cursor = self.map.lower_bound_mut(Bound::Included(&key));
    if let Some(ge) = cursor.key() {
        if ge.is_superset(&key) {
            return false;
        }
    }
    cursor.move_prev();
    while let Some(lt) = cursor.key() {
        if key.is_proper_superset(lt) {
            cursor.remove_current_and_move_back();
        } else {
            break;
        }
    }
    cursor.insert_after(key, value).unwrap_or_else(
        |_| panic!("there is a bug in how keys implement Ord or Set or there is a bug in BTreeMap")
    );
    true
}
momvart commented 7 months ago

Chalk me up as one of the "weirdos" that thought the old API made sense, lol.

I have the same problem. I was also confused by how bounds are interpreted. I think the term "above" is misleading. Finally, after quite a bit of time I came up with the following rules.

let mut a = BTreeMap::new();
a.insert(1, "a");
a.insert(2, "b");
a.insert(3, "c");
a.insert(4, "d");

// a: [1, 2, 3, 4]
// The range of Included(&2) to Unbounded:      [.. , 2 , ..]
//              [2, +∞)                             ---------
let cursor = a.lower_bound(Bound::Included(&2));
// The actual cursor:                           [1 , 2 , 3 , ..]
//                                                 ^
assert_eq!(cursor.peek_prev(), Some((&1, &"a")));
assert_eq!(cursor.peek_next(), Some((&2, &"b")));
// The range of Excluded(&2) to Unbounded:      [.. , 2 , ..]
//              (2, +∞)                                 -----
let cursor = a.lower_bound(Bound::Excluded(&2));
// The actual cursor:                           [1 , 2 , 3 , ..]
//                                                     ^
assert_eq!(cursor.peek_prev(), Some((&2, &"b")));
assert_eq!(cursor.peek_next(), Some((&3, &"c")));
// Conclusion: for lower_bound, the next node of the cursor is within the range.

// The range of Unbounded to Included(&3):      [.. , 3 , ..]
//              (-∞, 3]                         ---------
let cursor = a.upper_bound(Bound::Included(&3));
// The actual cursor:                           [.. , 2 , 3 , 4]
//                                                          ^
assert_eq!(cursor.peek_prev(), Some((&3, &"c")));
assert_eq!(cursor.peek_next(), Some((&4, &"d")));
// The range of Unbounded to Excluded(&3):      [.. , 3 , ..]
//              (-∞, 3)                         -----
let cursor = a.upper_bound(Bound::Excluded(&3));
// The actual cursor:                           [.. , 2 , 3 , 4]
//                                                      ^
assert_eq!(cursor.peek_prev(), Some((&2, &"b")));
assert_eq!(cursor.peek_next(), Some((&3, &"c")));
// Conclusion: for upper_bound, the previous node of the cursor is within the range.
Amanieu commented 7 months ago

Feel free to submit a PR if you think you can make the documentation better. We can further refine the wording in that PR.

ripytide commented 7 months ago

I had a go at improving some of the documentation in #120936

GrigorenkoPV commented 1 month ago

Sorry, I did not read the entire discussion. Could anyone, please, briefly summarize why the decision was made to enable key mutation via a separate CursorMutKey and not via unsafe methods on CursorMut?

terrorfisch commented 1 month ago

@GrigorenkoPV according to the 8ee96931776c9d2912b379868ae3f7351d67281c commit message:

The ability to mutate keys is also factored out into a separate CursorMutKey type which is unsafe to create. This makes the API easier to use since it avoids duplicated versions of each method with and without key mutation.

Maybe there is more discussion somewhere?

I think one should explore a Key<'a, K>(&'a mut K) type which implements Deref<Target=K> and has a unsafe fn get_mut(&mut self) -> &mut K method.