sdleffler / qp-trie-rs

An idiomatic and fast QP-trie implementation in pure Rust.
Mozilla Public License 2.0
94 stars 24 forks source link

String key ergonomics #13

Open tapeinosyne opened 7 years ago

tapeinosyne commented 7 years ago

Currently, string keys are rather unwieldy. They cannot be used directly in QP-tries, and must either be wrapped individually or funnelled through ancillary inherent methods – loose methods which exist as duplicates without a supporting trait.

Primarily, the issue arises from the requirement that keys be Borrow<[u8]>: Borrow is concerned with abstract properties such as Eq or Hash, but qp-tries manipulate keys on a strictly structural basis, byte by byte. From this stems a certain tension that can be felt most clearly in the public API, where users must rely on different inherent methods depending on the key type.

I outline here three possible paths to better key ergonomics, namely:

Of the three, I believe the first would be the best choice, as it not only guarantees ergonomic string keys but also gives us the opportunity to admit more key types. I will follow up with a PR to get a better feel for the proposal.

Introduce an AsKey trait

Shift the bound on keys fromBorrow to a public trait AsKey (or Radical, if we're feeling whimsical), which provides both a slice view – akin to Borrow itself – and a byte view; the slice view subsumes Borrow and the byte view makes BStr wrappers redundant. (Note that qp-tries already requireBreak as a non-std trait.) As an additional benefit, AsKey could later be implemented for more types expected of a radix tree – notably integers, integer vectors and arrays, and possibly tuples.

The basic idea can be illustrated thus:

pub trait AsKey {
    type Borrowed : ?Sized;

    fn as_key_slice(&self) -> &Self::Borrowed;
    fn as_nybbles(&self) -> &[u8];

    // `Break` could be merged with AsKey, if desired
}

impl AsKey for Vec<u8> {
    type Borrowed = [u8];

    fn as_key_slice(&self) -> &Self::Borrowed { self.as_slice() }
    fn as_nybbles(&self) -> &[u8] { self.as_slice() }
}

impl AsKey for String {
    type Borrowed = str;

    fn as_key_slice(&self) -> &Self::Borrowed { self.as_str() }
    fn as_nybbles(&self) -> &[u8] { self.as_bytes() }
}

// other impls: &str, &[u8], arrays

impl<K, V> Trie<K, V> where K : AsKey {
    fn get<'a, Q>(&self, key : &'a Q) -> Option<&V>
    where K : Borrow<Q>
        , Q : Borrow<K::Borrowed> + ?Sized
    {
     //…
    }
}

The main disadvantage, here, is that coherence rules prevent us from providing much in the way of blanket impls: we can cover concrete types from the standard library, and that is about it. Nevertheless such impls should suffice, as users can rely on conversion traits and deref coercions at the call site.

(In order to provide blanket impls, we would need to parametrize the trait, turning it into AsKey<Borrowed>, and track Borrowed with PhantomData internally. That should allow us to implement the trait for K : Borrow<[u8]> and K : Borrow<str> without overlap, by parametrizing each impl with its respective borrowed form. Not complicated, just… cluttered.)

Make keys AsRef<[u8]>

Relax the bounds from Borrow<[u8]> to AsRef<[u8]>, allowing tries to accept string keys without further changes. This is by and large analogous to what is done now, internally, as we only ever work with byte slices. Other byte-based tries published on Crates.io (fst comes to mind) also admit AsRef keys.

To justify the transition, I will note that Borrow is meant to enforce abstract properties, forming a contract between things that – be they owned or borrowed – hash the same way, remain equal under Eq, have the same order under Ord, and so on. However, QP-tries need only rely on structural equality, i.e. they need only know the actual bytes that form a key, and nothing beyond that. From this perspective, AsRef<[u8]> is better tailored to our needs than Borrow, as its purpose is to offer a cheap reference-to-reference conversion to the parameter, and a reference to [u8] can be little else than an agnostic byte view.

Having said that, the fact that trie keys would not follow Eq may prove confusing, and the new bounds should be clearly documented with explanations and examples.

Wrap the trie, not the keys

Given that keys are homogeneous, there is little benefit in wrapping them individually and using ad-hoc methods to reduce boilerplate. Rather, we can introduce a newtype StringTrie<V>, wrapping a plain qp-trie over Borrow<[u8]> keys. Trie methods are moved to a trait, which StringTrie can implement by delegation to the inner trie – with the exception of prefix lookup, where StringTrie requires index validation.

There are degrees of sophistication to this approach and how generic it should be, but the resulting interface should allow users to forgo wrappers after trie instantiation:

pub struct StringTrie<V> { trie : Trie<Vec<u8>, V> };

pub trait QPTrie<V> {
    type Key : ?Sized;

    fn get<'a, Q>(&self, key : &'a Q) -> Option<&V>
    where K : Borrow<Q>
        , Q : Borrow<Self::Key> + ?Sized;

    // …
}

impl<K, V> QPTrie<V> for Trie<K, V> where K : Borrow<[u8]> {
    type Key = [u8];

    // …
}

impl<V> QPTrie<V> for StringTrie<V> {
    type Key = str;

    fn get<'a, Q>(&self, key : &'a Q) -> Option<&V>
    where K : Borrow<Q>
        , Q : Borrow<Self::Key> + ?Sized
    {
        self.trie.get(key.borrow().as_bytes())
    }

    // …
}

This is the least impactful approach, in that it does not ultimately alter bounds, but merely lifts the newtype to the outer structure of the trie. However, it does not address the issues underlying string keys.

sdleffler commented 7 years ago

I like the idea of a StringTrie, but I prefer the AsKey method. I don't like AsRef<[u8]> so much because it implies that you're in a way converting something to a byte slice, rather than it is to some extent inherently a byte slice and you're just borrowing it, and that doesn't seem to mesh with the way Rust prefers Borrow over AsRef for the extra implicit invariant in, say, HashMap. I would insist that AsKey be concerned with abstract properties - it should definitely and absolutely be the case that if the byte slices extracted from two keys are equal, then the keys themselves should be equal. This is the reason I originally picked Borrow. Open question: as we're switching from Borrow to AsKey, should the byte slices also be ordered the same? Does that even matter? The "ordering" in a QP-trie is nybble-based so it's not really possible to re-order based on some custom ordering. Interestingly enough, Strings are Ord via simply redirecting to the internal Vec<u8> (it's just a #[derive].)

Perhaps Break could be unified with AsKey? To a certain extent Break was a band-aid on the way that Borrow<[u8]> didn't express everything that you need for a key to be a key, in a trie. On the other hand, you only need Break for grabbing correct prefixes, and the trie works fine as a key-value map with a key which doesn't implement Break. Thoughts?

tapeinosyne commented 7 years ago

@sdleffler I would insist that AsKey be concerned with abstract properties - it should definitely and absolutely be the case that if the byte slices extracted from two keys are equal, then the keys themselves should be equal. This is the reason I originally picked Borrow.

Definitely so, although there might be a slight misunderstanding here: in the opening message, I considered equality of byte slices (“bytewise equality”) a concrete property as opposed to abstract; with abstract, I meant things like Eq. For example, two caseless strings or (I think) standard library Paths can be equal under Eq even if their byte composition is different; nonetheless qp-tries would only see unequal bytes, and associate the two keys to different entries. This really is a matter intrinsic to qp-tries – they operate as byte-based automata, not symbolic ones.

Perhaps you meant instead that keys should be agnostic to ownership, which is the mandate of Borrow? That's the case for all proposed impls, but it would be perfectly reasonable to make it a requirement for all implementors. Such a requirement can be more formally specified by retaining the Borrow relationship between the argument type at the call site and AsKey::Borrowed, as in:

pub fn get<'a, Q: ?Sized>(&self, key: &'a Q) -> Option<&V>
where
    Q: Borrow<K::Borrowed>

Open question: as we're switching from Borrow to AsKey, should the byte slices also be ordered the same? Does that even matter? The "ordering" in a QP-trie is nybble-based so it's not really possible to re-order based on some custom ordering. Interestingly enough, Strings are Ord via simply redirecting to the internal Vec (it's just a #[derive].)

It's as you say: there can be no “canonical” order over UTF-8 strings other than the lexicographic order on bytes, which is the one we already use; going beyond requires collation and locales, but even those only concern islands of codepoints (because there is no meaningful ordering relationship between symbols from different scripts, or between letters and punctuation, and so on).

Perhaps Break could be unified with AsKey? To a certain extent Break was a band-aid on the way that Borrow<[u8]> didn't express everything that you need for a key to be a key, in a trie. On the other hand, you only need Break for grabbing correct prefixes, and the trie works fine as a key-value map with a key which doesn't implement Break.

I agree that it's reasonable to isolate different capabilities to different traits, and similarly apply bounds to affected methods only. Moreover, the notion of Break may prove useful later on, if we were admit other types as keys.

sdleffler commented 7 years ago

@tapeinosyne sorry for the late response! I have to admit that it never crossed my mind when I first wrote this that the equality/ordering between keys is strictly "bytewise" - mostly because when I wrote this, I was intending to use it strictly for bytes. The idea of using it for strings and other types came later (although I now find myself wanting to use it for paths.) I think it's definitely a must that keys be agnostic to ownership in the same manner as Borrow, with respect to functions such as get and iter_prefix.

sdleffler commented 7 years ago

@tapeinosyne on the topic of Paths, I'd like to ask your opinion - do you think it's reasonable behavior to implement Break for Paths on Component boundaries?

tapeinosyne commented 6 years ago

I'd say so, certainly. (Although Path keys probably won't see much use.)

Going back to AsKey itself, I'd be happy to revisit #14 if you're up for a review.

sdleffler commented 6 years ago

Gladly! Can't promise to be the most responsive for the next month or so but if you submit a PR I'll do my best to get to it in a timely manner. You want me to take a look at it as-is or do you have changes to make first?

tapeinosyne commented 6 years ago

By all means, take your time. (Hardly unreasonable, innit, considering that it took me four months to reply "yes" above!) The PR can already be reviewed for issues of substances with AsKey; the remaining concerns should be minor (additional impls, updating docs/readme, style) and quickly addressed.

I'll update the opening message of #14 to better reflect this.

tapeinosyne commented 4 years ago

@sdleffler, would you be available for revisiting this issue? I would be happy to revise or update #14 as we like.

sdleffler commented 4 years ago

I would! There were a few bugfixes since #14 was submitted - would you mind rebasing #14? It honestly looks good enough to me now that I've finally had the time to go over it that I don't think I can argue for much in the way of revision. Let's get it rebased and then I'll check it out and probably merge it.

tapeinosyne commented 4 years ago

Certainly. I'll need to read it over myself—it's been a while since I wrote it.

tapeinosyne commented 4 years ago

A question: do we want to deprecate the BString wrapper, or baldly remove it?

sdleffler commented 4 years ago

Let's deprecate it for now. If it gets in the way of the API change, then it can go.

tapeinosyne commented 4 years ago

14 is ready for review, although documentation and tests are yet to be fully updated.

The reason I have left them midway is that, hands-on, I rather feel that AsKey may win us too little over a simpler AsRef<[u8]> + Break bound. My primary motivator for AsKey was supporting more key types, but:

All of which leaves AsKey looking surplus to requirements. How do you feel about it?

[1] Notably, the knot recursive DNS solver uses qp-tries.