khonsulabs / bonsaidb

A developer-friendly document database that grows with you, written in Rust
https://bonsaidb.io/
Apache License 2.0

View generates inconsistent sort order #237

Closed cortopy closed 2 years ago

cortopy commented 2 years ago

Just trying out bonsaidb for the first time. I've been reading about it and I can't wait to use it. Thanks so much for this amazing project

However, my first attempt at using Views leads to some inconsistent behaviour.

I've copied the BlogPost example into my own crate to test things out, but this query https://github.com/khonsulabs/bonsaidb/blob/main/book/book-examples/tests/view-example-string.rs#L93

sometimes logs

Retrieved post #7$1 "New version of BonsaiDb released"
Retrieved post #7$2 "New Rust version released"
Retrieved post #1 "New version of BonsaiDb released"
Retrieved post #2 "New Rust version released"

and other times this:

Retrieved post #7$2 "New Rust version released"
Retrieved post #7$1 "New version of BonsaiDb released"
Retrieved post #2 "New Rust version released"
Retrieved post #1 "New version of BonsaiDb released"

At first I thought it might be because the example doesn't use any specific sort, but the docs say that ascending is the default. Indeed, I can see the query is built with forwards scanning.

Adding ascending or descending to the view produces the same erratic results.

I don't really have a way to reproduce this other than running the test many times.

ecton commented 2 years ago

The Sort controls the order of keys being returned. In that example, the key is the category, not the post title. When querying with a range of keys, the sort order will ensure the keys are returned in the order queried.

For example, here is some code that can be pasted into that example:

    for mapping in db.view::<BlogPostsByCategory>().query()? {
        println!("Key: {:?}, Document ID: {}", mapping.key, mapping.source.id);
    }
    for mapping in db.view::<BlogPostsByCategory>().descending().query()? {
        println!("Key: {:?}, Document ID: {}", mapping.key, mapping.source.id);
    }

The output for me is:

Key: Some("Cooking"), Document ID: 7$3
Key: Some("Rust"), Document ID: 7$2
Key: Some("Rust"), Document ID: 7$1
Key: Some("Rust"), Document ID: 7$2
Key: Some("Rust"), Document ID: 7$1
Key: Some("Cooking"), Document ID: 7$3

From this, it looks like it's behaving as I hoped it would. The order of multiple entries with the same key is undefined, and unaffected by the sort order.

ecton commented 2 years ago

I forgot to say -- thank you for checking out BonsaiDb! Let me know if you have any other questions.

I have added an actual explanation to the documentation for those functions now, so the expected behavior is hopefully clearer. That's really only a partial solution, however, as I still feel like views aren't explained well enough yet.

I'm planning on a major overhaul of view documentation soon, including more examples on different usage patterns.

cortopy commented 2 years ago

Thanks for updating the comments. It does make sense now. Following what you said, I've been reading up on CouchDB's map/reduce, and I was surprised at how similar it is to BonsaiDb. From reading your posts, I knew you took inspiration from it, but now I can see that if you know CouchDB, you can expect BonsaiDb to be very similar.

With this in mind, I thought it would be nice to add to the User guide (rather than just the docs) something like:

Map/reduce is similar to creating indexes in both SQL and NoSQL databases. However, it also differs in some important ways. BonsaiDb is in an alpha stage and the implementation may change. In the meantime, if you're new to the concept of map/reduce for database indexing, a good place to start may be the CouchDB guide for views, as BonsaiDb is heavily inspired by it.

I know that referring to another database may not be wise, but at least it's something for now? Their guide was very revealing to me. After reading it, I could see why the expected behaviour when sorting documents with the same key is undefined.

So that's about docs.

On the subject of how to sort blog posts by category, I had assumed that when emitting the same key in the map function, the ordering would then be the same as insertion order (i.e., by document id if not using natural ids). I can see now that I was totally wrong.

But then, this leads me to the goal I was trying to achieve. Let's imagine I have a blog and I want to show users all posts within the category "Rust". Since there are too many to display on one page, the frontend needs to paginate posts within that category. The Rust category page should show only the most recent posts and then have a link to other pages. The query needed would be blog posts by category, sorted by some kind of timestamp. I found an exactly analogous question on Stack Overflow for CouchDB.

According to that answer, in CouchDB, the map function can emit an array like [key, sorting_token]. Having specified the sort token, it is then possible to query using start_key and end_key to control the sort order. However, as far as I can see, there is nothing like that in BonsaiDb. Is this correct?

I can see you can query key ranges, but there is no way to specify a sorting token in the query.

I've been thinking hard about this one, and my experience with DynamoDb came to mind, where you create composite string keys for partition keys. Would this be a good workaround?

In the map function

document.header.emit_key_and_value(format!("{}_{}", post.category, post.publication_date), 1);

and then

let rust_posts = db
    .view::<BlogPostsByCategory>()
    .with_key_prefix("Rust".to_string())
    .descending()
    .query_with_collection_docs()?;

This does work, but being new to BonsaiDb, I wonder if this could be an anti-pattern.

ecton commented 2 years ago

I know that referring to another database may not be wise but at least it's something for now?

It's funny, I actually used to call out how similar it was, but as I've grown further from CouchDB's core design, I started trying to rely on my own docs more. CouchDB's view guide definitely serves as a good introduction to the fundamentals, but it breaks down on anything beyond the basic concepts, as you ran into.

I truly appreciate the feedback.

Another invaluable source of inspiration for neat ways to use views is Redis' indexing guides. As Redis is primarily a key-value store and views are primarily a key-value mechanism, many of those indexing tricks will work in BonsaiDb too.

According to that answer, in couchdb, the map function can emit an array like [key, sorting_token].

This is one of the areas where BonsaiDb differs greatly. In CouchDB, everything is JSON, so it's easy to write code that compares two values -- after all, JSON can only contain a limited number of types.

The way I solved this is to create the Key trait, which allows any implementor to serialize itself into bytes in such a way that ordering is preserved in binary form.
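
To give a rough illustration of what "ordering is preserved in binary form" means, here is a bit of plain std Rust. This is not the actual Key API, just the property it relies on:

    // Illustration only: big-endian integer encoding preserves numeric order
    // when the resulting byte arrays are compared lexicographically.
    fn main() {
        let (a, b): (u32, u32) = (3, 200);
        assert_eq!(a.cmp(&b), a.to_be_bytes().cmp(&b.to_be_bytes()));

        // Little-endian encoding does not have this property: 3 sorts after 256
        // when the bytes [3, 0, 0, 0] and [0, 1, 0, 0] are compared as bytes.
        let c: u32 = 256;
        assert!(a < c);
        assert!(a.to_le_bytes() > c.to_le_bytes());
    }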

One of the implementations is for tuples:

#[derive(Debug, Clone, View)]
#[view(collection = BlogPost, key = (Option<String>, String), value = u32, name = "by-category")]
pub struct BlogPostsByCategory;

impl ViewSchema for BlogPostsByCategory {
    type View = Self;

    fn map(&self, document: &BorrowedDocument<'_>) -> ViewMapResult<Self::View> {
        let post = BlogPost::document_contents(document)?;
        document
            .header
            .emit_key_and_value((post.category, post.title.to_lowercase()), 1)
    }

    fn reduce(
        &self,
        mappings: &[ViewMappedValue<Self::View>],
        _rereduce: bool,
    ) -> ReduceResult<Self::View> {
        Ok(mappings.iter().map(|mapping| mapping.value).sum())
    }
}

The next question is: how to query? Well, unfortunately tuples don't have an implementation of IntoPrefixRange (this is a newer trait, but a custom type could implement this for easier querying). Instead, we have to manually build a range to represent all possible keys:

    let rust_posts = db
        .view::<BlogPostsByCategory>()
        .with_key_range(String::from("Rust")
            .into_prefix_range()
            .map(|category| (Some(category), String::default())))
        .query_with_docs()?;
    for mapping in &rust_posts {
        let post = BlogPost::document_contents(mapping.document)?;
        println!(
            "Retrieved post #{} \"{}\"",
            mapping.document.header.id, post.title
        );
    }

This is a little ... non-ergonomic, but it will always produce the correct sort order. Common advice for both CouchDB and BonsaiDb would be to limit the length of the sort key. In the future, deriving Key will be possible, which will make using custom types instead of tuples easier.
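
For intuition on why this always sorts correctly, here is a bit of plain std Rust, independent of BonsaiDb and using made-up titles: tuples compare element by element, so the category is compared first and the lowercased title breaks ties, and the Key encoding preserves that same ordering in binary form.

    // Illustration only: sorting (category, lowercased title) pairs groups by
    // category first and orders by title within each category.
    fn main() {
        let mut keys = vec![
            (Some("Rust".to_string()), "new version of bonsaidb released".to_string()),
            (Some("Cooking".to_string()), "how to make pasta".to_string()),
            (Some("Rust".to_string()), "new rust version released".to_string()),
        ];
        keys.sort();
        for key in &keys {
            println!("{key:?}");
        }
        // Prints the Cooking post first, then the two Rust posts ordered by title.
    }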

Of note, don't try this on main -- this example shows a bug caused by my recent refactor that didn't get caught by unit tests. It works correctly in the released version. Fixed.

cortopy commented 2 years ago

@ecton this is awesome! I thought about a tuple, but then I didn't have a clue how I would query by a tuple without knowing the second element (the timestamp). This solves it, and I love how it's done.

There's a section on the Key trait in the User Guide, but the information there doesn't do justice to everything that BonsaiDb can do already. My entry point to BonsaiDb was the User Guide; I read all sections before daring to write a single line. In retrospect, it feels like an introduction to data modelling and the query API would have been useful.

Would you be open to creating a new section in the "User guide" for the data model and query API? If you help me with the structure you'd like, I'd be happy to start a draft PR for this.

Some ideas off the top of my head:

ecton commented 2 years ago

I would be silly to refuse such an offer! These are all incredibly great ideas to focus on. Additionally, as you encounter things that seem "more complicated" than they should be, please do let me know, as I may be able to improve ergonomics fairly easily for certain use cases.

I don't really know how to structure it -- it's one of the reasons I shipped v0.4.0 without more revisions to the guide. I moved the Key trait section, but it's woefully inadequate. At some point, I am planning on spinning the key module into its own crate so that Nebari and any other key-value store in Rust could use these encoding functions. At that point, I'm not sure where the Key-related documentation should live. Regardless, we can cross that bridge when we get there.

I'm very flexible about the documentation at this stage. I would encourage you to just start and send a pull request whenever you have some content you'd like me to review. I'm an open book and am happy to answer any questions about anything you're uncertain of.

Thank you so much for the offer!

cortopy commented 2 years ago

Great! I'll probably start by writing some more examples I can then refer to. It's a bit hectic on my side, so this may take time, but it'll come.

ecton commented 2 years ago

I'm excited to see what you come up with, but please do not feel any pressure on my accord! Thank you again!

martpie commented 5 months ago

@ecton I am facing a similar issue, but not with views, just by querying a standard collection. I had to sort the documents myself (I know the current implementation is inefficient, but I am only using it on very small datasets of dozens of docs).

Here's a code sample:

pub async fn get_tracks(&self, track_ids: Vec<String>) -> AnyResult<Vec<Track>> {
    let docs = self.tracks_collection().get_multiple(&track_ids).await?;

    match self.decode_docs::<Track>(docs) { // decode the doc content
        Ok(mut tracks) => {
            // documents may not be ordered the way we want, so ensure they match the order of track_ids
            tracks.sort_by_key(|track| track_ids.iter().position(|id| id == &track._id));
            Ok(tracks)
        }
        Err(err) => Err(err),
    }
}

Is this the intended behavior when using get_multiple?

ecton commented 5 months ago

@martpie This is a deficiency in the documentation and potentially missing functionality. The return order is not guaranteed, although in the current implementation it will be the binary-sorted key order, since that's the order the documents are stored on disk and how the underlying storage format works.

When writing this, I had a choice: add the extra sorting step and possibly slow down some use cases where order didn't matter, or just return the documents as-is. I went with the latter for the initial implementation.
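
If the linear position() scan ever becomes a concern, one option is to build a map from id to requested index and sort by that. A rough sketch (the Track type here is just a stand-in mirroring your snippet, not a BonsaiDb type):

    use std::collections::HashMap;

    // Stand-in for the `Track` type from the snippet above.
    struct Track {
        _id: String,
    }

    // Reorders fetched documents to match the requested id order without the
    // quadratic scan: each id's position is looked up in a hash map instead of
    // searching the id list for every document.
    fn sort_to_requested_order(track_ids: &[String], mut tracks: Vec<Track>) -> Vec<Track> {
        let positions: HashMap<&str, usize> = track_ids
            .iter()
            .enumerate()
            .map(|(index, id)| (id.as_str(), index))
            .collect();

        // Ids that weren't requested (shouldn't happen here) sort last.
        tracks.sort_by_key(|track| {
            positions
                .get(track._id.as_str())
                .copied()
                .unwrap_or(usize::MAX)
        });
        tracks
    }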