quantleaf / probly-search

A lightweight full-text search library written in Rust that provides full control over the scoring calculations
MIT License
60 stars 5 forks source link

Migrate to methods of Index #15

Closed tmpfs closed 2 years ago

tmpfs commented 2 years ago

Closes #13.

tmpfs commented 2 years ago

@marcus-pousette, I think this is more or less there. Later I think it would be nice to spend a bit of time improving the API documentation but for the moment I think this is good enough.

One thing that bothers me a bit is that query() takes a mutable reference to the Index and that seems counter-intuitive to what a query function should do. I wonder if there is any way to remove the mutability or otherwise to use interior mutability so we could use &self - maybe better in a separate issue/PR.

marcus-pousette commented 2 years ago

Thanks for the PR! @tmpfs

The only thing I am considering is that, if it is necessary to have "everything" is in the index file (query just contain tests with this PR). I see why you made this changes (because the query fn is now a member of Index). I wonder if it would be possible to have separation of this, so I as a reader can more quickly get an overview/insight into different parts of this library without having to read through a long index file. I am also taking height for a future where the query capabilities are much more feature rich. Which would make the index file even larger if the same approach is used.

For example you could perhaps make

use crate::query::query;

impl Index {
    fn query(....) {
        query(...)
    }
}

instead of

impl Index {
    fn query(....) {
       " ----- 100 lines of code ----- "
    }
}

fn query_helper_method_1 ( ... ) {}

fn query_helper_method_2 ( ... ) {}

What do you think?

I fully agree that passing a mutable index to the query is not right / looks weird. The reason for this is that is that I am vacuuming removed documents during query time since you might be iterating over them. Not sure if this feature is worth the price of having to pass a mutable reference. Maybe it would be better to return pointers to nodes that you later can remove instead, or something.

tmpfs commented 2 years ago

Sure @marcus-pousette that sounds reasonable to separate out the modules. I will update this PR with those changes soonish.

I thought that automatic vacuuming was the reason for the mutable reference and thanks for clarifying that. I would argue that it is not worthwhile to automatically vacuum. When i was integrating with my project i had a nicely encapsulated index but had to expose a mutable reference in order to query and i would like to remove that as the owner of the index should only be able to mutate it.

For consumers of the library that need vacuuming then having the query ignore the removed documents is enough i think. Eventually the index will be purged when vacuum is called explicitly.

tmpfs commented 2 years ago

@marcus-pousette, please take another look, I think it works better now; in particular the trait bounds for T are clearer and more consistent we only define them twice (in the index and query modules) for each impl Index block 👍

marcus-pousette commented 2 years ago

Yes, looks great now!

Now. query module is completely independent of index (i.e. there can be an index without any query capabilities potentially).

tmpfs commented 2 years ago

Yes, we could put query behind a feature flag if necessary. I would imagine we would certainly want to do that for the fuzzy matching feature but not sure whether it makes sense for query - is the index useful without query?

marcus-pousette commented 2 years ago

I thought that automatic vacuuming was the reason for the mutable reference and thanks for clarifying that. I would argue that it is not worthwhile to automatically vacuum. When i was integrating with my project i had a nicely encapsulated index but had to expose a mutable reference in order to query and i would like to remove that as the owner of the index should only be able to mutate it.

For consumers of the library that need vacuuming then having the query ignore the removed documents is enough i think. Eventually the index will be purged when vacuum is called explicitly.

I agree with you. I realize now that there might even be immediate performance benefits to do this your way since we are to remove some checks. I created an issue for this: #16

marcus-pousette commented 2 years ago

Yes, we could put query behind a feature flag if necessary. I would imagine we would certainly want to do that for the fuzzy matching feature but not sure whether it makes sense for query - is the index useful without query?

Query capabilities will always be necessary in someway but now at least you can implement your own query module and not include this query module when depending on this library (if there is a feature flag).

I am not sure how far back one need to go to implement the fuzzy matching. But I am quite sure you do not need to reinvent everything in the query module, there must be some smart abstraction to make that allows existing functionality coexist nicely with fuzzy matching features. We could take this in the #12 issue, but maybe also think about what fuzzy matching could represent in a bigger picture, for example, many "fuzzy" matching systems today are using synonym/word2vec modules to measure distances from what you searching for and what exist in the index. Fuzzy matching is in some sense a property where we set the allowed "error distance" where distance could represent Levenshtein distance, or whatever distance type that is feasible for your need. Continuing on this, perhaps it would be interesting to pass an "error" function, that allows you during query to set the precision of your query. This error function could itself be a function of some distance measurement algorithm, like Levenshtein or word2vec word distance norm

tmpfs commented 2 years ago

Thanks for landing this @marcus-pousette, I agree that we shouldn't need to make many changes to support fuzzy matching. Allowing complete control of query through a feature flag is possible but I think a better approach is to allow complete control of the query behavior (scoring, matching, filtering, sorting) via the arguments to query.

First I will deal with #16 and I want to spend a bit more time on the documentation to help my understanding of the code.