llmware-ai / llmware

Unified framework for building enterprise RAG pipelines with small, specialized models
https://llmware-ai.github.io/llmware/
Apache License 2.0

Query().apply_custom_filter method needs to be improved #272

Open virunew opened 9 months ago

virunew commented 9 months ago

Query().apply_custom_filter does a simple exact-match search for custom filters. This does not serve the purpose most of the time (at least for me). I need to filter using MongoDB operators like $regex, $ne, $gte, etc., which this method ignores. The method is used by the semantic_query method, so a filter that uses operators ends up filtering everything out. If we are filtering the text results from a semantic query, we should be able to support MongoDB operators like $regex, $eq, etc. We could use a library like pymongoquery to support Mongo-operator-style filtering of our results without making a trip to the MongoDB database.
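
To illustrate the problem, here is a minimal sketch of an exact-match filter. It is not llmware's actual implementation: `exact_match_filter` and the sample result dicts are hypothetical, and the sketch only shows why a filter value such as `{"$regex": ...}` matches nothing when it is compared with `==`.

```python
def exact_match_filter(results, custom_filter):
    """Keep results whose fields equal every filter value exactly (hypothetical helper)."""
    return [
        r for r in results
        if all(r.get(key) == value for key, value in custom_filter.items())
    ]


# Hypothetical query results; the field names are for illustration only.
results = [
    {"content_type": "text", "special_field2": "year 2024"},
    {"content_type": "image", "special_field2": "year 2023"},
]

# A plain value is matched as expected.
plain = exact_match_filter(results, {"content_type": "text"})

# An operator expression is compared as a literal dict, so nothing matches.
with_operator = exact_match_filter(results, {"special_field2": {"$regex": "2024$"}})

print(len(plain), len(with_operator))
```

The second call returns an empty list because no stored value is ever equal to the dict `{"$regex": "2024$"}`.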

shneeba commented 9 months ago

does simple exact search for custom filters

I've found this an issue as well. One way I was thinking of this is if there could be a way to do the semantic search against the file name as well as the text block (user defined, text block only, file name only, both). This would involve embedding the filename and storing this in milvus (or your db of choice) as well. We already store the blockID in there for the text. I need to dig a bit deeper into the embedding process as I don't think it's a small change (🤞 it could be).

This could keep it in line with doing things with natural language and prevent having to configure your regex strings as well.

virunew commented 9 months ago

I just wrote this piece of code, which can match MongoDB operators like $eq, $ne, $regex, etc. It uses the mongoquery package (https://pypi.org/project/mongoquery/) to do the matching. mongoquery has some limitations with regex, but these can be remedied by compiling the regex expression first.

    def apply_filter_including_mongodb_operators(self, data, filter_dict):

        # filter_dict is a dict with any number of key:value pairs - each key is treated
        #   as part of an implicit "$and", so data must match every key:value in filter_dict
        # mongoquery is used to match plain values as well as mongodb operators
        import re
        import mongoquery
        query_list = []
        for field, expression in filter_dict.items():
            if isinstance(expression, dict):
                # expression is a mongo operator
                if "$regex" in expression:
                    # compile the pattern (case-insensitive) before building the query
                    regex_pattern = expression['$regex']
                    compiled_pattern = re.compile(regex_pattern, re.IGNORECASE)
                    query_list.append(mongoquery.Query({field: {"$regex": compiled_pattern}}))
                else:
                    # any other mongo operator: use the expression as-is
                    query_list.append(mongoquery.Query({field: expression}))
            else:
                # plain value (no mongo operator): exact match
                query_list.append(mongoquery.Query({field: expression}))
        # data passes only if every per-field query matches
        return all(query.match(data) for query in query_list)

    #modified
    def apply_custom_filter(self, results, custom_filter):
        filtered_results = []
        for result in results:
            if self.apply_filter_including_mongodb_operators(result, custom_filter):
                filtered_results.append(result)
        return filtered_results

I tested the apply_filter_including_mongodb_operators() method with the following data:

filter_dict = { "special_field1": {"$ne":"VIRENDRA"}, "special_field2": { "$regex": "^.*2024$"}, "content_type": "text" }

data = { "special_field1": "virendra", "special_field2": "year 2024", "content_type" : "text" }

So this function can match:

  1. strings, e.g. 'text' in my example
  2. regex, e.g. second element of my filter_dict
  3. MongoDB operators (e.g. $ne, as in the first element of filter_dict)

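
For anyone who wants to check the test case above without installing mongoquery, here is a rough standard-library sketch. It is a hypothetical stand-in, not mongoquery itself: `matches` and `_COMPARATORS` are names made up for this example, and only a handful of operators are covered.

```python
import operator
import re

# Hypothetical stand-in for mongoquery: a few comparison operators only.
_COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(data, filter_dict):
    """Return True if data satisfies every key in filter_dict (implicit $and)."""
    for field, expression in filter_dict.items():
        value = data.get(field)
        if not isinstance(expression, dict):
            # Plain value: exact match.
            if value != expression:
                return False
            continue
        for op, operand in expression.items():
            if op == "$regex":
                # Case-insensitive, mirroring re.IGNORECASE in the method above.
                if not isinstance(value, str) or not re.search(operand, value, re.IGNORECASE):
                    return False
            elif op in _COMPARATORS:
                try:
                    if not _COMPARATORS[op](value, operand):
                        return False
                except TypeError:
                    # Incomparable types (e.g. None > 3) count as no match.
                    return False
            else:
                raise ValueError(f"unsupported operator: {op}")
    return True


filter_dict = {
    "special_field1": {"$ne": "VIRENDRA"},
    "special_field2": {"$regex": "^.*2024$"},
    "content_type": "text",
}
data = {"special_field1": "virendra", "special_field2": "year 2024", "content_type": "text"}

print(matches(data, filter_dict))  # True
```

The $ne comparison is case-sensitive, so "virendra" is not equal to "VIRENDRA" and the first condition passes. The regex and the plain string match as well.
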
virunew commented 9 months ago

does simple exact search for custom filters

I've found this an issue as well. One way I was thinking of this is if there could be a way to do the semantic search against the file name as well as the text block (user defined, text block only, file name only, both). This would involve embedding the filename and storing this in milvus (or your db of choice) as well. We already store the blockID in there for the text. I need to dig a bit deeper into the embedding process as I don't think it's a small change (🤞 it could be).

This could keep it in line with doing things with natural language and prevent having to configure your regex strings as well.

I think this would be really useful. I would even go further and make it flexible, so that we can search in any number of fields as the user wishes. I can already think of use cases for it. For example, if you are searching time-sensitive documents (say, news), you may want to store the date/month/year of the news item in one of the custom fields, and then search on that custom field and sort by it.
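
As a rough sketch of the news use case, the snippet below filters a list of result dicts by a date stored in a custom field and then sorts by it. Everything here is an assumption for illustration: `news_date`, the sample results, and `filter_and_sort_by_date` are hypothetical, and the dates are assumed to be ISO strings (YYYY-MM-DD).

```python
from datetime import date

# Hypothetical results; "news_date" is an illustrative custom field name.
results = [
    {"text": "Rates held steady", "news_date": "2024-03-01"},
    {"text": "Rates cut", "news_date": "2024-06-15"},
    {"text": "Rates raised", "news_date": "2023-11-20"},
]


def filter_and_sort_by_date(results, field, start, end, newest_first=True):
    """Keep results whose ISO date in `field` falls within [start, end], then sort."""
    dated = []
    for r in results:
        try:
            d = date.fromisoformat(r[field])
        except (KeyError, TypeError, ValueError):
            continue  # skip results without a usable date
        if start <= d <= end:
            dated.append((d, r))
    dated.sort(key=lambda pair: pair[0], reverse=newest_first)
    return [r for _, r in dated]


recent = filter_and_sort_by_date(
    results, "news_date", date(2024, 1, 1), date(2024, 12, 31)
)
print([r["text"] for r in recent])  # ['Rates cut', 'Rates held steady']
```

Parsing the field into a date before comparing avoids relying on string ordering, which only works when every value uses the same zero-padded format.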

shneeba commented 9 months ago

apply_filter_including_mongodb_operators() method

This is really cool and really expands the custom filter use, thanks for sharing, I'll have a play around!

Yep fully agree. I think it's got some real legs!