neuml / txtai

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
https://neuml.github.io/txtai
Apache License 2.0
8.69k stars 573 forks source link

Expand the graph capabilities to enable advanced graph traversal. #540

Closed dustyatx closed 6 months ago

dustyatx commented 1 year ago

REQUEST

I hope this isn't to overwhelming of a request. This would make TxtAI's graph act more like other graph DBs I've used and would open up a lot of possibilities for more sophisticated and accurate queries.

I've added links to the other requests I made such as node id assignment on insert and filtering during a graph traversal. I provided some representative code and document structures but that is not to imply I am recommending any particular kind of implementation, I just figured it would add it for clarity.

Please correct me if I've missed existing features that can accomplish this, I have done my best to understand the documentation but I still have a lot to learn.

Feature Overview:

  1. Custom Node ID Assignment, advanced graph walk using edge & metadata filtering.

Why This Matters:

  1. Adding the capability to assign node IDs upon insertion offers users greater control over their dataset, ensuring consistency especially if migrating or synchronizing data across systems.
  2. The enhanced filtering options on both edges and metadata further refine data retrieval, making the database even more versatile.

Detailed Requirements:

  1. Custom Node ID Assignment: - Upon node insertion, allow for an optional custom ID to be specified. - If no ID is given, default to the current automated assignment method. #525

    a. Overwrite checks to make sure you don't accidentally clobber existing nodes.

  2. Edge Filtering during Traversal: - Permit users to indicate specific edges they'd like to traverse. - This targets queries for more precise outcomes. #534

  3. Metadata Filtering: - Users should be able to filter nodes based on associated metadata. - Important for scenarios where temporal relevance or other specific metadata factors are crucial. #534

    a. Type functions: timestamp ranges, string matches, integer & float operators

  4. Statically Defined Edges: Allow users to add custom static edges between nodes, facilitating manual definition of relationships. This is useful in cases where relationships are already known and do not need to be inferred. #525

EXAMPLE:

  1. Starting Point: We initiate our graph walk at the node representing the concept of "Amusement Parks".

  2. Similar Business Types: From there, we branch out to business type nodes that are conceptually proximate to "Amusement Parks", filtering by a similarity threshold of 0.8 or greater.

  3. Company Discovery: We then dive deeper into individual company nodes nested within these conceptually similar business types. For instance, under the business type "Theme Parks", we discover "Adventure Kingdom Florida".

  4. Employee Lookup: Our traversal continues to the "People" node to identify employees associated with the discovered companies.

  5. Occupation Filtering: From these individual "People" nodes, we scrutinize their occupations, ensuring a focus on roles that are akin to "Maintenance Technicians", based on similarity metrics.

  6. Date Filtering: Next, we filter out individuals whose records were updated post-2022-12-31.

  7. State Filtering: The graph walk progresses further by examining the residence state of these individuals, specifically honing in on "Florida".

  8. City Exclusion: Within Florida, we ensure to exclude individuals residing in "Orlando" from our final selection.

  9. Name Retrieval: Lastly, we pull out the individual's name from each qualifying "People" node to wrap up our comprehensive graph walk.

  10. Result Compilation: Our resultant data is an aggregated list of names that fit our exhaustive criteria.

VISUALIZATION

[Amusement Parks]
|
+--> [Business Types (Similarity >= 0.8)]
     |
     +--> [Theme Parks]
     |    |
     |    +--> [Adventure Kingdom Florida]
     |         |
     |         +--> [People]
     |              |
     |              +--> [John Doe]
     |              |    +--> Occupation: Ride Maintenance Technician (Similarity: 0.95)
     |              |    +--> Last Updated: 2022-05-12
     |              |    +--> State: Florida
     |              |    +--> City: Miami
     |              |
     |              +--> [Jane Smith]
     |              |   +--> Occupation: Park Equipment Technician (Similarity: 0.85)
     |              |    +--> Last Updated: 2021-11-10
     |              |    +--> State: Florida
     |              |    +--> City: Tampa
     |
     +--> [Adventure Parks]
          |
          +--> [Florida Explorers Haven]
               |
               +--> [People]
                    |
                    +--> [Alice Brown]
                    |    +--> Occupation: Ride Safety Technician (Similarity: 0.82)
                    |    +--> Last Updated: 2020-10-20
                    |    +--> State: Florida
                    |    +--> City: Fort Lauderdale
                    |
                    +--> [Bob White]
                    |    +--> Occupation: Theme Park Mechanic (Similarity: 0.88)
                    |    +--> Last Updated: 2021-07-07
                    |    +--> State: Florida
                    |    +--> City: Jacksonville

JSON DOC:

{
    "id": "unique_identifier_for_node",  # Unique ID assigned to each node on creation
    "type": "node_type",  # Categorical identifier (e.g., business_type, people, occupation, location)
    "text": "vector_data_for_similarity_search",  # Vector representation for similarity search
    "metadata": {  
        "name": "readable_name_or_title",  # Display name or title for the node
        "timestamp": "2022-12-31T23:59:59Z",  # ISO format timestamp for the last update
        # ... other metadata attributes can be added as needed
    },
    "edges": {  
        "static": {
            "employee_of": [(node_id1, weight1), (node_id2, weight2)],  # Nodes connected through static edges
            "another_edge_type": [(node_id3, weight3)]
        },
        # ... add other static edge types as needed
        "similarity": [(node_id4, 0.9), (node_id5, 0.85)]  # Nodes connected through similarity measure
    }
}

PSEDUO CODE

SDK:

class GraphDatabaseEngine:
    def __init__(self):
        # Assume internal structures: self.nodes and self.edges
        pass

    class GraphWalkAdvanced:
        def __init__(self, parent_db):
            self.db = parent_db

        def find_nodes_by_similarity(self, text, threshold=0.8, limit=5):
            # Using vector representation and similarity function, find nodes similar to text
            similar_nodes = []  # Filled with nodes after comparison
            # ... some logic ...
            return similar_nodes[:limit]

        def find_nodes_by_edge(self, current_nodes, edge_type, limit=10):
            next_nodes = []
            for node in current_nodes:
                next_nodes.extend(self.db.edges[node.id][edge_type][:limit])
            return next_nodes

        def filter_nodes_by_metadata(self, nodes, metadata):
            key, operation, value = metadata
            if operation == '<':
                return [node for node in nodes if node.metadata[key] < value]
            elif operation == '==':
                return [node for node in nodes if node.metadata[key] == value]
            # ... other possible operations ...

        def execute(self, start_node_text, query):
            current_nodes = [self.find_nodes_by_similarity(start_node_text)]
            for step in query:
                qtype = step['type']
                if qtype == 'similarity':
                    current_nodes = self.find_nodes_by_similarity(start_node_text, step['threshold'], step['limit'])
                elif qtype == 'static_edge':
                    current_nodes = self.find_nodes_by_edge(current_nodes, step['edge_type'], step['limit'])
                elif qtype == 'metadata_filter':
                    current_nodes = self.filter_nodes_by_metadata(current_nodes, step['metadata'])
            return current_nodes

User code:

Scenario 1: Start with Similarity to 'Amusement Parks' and End at a specific node via text match

query1 = [
    {'type': 'similarity', 'threshold': 0.8, 'limit': 5},
    {'type': 'static_edge', 'edge_type': 'EMPLOYEE_OF', 'limit': 5},
    {'type': 'end_node_text', 'text': 'Disney'}
]
results1 = db.execute_graph_walk(start_node_text="Amusement Parks", query=query1)

Scenario 2: Start with Similarity to 'Amusement Parks' and End with Similarity to 'Water Parks'

query2 = [
    {'type': 'similarity', 'threshold': 0.8, 'limit': 5},
    {'type': 'static_edge', 'edge_type': 'EMPLOYEE_OF', 'limit': 5},
    {'type': 'end_node_similarity', 'threshold': 0.8, 'limit': 5}
]
results2 = db.execute_graph_walk(start_node_text="Amusement Parks", end_node_text="Water Parks", query=query2)

Scenario 3: Start with Similarity to 'Amusement Parks' and End with nodes connected via 'SUPPLIER' in Florida

complex_query = [
    {'type': 'similarity', 'threshold': 0.8, 'limit': 5},
    {'type': 'static_edge', 'edge_type': 'EMPLOYEE_OF', 'limit': 10},
    {'type': 'metadata_filter', 'metadata': ('occupation', 'Ride Maintenance Technician')},
    {'type': 'static_edge', 'edge_type': 'SUPPLIER', 'limit': 10},
    {'type': 'metadata_filter', 'metadata': ('last_updated_date', '<', '2023-01-01')},
    {'type': 'metadata_filter', 'metadata': ('location', 'Florida')}
]
results_complex = db.execute_graph_walk(start_node_text="Amusement Parks", query=complex_query)
davidmezzetti commented 1 year ago

Thank you for aggregating this all up into a single issue.

It's going to take me a bit to digest and think about. But I have long thought there was more to do with the graph component and this helps.

Ideally, I'd like for the similar clause to use the graph index when specified and take advantage of the precomputed/predefined relationships. Feels like there is a way for the similar clause or a new relationship clause to have parameters like you show above in the user code section.

Most of the pieces are already in the graph component, it's just a matter of properly wiring the connections with the embeddings component.

dustyatx commented 1 year ago

Very exciting to hear that you have a lot of these pieces in place.

Personally I like relationship with different types, similarity, statically defined, recommended. In this case the difference between similarity and recommended would be how you infer the relationship, either through vector ANN or a graph calculations.

What I'm theorizing is that if we can have edges with different types of connections, I can better control what I feed my LLM Agents. If I need known facts to answer a question then I will only want to use statically defined relationships, things I know are true. Being able to find closely related things is really dependent on how you structure the data but I think if I push the similarity down and I grab items that are further away, I can feed facts that could add some creativity that closely related things wouldn't find. I'll run that through a LLM to decide if it's relevant or not.. So I can envision if I query for eBikes, close similarity is bikes, but further away I might get scooters, electric skateboards that aren't directly related to the query but still provide interesting insights that might not have normally been considered.

davidmezzetti commented 10 months ago

Apologize for continuously pushing this to the right. It's an important addition for me, just keeps getting overcome by other priorities.

dustyatx commented 10 months ago

@davidmezzetti you have absolutely no reason to apologize, you are not obligated to contribute anything more than what you choose to. I am very grateful that you've invested so much in this project and it's become my goto VectorDB. I want to see you thrive and I'm doing my best to get it attention whenever I can.

davidmezzetti commented 7 months ago

Just wanted to provide an update here. I have #525 complete locally and I have a plan for #534. I believe that all the filtering you describe can be done with a SQL statement. I can then use those results to create a subgraph that can then be walked. The subgraph would have all the filters applied.

Should lead to some really exciting things, especially when used as an input for RAG.

dustyatx commented 7 months ago

That is exciting news! This would be really helpful to what we're working on now.

My current approach is a very large search which retrieves about 3k records and then I use a LLM to filter down to about 300. This approach casts a very wide net but it brings back a lot of extraordinary contextually related stuff, things that I could never get from a traditional knowledge graph.

I've created a bunch of metadata for filtering and I have some specially crafted nodes to create really great similarity matches for a walk. I should have a really good dataset to test with when you're ready to send something my way.

davidmezzetti commented 7 months ago

In reading this more, I think advanced path traversal is a separate but related feature to #644. While a query can filter the graph down, having a rule-based language added to graph.showpath would be a nice addition.

It would be nice if there is a standard graph traversal filtering language. I've seen GQL but going to see if that is overkill.

This could be a really cool feature!