neuml / txtai

💡 All-in-one open-source embeddings database for semantic search, LLM orchestration and language model workflows
https://neuml.github.io/txtai
Apache License 2.0

Feature request : Enhanced Geospatial and Temporal Search #740

Open nicolas-geysse opened 5 months ago

nicolas-geysse commented 5 months ago

Here's a plan to enhance TxtAI with geospatial and temporal search capabilities:

1. Extend indexing for geospatial data:

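import geopandas as gpd
import pandas as pd
from txtai.graph import Graph

# Assumption carried over from this proposal: the base Graph has a no-argument
# constructor and an add_node() method; verify against the installed txtai version.
class GeospatialGraph(Graph):
    def __init__(self):
        super().__init__()
        self.gdf = gpd.GeoDataFrame()

    def add_node(self, node_id, geometry, **attr):
        super().add_node(node_id, **attr)
        # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
        row = gpd.GeoDataFrame({'node_id': [node_id]}, geometry=[geometry])
        self.gdf = row if self.gdf.empty else pd.concat([self.gdf, row], ignore_index=True)

    def spatial_query(self, geometry, predicate='intersects'):
        if self.gdf.empty:
            return []
        # Look up the GeoSeries predicate by name, e.g. intersects, within, contains
        mask = getattr(self.gdf.geometry, predicate)(geometry)
        return self.gdf[mask]['node_id'].tolist()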
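
The predicate lookup used by `spatial_query` can be tried in isolation. The following is a minimal sketch, assuming GeoPandas and Shapely are installed; the node IDs and coordinates are made up for illustration, and `query_nodes` is a hypothetical helper, not part of txtai.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical nodes: two points near the origin and one far away
gdf = gpd.GeoDataFrame(
    {"node_id": ["a", "b", "c"]},
    geometry=[Point(0, 0), Point(0.5, 0.5), Point(10, 10)],
)

def query_nodes(frame, geometry, predicate="intersects"):
    # Resolve the GeoSeries method by name (intersects, within, ...) and apply it to every row
    mask = getattr(frame.geometry, predicate)(geometry)
    return frame.loc[mask, "node_id"].tolist()

area = Point(0, 0).buffer(1)  # polygon approximating a disc of radius 1 around the origin
nearby = query_nodes(gdf, area)
inside = query_nodes(gdf, area, predicate="within")
```

Both queries select nodes "a" and "b" here; the predicates only start to differ once the stored geometries are lines or polygons that can partially overlap the query area.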

2. Implement temporal search functionalities:

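import pandas as pd
from txtai.graph import Graph

class TemporalGraph(Graph):
    def __init__(self):
        super().__init__()
        # Node IDs and timestamps are kept in parallel, position by position
        self.temporal_ids = []
        self.temporal_index = pd.DatetimeIndex([])

    def add_node(self, node_id, timestamp, **attr):
        super().add_node(node_id, **attr)
        self.temporal_ids.append(node_id)
        self.temporal_index = self.temporal_index.append(pd.DatetimeIndex([timestamp]))

    def temporal_query(self, start_time, end_time):
        # Return node IDs (not timestamps) so the result can be intersected with search hits
        mask = (self.temporal_index >= start_time) & (self.temporal_index <= end_time)
        return [node_id for node_id, hit in zip(self.temporal_ids, mask) if hit]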
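
If the temporal index grows large, a sorted structure avoids scanning every timestamp on each query. The sketch below is one possible alternative, not the approach proposed above: it swaps pandas for the standard library's `bisect` module, and `SortedTemporalIndex` is a hypothetical name introduced only for illustration.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

class SortedTemporalIndex:
    """Keeps timestamps sorted, with node IDs in a parallel list, for binary-search range queries."""

    def __init__(self):
        self.timestamps = []
        self.node_ids = []

    def add(self, node_id, timestamp):
        # Insert at the position that keeps both lists ordered by timestamp
        pos = bisect_right(self.timestamps, timestamp)
        self.timestamps.insert(pos, timestamp)
        self.node_ids.insert(pos, node_id)

    def query(self, start_time, end_time):
        # Inclusive range: first index >= start_time up to the last index <= end_time
        lo = bisect_left(self.timestamps, start_time)
        hi = bisect_right(self.timestamps, end_time)
        return self.node_ids[lo:hi]

index = SortedTemporalIndex()
index.add("n1", datetime(2023, 1, 1))
index.add("n2", datetime(2021, 6, 1))
index.add("n3", datetime(2023, 12, 31))
in_range = index.query(datetime(2022, 1, 1), datetime(2023, 12, 31))
```

Like `temporal_query`, this returns node IDs and treats both ends of the range as inclusive.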

3. Integrate with existing semantic search:

from txtai.embeddings import Embeddings

class SpatioTemporalSemanticGraph(GeospatialGraph, TemporalGraph):
    def __init__(self):
        super().__init__()
        self.embeddings = Embeddings()

    def add_node(self, node_id, geometry, timestamp, text, **attr):
        # Pass geometry and timestamp by keyword so each parent's add_node picks up its own argument
        super().add_node(node_id, geometry=geometry, timestamp=timestamp, **attr)
        # upsert() adds to the existing index; index() would rebuild it on every call
        self.embeddings.upsert([(node_id, text, None)])

    def search(self, query, geometry=None, start_time=None, end_time=None, limit=10):
        # Results are (id, score) tuples; filtering happens afterwards, so fewer than `limit` may remain
        results = self.embeddings.search(query, limit)

        if geometry is not None:
            spatial_results = set(self.spatial_query(geometry))
            results = [r for r in results if r[0] in spatial_results]

        if start_time is not None and end_time is not None:
            temporal_results = set(self.temporal_query(start_time, end_time))
            results = [r for r in results if r[0] in temporal_results]

        return results
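
The combined class inherits two `add_node` methods with different signatures, which only works if the call chain is cooperative. A minimal sketch of that pattern follows, using stand-in classes (`BaseGraph`, `GeoMixin`, `TimeMixin` and `Combined` are hypothetical names, not txtai classes): each level consumes its own keyword argument and forwards the rest along the method resolution order.

```python
class BaseGraph:
    """Stand-in for the graph base class (hypothetical, not txtai's Graph)."""

    def __init__(self):
        self.nodes = {}

    def add_node(self, node_id, **attr):
        self.nodes[node_id] = attr

class GeoMixin(BaseGraph):
    def __init__(self):
        super().__init__()
        self.geometries = {}

    def add_node(self, node_id, geometry, **attr):
        self.geometries[node_id] = geometry
        # Forward everything this class did not consume to the next class in the MRO
        super().add_node(node_id, **attr)

class TimeMixin(BaseGraph):
    def __init__(self):
        super().__init__()
        self.timestamps = {}

    def add_node(self, node_id, timestamp, **attr):
        self.timestamps[node_id] = timestamp
        super().add_node(node_id, **attr)

class Combined(GeoMixin, TimeMixin):
    def add_node(self, node_id, geometry, timestamp, **attr):
        # Keywords, not positions: GeoMixin takes geometry, TimeMixin takes timestamp
        super().add_node(node_id, geometry=geometry, timestamp=timestamp, **attr)

g = Combined()
g.add_node("1", geometry=(0, 0), timestamp="2023-01-01", label="sample")
```

Passing `geometry` and `timestamp` positionally would fail at the first parent, because its `add_node` accepts only one extra positional argument.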

This implementation:

  1. Uses GeoPandas for geospatial indexing, which is compatible with NetworkX.
  2. Utilizes pandas for temporal indexing, which is already part of TxtAI's ecosystem.
  3. Combines with TxtAI's existing semantic search by filtering its results against the spatial and temporal matches.
  4. Provides a simple interface for combined spatio-temporal-semantic queries.
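
One caveat with filtering after the semantic search: if the top `limit` hits are mostly outside the requested area or time range, few or no results survive. A possible mitigation, sketched here under the assumption that the search function returns `(id, score)` pairs sorted by descending score, is to fetch progressively more candidates until enough pass the filter. `filtered_search` and `fake_search` are hypothetical helpers introduced for illustration.

```python
def filtered_search(search, query, allowed_ids, limit=10, factor=5):
    """Over-fetch from search(query, k) until `limit` hits pass the filter or the index is exhausted."""
    k = limit
    while True:
        hits = search(query, k)
        kept = [(node_id, score) for node_id, score in hits if node_id in allowed_ids]
        # Stop when enough hits survive, or when the index returned fewer than requested
        if len(kept) >= limit or len(hits) < k:
            return kept[:limit]
        k *= factor

# Stand-in for a semantic search backend: fixed scores, best first
scores = [("a", 0.9), ("b", 0.8), ("c", 0.7), ("d", 0.6)]

def fake_search(query, k):
    return scores[:k]

top = filtered_search(fake_search, "sample", allowed_ids={"c", "d"}, limit=2, factor=2)
```

A plain top-2 search followed by filtering would return nothing in this case, since both allowed nodes rank below the first two hits.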

To use this enhanced graph:

import pandas as pd
from shapely.geometry import Point

graph = SpatioTemporalSemanticGraph()
graph.add_node("1", Point(0, 0), pd.Timestamp("2023-01-01"), "Sample text")
results = graph.search("sample",
                       geometry=Point(0, 0).buffer(1),
                       start_time=pd.Timestamp("2022-01-01"),
                       end_time=pd.Timestamp("2024-01-01"))

This approach extends TxtAI's capabilities while maintaining simplicity and integration with its existing ecosystem.

nicolas-geysse commented 4 months ago

A follow-up that relates more closely to attributes and types: the "Extension of Data Sources" point, focused on integrating geospatial and temporal data while preserving attributes and types.

Extension of Data Sources
• Addition: Data import with attribute and type preservation
• Addition: Support for geospatial and temporal data
• Libraries: qwikidata, geopandas, pandas
• Benefits: Enrichment of graphs with structured and geotemporal data

Implementation:

  1. Extend TxtAI's Graph class to support geospatial and temporal data:
from txtai.graph import Graph
import geopandas as gpd
import pandas as pd
from qwikidata.linked_data_interface import get_entity_dict_from_api

# Assumption carried over from this proposal: Graph has a no-argument constructor and
# exposes the underlying NetworkX graph as self.graph; verify against the installed txtai version.
class EnhancedGraph(Graph):
    def __init__(self):
        super().__init__()
        self.gdf = gpd.GeoDataFrame()
        self.temporal_data = pd.DataFrame()

    def add_geospatial_node(self, node_id, geometry, **attrs):
        self.graph.add_node(node_id, geometry=geometry, **attrs)
        # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
        row = gpd.GeoDataFrame([{'node_id': node_id, **attrs}], geometry=[geometry])
        self.gdf = row if self.gdf.empty else pd.concat([self.gdf, row], ignore_index=True)

    def add_temporal_node(self, node_id, timestamp, **attrs):
        self.graph.add_node(node_id, timestamp=timestamp, **attrs)
        row = pd.DataFrame([{'node_id': node_id, 'timestamp': timestamp, **attrs}])
        self.temporal_data = row if self.temporal_data.empty else pd.concat([self.temporal_data, row], ignore_index=True)

    def import_wikidata(self, entity_id):
        entity_dict = get_entity_dict_from_api(entity_id)
        node_id = entity_dict['id']
        # 'claims' maps each property ID to a list of statements; keep the first value per property
        attrs = {}
        for prop, statements in entity_dict['claims'].items():
            snak = statements[0]['mainsnak']
            if 'datavalue' in snak:
                attrs[prop] = snak['datavalue']['value']
        self.graph.add_node(node_id, **attrs)
        return node_id

    def to_geopandas(self):
        return self.gdf

    def to_temporal_pandas(self):
        return self.temporal_data
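
The Wikidata import depends on the shape of the entity JSON, where `claims` is a dictionary keyed by property ID and each value is a list of statements. The sketch below shows the extraction on a trimmed, hand-written sample in that shape, so it runs without network access; the sample values are illustrative and `first_values` is a hypothetical helper.

```python
# Trimmed, hand-written sample in the shape of a Wikidata entity (not fetched from the API)
entity_dict = {
    "id": "Q64",
    "claims": {
        "P17": [
            {"mainsnak": {"snaktype": "value", "property": "P17",
                          "datavalue": {"value": {"id": "Q183"}, "type": "wikibase-entityid"}}}
        ],
        # Statements with snaktype "novalue" or "somevalue" carry no datavalue
        "P1082": [
            {"mainsnak": {"snaktype": "novalue", "property": "P1082"}}
        ],
    },
}

def first_values(entity):
    """Map each property ID to the value of its first statement, skipping statements without one."""
    attrs = {}
    for prop, statements in entity["claims"].items():
        snak = statements[0]["mainsnak"]
        if "datavalue" in snak:
            attrs[prop] = snak["datavalue"]["value"]
    return attrs

attrs = first_values(entity_dict)
```

Keeping only the first statement per property is a simplification; properties with several statements or ranked values would need extra handling.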
  2. Implement methods to import and integrate different data types:
    def import_geojson(self, file_path):
        gdf = gpd.read_file(file_path)
        for idx, row in gdf.iterrows():
            # Drop 'geometry' from the attributes; it is already passed as its own argument
            self.add_geospatial_node(idx, row.geometry, **row.drop('geometry').to_dict())

    def import_temporal_csv(self, file_path, timestamp_col, node_id_col):
        df = pd.read_csv(file_path, parse_dates=[timestamp_col])
        for idx, row in df.iterrows():
            # Pass the remaining columns as attributes, without repeating the ID and timestamp
            attrs = row.drop([node_id_col, timestamp_col]).to_dict()
            self.add_temporal_node(row[node_id_col], row[timestamp_col], **attrs)

    def spatial_query(self, geometry):
        return self.gdf[self.gdf.intersects(geometry)]

    def temporal_query(self, start_time, end_time):
        mask = (self.temporal_data['timestamp'] >= start_time) & (self.temporal_data['timestamp'] <= end_time)
        return self.temporal_data.loc[mask]
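
To illustrate what type preservation means for the CSV import, here is a small sketch that assumes pandas is installed and uses an inline CSV with made-up column names and values. `parse_dates` turns the timestamp column into real datetimes, and the remaining columns are split off as node attributes.

```python
import io
import pandas as pd

# Hypothetical event data; in practice this would be a file on disk
csv_text = "event_id,event_date,attendance\ne1,2023-03-01,120\ne2,2023-07-15,80\n"
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["event_date"])

# Split each row into node ID, timestamp and the remaining attributes
records = []
for _, row in df.iterrows():
    attrs = row.drop(["event_id", "event_date"]).to_dict()
    records.append((row["event_id"], row["event_date"], attrs))
```

Note that `iterrows` returns each row as a Series, so values in mixed-type rows may be upcast; for strict dtype preservation, iterating with `itertuples` or working column-wise is usually safer.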
  3. Usage example:
import pandas as pd
from shapely.geometry import box

graph = EnhancedGraph()

# Import geospatial data
graph.import_geojson("cities.geojson")

# Import temporal data
graph.import_temporal_csv("events.csv", timestamp_col="event_date", node_id_col="event_id")

# Import Wikidata
node_id = graph.import_wikidata("Q64")

# Perform spatial and temporal queries
some_polygon = box(-10, 35, 30, 60)  # hypothetical query area (min x, min y, max x, max y)
cities_in_area = graph.spatial_query(some_polygon)
events_in_timeframe = graph.temporal_query(pd.Timestamp("2023-01-01"), pd.Timestamp("2023-12-31"))

# Convert to GeoDataFrame or DataFrame for further analysis
gdf = graph.to_geopandas()
temporal_df = graph.to_temporal_pandas()

This implementation enhances TxtAI's graph capabilities by:

  1. Supporting geospatial and temporal data alongside the existing graph structure.
  2. Preserving attributes and types when importing data from various sources.
  3. Providing methods to query and analyze the data based on spatial and temporal criteria.
  4. Integrating with external data sources like Wikidata.

Regarding the initial type problem: this implementation addresses it only indirectly. It does not itself add a 'type' attribute to nodes, but node attributes are passed through on import, so a 'type' attribute or more complex data types can be stored and queried alongside the geospatial and temporal data.

The approach is well-integrated with TxtAI's ecosystem, extending its Graph class and using compatible libraries like geopandas and pandas. It also leverages NetworkX's underlying graph structure while adding geospatial and temporal capabilities on top of it.
