Core Temporal Graph Data API Discussion

shenyangHuang commented 1 week ago

How to store and process temporal graph data is a central topic to any temporal graph library. In this project, we aim to store and process both discrete and continuous time dynamic graphs, forming the first library that operates on both types of temporal graphs. How we store and process data is also closely linked in operations such as neighbor sampling #1 and computing graph statistics.

Main Operators

Here are the main operations that need to be supported by such data format.

Graph Aggregation Support: aggregate over the temporal graph over a given period of time, either across multiple snapshots or over a defined time interval.
- [x] aggregate_graph()
Type Conversion Support: conversion from CTDG to snapshots and DTDG to events
- [ ] to_snapshots() and to_events()
Discretization Support: discretize CTDG and DTDG to coarser time granularity, often into snapshots.
- [ ] to_snapshots() add argument for time granularity
Dynamic Node & Edge Feature Support: support optional features on each temporal edge as well as optional node feature for any given timestamp
- [ ] Implement GraphStore and EventStore
Batch / Snapshot Learning Support: output into formats that are suited for both snapshot-based and event-based methods in either a list of snapshots or batches of edges.
- [ ] add dataloader for snapshots and batches of edges, perhaps adding a specialized function for it

Implementation

See PR #8 for implementation

Implementation see data.py

Data Classes: BaseData, inherited by CTDG and DTDG class.

aggregate_graph(start_time, end_time) in BaseData
to_events() and to_snapshot() in CTDG and DTDG

Data Structure

The underlying data structure must be able to quickly search, sort, add and delete any time steps. Therefore, I propose a core structure of GraphStore dictionary ={timestamp: EventStore}. EventStore can be a dictionary with keys node_feat , edge_feat.

EventStore["node_feat"] is a dictionary of (dynamic) node features. EventStore["node_feat"] = {node_id: vec}.
EventStore["edge_feat"] is a dictionary of edges and their edge features. EventStore["edge_feat"] = {(src, dst): vec}.
GraphStore class can also have a static_node_feat property which is the standard node feature matrix that doesn't change over time.
GraphStore class can also have a static_graph_feat property which stores graph level features, can be added later.

diagram-20241119

shenyangHuang commented 1 week ago

Fast Operations and Limitations of Data Structure

By implementing the temporal graph data as a dictionary with time as keys there comes both fast and slow operations depending on the need.

Fast Operations:

Subsample a graph by time interval start_time and end_time
Retrieve edges, edge features and node features corresponding to a given timestamp t
Inserting / Deleting information for a set of specific timestamps.
Sort the graph by time (by default the dataset should be sorted chronologically when loaded)

Slow Operations:

Retrieving interactions / features of a set of specific nodes
Retrieving feature / recurrence of a set of specific edges

The slow operations might be critical, especially when retrieving the temporal neighborhood of a node at a certain point in time, which is often used in models such as TGN

Jacob-Chmura commented 1 week ago

@shenyangHuang This all looks good at a high level. I suggest that we move forward with the data structure you proposed. Once we have good testing CI, we can run perf tests to see where and how severe the bottlenecks are when running real models

Some additional comments

How do we expect the load distribution to look across this API? As you mentioned, the Sampling module will be a core client of this API, which will spam graph aggregations/range queries quite often. I assume Type Conversion and Discretization would only be done once during the training phase? Do we actually need to materialize the fully converted graph?
Integrating with the torch data loader API (torch geometric data loader) will be crucial. I think most people would like to work with a Dataloader to auto-sample their graph and give them the graph tensors they need for out-of-the-box samplers.
A potentially helpful mental model is to think of the underlying features as static entities, and our core temporal graph data API provides a good way of indexing over this data. Specifically, when we call to_snapshots, to_events, discretize, and aggregate, what we want in return is a view over the underlying nodes/edge tensors that gives us quick range queries and neighbour queries. Unless we need true tensor copies, perhaps just building and storing several of these indices is a good way forward. Client needs multiple discretizations and a specific neighbourhood sampler? Build these indices up front and in parallel. Since the queries come chronologically, we could even build these indexes asynchronously.
In an online setting (i.e. deployment), we must support a separate but equally important operator: edge insertion. The data structures you proposed is pretty great for this, but it's worth mentioning that incremental mutations should be made quick for edge-based indexes if the online setting is something OpenDG is aiming to be used for
Do we want to support heterogeneous temporal graphs? If so, can we get away with just storing node type/edge type info at the bottom of our Graph structure (in some Node, Edge objects) or does that lead to irregular tensors on GPU? More on the design level, do heterogeneous clients require some additional interface? (e.g. extract subgraph by node type within time range)
Just for clarification, when we to_snapshots() a CTDG, it's possible that the same edge $e = (vi, v_j)$ occurs multiple times in the same time partition. Do we take the last one or each one? If this is not obvious, it should probably be part of the API so the client can decide

fpour commented 1 week ago

some suggestions:

What about usingEventStore instead of DataStore?
For EventStore["edge_feat"]: is it better to save edges as tuples of edge_idx or (src, dst)?

shenyangHuang commented 1 week ago

How do we solve issues related to the same edge (same node) multiple time at the same timestamp? Do we support it or ignore it?

shenyangHuang commented 3 days ago

For node feature that is dynamic but rarely change over time, how do we store it efficiently without copying it many times? (suggestion from Guillaume) Maybe we should have a specific class for persistent networks (networks that models relationships with a duration), future work

shenyangHuang commented 2 days ago

Matthias feedback (avoid nested dictionaries): EventStore just have 1. edge_index (Tensor), 2. edge_feat (Tensor, # edges x feat), 3. node_index (Tensor), 4. node_feat (# node x feat)

can we not use dictionary? can we just keep tensors, append to it. Mapping from timestamp.

shenyangHuang commented 2 days ago

don't need CTDG and DTDG, just have a single class. maybe use has to specify the format of their timestamp.

maybe we should design with node_type and edge_type in mind

shenyangHuang / openDG