pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Time Series Data batching #1694

Open MichailChatzianastasis opened 4 years ago

MichailChatzianastasis commented 4 years ago

❓ Questions & Help

Hey, I have a time series problem where I have data of shape [n_samples, 6 * Data object], i.e. I want to represent every sample with 6 graphs. When I try to apply the DataLoader (https://pytorch-geometric.readthedocs.io/en/latest/modules/data.html), I don't get the expected result, as it doesn't concatenate the samples along the first dimension. Generally, is there a method to handle time series problems? Thanks in advance.

rusty1s commented 4 years ago

This is a tricky problem, and I'm working on better temporal graph support for PyG at the moment. For your case, it might be best to encode your 6 dynamic graphs into a single data object, e.g. by manually stacking edge_index diagonally or by saving the edge connectivity separately for different time steps:

data.edge_index1 = ...
data.edge_index2 = ...
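A minimal sketch of both options, assuming all 6 snapshots share the same fixed node set; the node counts, feature sizes, and connectivity below are placeholders:

import torch
from torch_geometric.data import Data

num_nodes, num_feats, num_steps = 10, 32, 6
# Placeholder connectivity for each of the 6 time steps (illustrative only).
snapshots = [torch.randint(0, num_nodes, (2, 15)) for _ in range(num_steps)]

# Option 1: store one edge_index attribute per time step on a single Data object.
data = Data(x=torch.randn(num_nodes, num_feats))
for t, ei in enumerate(snapshots, start=1):
    data[f'edge_index{t}'] = ei

# Option 2: stack the snapshots diagonally by offsetting the node indices of
# time step t by t * num_nodes, so all 6 graphs live in one block-diagonal edge_index.
edge_index = torch.cat(
    [ei + t * num_nodes for t, ei in enumerate(snapshots)], dim=1)
x = torch.randn(num_nodes, num_feats).repeat(num_steps, 1)  # node features repeated per step
stacked = Data(x=x, edge_index=edge_index)
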
ZikangZhou commented 3 years ago

I encountered the same problem. I read the documentation and found the way to handle "pairs of graphs", but in my case a time series contains many timesteps, let's say 20. The number of timesteps is so large that I have to save edge indices like this:

data.edge_index1 = ...
data.edge_index2 = ...
...
data.edge_index20 = ...

It doesn't look so nice. Hoping for better support for temporal graphs.

smorad commented 3 years ago

I'm also interested in representing batches of time-series data. I bet using the optimised torch_geometric scatter/gather would provide much better efficiency than the usual dense approach of "pad to max sequence_length across all batches".

Has there been any update on this? I could collapse batch and time to a single dimension, but I think that's less than ideal:

from torch_geometric.data import Batch

# Data shape [Batch, Time, features]
Batch.from_data_list([data[b, t] for b in batches for t in timesteps])

rusty1s commented 3 years ago

Why do you think collapsing the batch and time dimensions is not ideal? Do you have any suggestions on what "temporal" data handling should look like?

smorad commented 3 years ago

I suppose this goes into my specific use case. I'm adding a new node and some edges at each timestep, so at t-1 I would have t-1 nodes and at t I would have t nodes, t-1 of which are duplicated and identical to the nodes at t-1. The memory usage blows up in this case.

I suppose we can share node pointers across batch and time by doing this instead:

# Data shape [Batch, Time, features]
Batch.from_data_list([data for b in batches for t in timesteps])

This works because the edges are what actually encode the batch/time structure. But how would this work with the edges?

rusty1s commented 3 years ago

You could hold an additional vector that denotes which edges are present at a specific timestamp, and then do a simple masking, which should keep the memory requirements reasonably low:

edge_index = edge_index[:, timestamp < t]
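
A minimal sketch of this masking scheme, assuming each edge carries the timestep at which it was added; the names and tensors below are illustrative:

import torch
from torch_geometric.data import Data

num_nodes = 5
# Every edge that ever appears, plus the timestep at which each edge was added.
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 4]])
timestamp = torch.tensor([0, 1, 2, 3])

data = Data(x=torch.randn(num_nodes, 16), edge_index=edge_index,
            timestamp=timestamp)

# Snapshot at timestep t: keep only the edges that already exist.
t = 2
edge_index_t = data.edge_index[:, data.timestamp < t]
# edge_index_t now contains only the edges added at timesteps 0 and 1.
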
ugurbolat commented 1 year ago

For those who are interested: check out the new library, PyTorch Geometric Temporal.
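
A sketch of how that library exposes temporal graphs, based on its README examples; the dataset loader and signal API names are assumptions and may differ across versions:

from torch_geometric_temporal.dataset import ChickenpoxDatasetLoader
from torch_geometric_temporal.signal import temporal_signal_split

# A temporal "signal" is an iterable of graph snapshots, one per timestep.
loader = ChickenpoxDatasetLoader()
dataset = loader.get_dataset()
train_dataset, test_dataset = temporal_signal_split(dataset, train_ratio=0.8)

for time, snapshot in enumerate(train_dataset):
    # Each snapshot is a regular torch_geometric.data.Data object for this timestep.
    x, edge_index = snapshot.x, snapshot.edge_index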