metagraph-dev / metagraph

Multi-target API for graph analytics with Dask
https://metagraph.readthedocs.io/en/latest/
Apache License 2.0
26 stars 7 forks source link

Revisit basic types #50

Closed jim22k closed 4 years ago

jim22k commented 4 years ago
Type Description Characteristics (guidance) Abstract Properties Required Wrapper Methods/Properties
node_id - uint
- not really a type, more of a concept
Enum or Categorical? - 1:1 mapping from str to uint
- can be used to label nodes using strings
- O(1) lookup in either direction
Vector 1-D values - single dtype
- no missing values allowed
- O(1) lookup by position
- iteration order guaranteed
dtype
SparseVector Vector with missing values - single dtype
- missing values allowed
- O(1) lookup by position
- O(1) test for emptiness by position
dtype
Matrix 2-D values - single dtype
- no missing values allowed
- O(1) lookup by position
dtype, is_square
SparseMatrix Matrix with missing values - single dtype
- missing values allowed
- O(1) lookup by position
- O(1) test for emptiness by position
dtype, is_square
DataFrame - 2-D table
- each column has a unique string label
- each column has a single dtype
NodeSet a set of nodes - iteration order not guaranteed
- O(1) indication of node inclusion
num_nodes, __contains__
NodeMap a node:value mapping - iteration order not guaranteed
- O(1) lookup by node
- single dtype
- no missing values
dtype num_nodes, __getitem__, __contains__
NodeTable represents a set of nodes, with multiple values for each node; think of it as a DataFrame with a row index of nodes and a column for each property - each property has a single dtype
- each property has a unique string label
- values are allowed to be empty
num_nodes, num_properties
Graph a set of nodes plus an edge:value mapping - iteration order not guaranteed
- O(1) lookup for node inclusion
- O(1) lookup for edge value
- O(1) lookup for edge presence
dtype, is_directed, is_weighted, has_negative_weights num_nodes, num_edges
BipartiteGraph two distinct sets of nodes with edges allowed only between the sets; edges have values - each node set is logically distinct, but not necessarily numerically distinct; meaning node 1 in set A is different than node 1 in set B
- single dtype for edge values
- edges are undirected
dtype, is_weighted num_A_nodes, num_B_nodes, num_edges
PropertyGraph multiple graphs with the same nodes, each graph represents a unique property; each property has a unique set of edge weights - each graph has a single dtype
- each graph has a string label associated with the property
is_directed num_nodes, num_properties

Additional things to consider

  1. Which types require a Wrapper? Probably all the Nodes and Graph variants

  2. Should we formalize the hierarchy of abstract properties?

    • is_weighted governs whether dtype has any meaning
    • dtype governs whether has_negative_weights has any meaning
jim22k commented 4 years ago

Things brought up in the 2020-05-08 meeting:

eriknw commented 4 years ago

Without going into full detail, here is a summary of what Jim and I discussed.

I like having the primary objects:

A single node is generally given by its integer id, and a single edge is generally given by a a tuple of nodes.

For calls that return a single node or edge, we can create lightweight Node and Edge objects.

I think we should have NodeIndex that can map between node ids and any hashable Python object (and reverse). We should handle these conversions in metagraph--don`t require backends to do this (or maybe make it trivial to do).

All graph-like things can have an optional NodeIndex. This lets single nodes and edges to be referred to by any hashable Python objects excluding ints.

I think datatypes should have properties such as positive and nonzero.

Values from {Node,Edge}Map is simply e.g. a Python scalar.

Values from {Node,Edge}Table is a Python dict.

This gives us the possibility of composing a more rich Graph object that is comprised of a Node{Set,Map,Table} and Edge{Set,Map,Table} and an optional NodeIndex. We can then have edges_type and nodes_type as a property if we need to distinguish between these. This could be especially handy when converting to/from external graph objects.

Similarly, in the future we could introduce rich BipartiteGraph, DynamicGraph (and DyanmicEdge*), HyperGraph, MultiGraph, etc. objects

Question: can node or node_id be a dtype?

Finally, DataFrame, Vector, and Matrix are more-or-less unchanged and can evolve as necessary.