Revisit basic types - Githubissues

jim22k commented 4 years ago

Type	Description	Characteristics (guidance)	Abstract Properties	Required Wrapper Methods/Properties
node_id	- uint - not really a type, more of a concept
Enum or Categorical?	- 1:1 mapping from str to uint - can be used to label nodes using strings	- O(1) lookup in either direction
Vector	1-D values	- single dtype - no missing values allowed - O(1) lookup by position - iteration order guaranteed	dtype
SparseVector	Vector with missing values	- single dtype - missing values allowed - O(1) lookup by position - O(1) test for emptiness by position	dtype
Matrix	2-D values	- single dtype - no missing values allowed - O(1) lookup by position	dtype, is_square
SparseMatrix	Matrix with missing values	- single dtype - missing values allowed - O(1) lookup by position - O(1) test for emptiness by position	dtype, is_square
DataFrame	- 2-D table - each column has a unique string label - each column has a single dtype
NodeSet	a set of nodes	- iteration order not guaranteed - O(1) indication of node inclusion	`num_nodes`, `__contains__`
NodeMap	a node:value mapping	- iteration order not guaranteed - O(1) lookup by node - single dtype - no missing values	dtype	`num_nodes`, `__getitem__`, `__contains__`
NodeTable	represents a set of nodes, with multiple values for each node; think of it as a DataFrame with a row index of nodes and a column for each property	- each property has a single dtype - each property has a unique string label - values are allowed to be empty	`num_nodes`, `num_properties`
Graph	a set of nodes plus an edge:value mapping	- iteration order not guaranteed - O(1) lookup for node inclusion - O(1) lookup for edge value - O(1) lookup for edge presence	dtype, is_directed, is_weighted, has_negative_weights	`num_nodes`, `num_edges`
BipartiteGraph	two distinct sets of nodes with edges allowed only between the sets; edges have values	- each node set is logically distinct, but not necessarily numerically distinct; meaning node 1 in set A is different than node 1 in set B - single dtype for edge values - edges are undirected	dtype, is_weighted	`num_A_nodes`, `num_B_nodes`, `num_edges`
PropertyGraph	multiple graphs with the same nodes, each graph represents a unique property; each property has a unique set of edge weights	- each graph has a single dtype - each graph has a string label associated with the property	is_directed	`num_nodes`, `num_properties`

Additional things to consider

Which types require a Wrapper? Probably all the Nodes and Graph variants
Should we formalize the hierarchy of abstract properties?
- is_weighted governs whether dtype has any meaning
- dtype governs whether has_negative_weights has any meaning

jim22k commented 4 years ago

Things brought up in the 2020-05-08 meeting:

For Enum/Categorical, is a string label too restrictive? It seems good from a portability requirement, but warrants more discussion.
Should NodeMap have a label? That would allow a column of a NodeTable to be returned as a NodeMap.
Similarly, a column of a DataFrame is just a Vector if we can keep the column header as a label
What would a row of a NodeTable or DataFrame be represented as? Some kind of struct? Just return a 1-row version of the original?
Should unweighted graphs be their own type? We split NodeSet from NodeMap. You could make the same argument for splitting graphs based on whether there are values or not. Erik proposed Graph (unweighted) and Network (weighted).

eriknw commented 4 years ago

Without going into full detail, here is a summary of what Jim and I discussed.

I like having the primary objects:

NodeSet, NodeMap, NodeTable
EdgeSet, EdgeMap, EdgeTable

A single node is generally given by its integer id, and a single edge is generally given by a a tuple of nodes.

For calls that return a single node or edge, we can create lightweight Node and Edge objects.

I think we should have NodeIndex that can map between node ids and any hashable Python object (and reverse). We should handle these conversions in metagraph--don`t require backends to do this (or maybe make it trivial to do).

All graph-like things can have an optional NodeIndex. This lets single nodes and edges to be referred to by any hashable Python objects excluding ints.

I think datatypes should have properties such as positive and nonzero.

Values from {Node,Edge}Map is simply e.g. a Python scalar.

Values from {Node,Edge}Table is a Python dict.

This gives us the possibility of composing a more rich Graph object that is comprised of a Node{Set,Map,Table} and Edge{Set,Map,Table} and an optional NodeIndex. We can then have edges_type and nodes_type as a property if we need to distinguish between these. This could be especially handy when converting to/from external graph objects.

Similarly, in the future we could introduce rich BipartiteGraph, DynamicGraph (and DyanmicEdge*), HyperGraph, MultiGraph, etc. objects

Question: can node or node_id be a dtype?

Finally, DataFrame, Vector, and Matrix are more-or-less unchanged and can evolve as necessary.

metagraph-dev / metagraph

Revisit basic types #50

Additional things to consider