snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Error if `graph_list` in DatasetSaver contains only one graph #274

Closed sophiakrix closed 2 years ago

sophiakrix commented 2 years ago

Hi there,

I was experimenting with the DatasetSaver class to reformat my own knowledge graph into the OGB format. I am working with a heterogeneous graph that has multiple edge and node types.

Could you tell me why a graph_list is required (link to code), rather than just a single graph instance?

Since I have one graph object (which I created from several edge-type-specific graphs), I passed it as the only element of the graph_list.

>>> edge mapping
{('drug', 'INTERACTS_WITH', 'gene'): 0,
 ...
 ('gene', 'ASSOCIATED_WITH', 'disease'): 17}

>>> num_nodes_dict
{'drug': 4642,
...
 'variant': 335975}

graph_list = []

graph = dict()
graph['edge_index_dict'] = edge_mapping
graph['num_nodes_dict'] = num_nodes_dict

graph_list.append(graph)

# saving a list of graphs
saver.save_graph_list(graph_list)

The error I got is the following:

ValueError: zero-dimensional arrays cannot be concatenated
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-110-b9767e0de803> in <module>
      8 
      9 # saving a list of graphs
---> 10 saver.save_graph_list(graph_list)

~/git/ogb/ogb/io/save_dataset.py in save_graph_list(self, graph_list)
    363 
    364         if self.is_hetero:
--> 365             self._save_graph_list_hetero(graph_list)
    366             self.has_node_attr = ('node_feat_dict' in graph_list[0]) and (graph_list[0]['node_feat_dict'] is not None)
    367             self.has_edge_attr = ('edge_feat_dict' in graph_list[0]) and (graph_list[0]['edge_feat_dict'] is not None)

~/git/ogb/ogb/io/save_dataset.py in _save_graph_list_hetero(self, graph_list)
    115             # representing triplet (head, rel, tail) as a single string 'head___rel___tail'
    116             triplet_cat = '___'.join(triplet)
--> 117             edge_index = np.concatenate([graph['edge_index_dict'][triplet] for graph in graph_list], axis = 1).astype(np.int64)
    118             if edge_index.shape[0] != 2:
    119                 raise RuntimeError('edge_index must have shape (2, num_edges)')

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: zero-dimensional arrays cannot be concatenated

Could you tell me why this happens and what I can do about it? Thanks in advance!

weihua916 commented 2 years ago

Your edge_index_dict should be a dictionary of edge_index of shape (2, num_edges). It looks like you are giving an integer value (e.g., ('drug', 'INTERACTS_WITH', 'gene'): 0).
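
For example, a minimal sketch of a correctly shaped edge_index_dict (the triplets and node indices below are made up for illustration; num_nodes_dict and saver are assumed to be the ones from your snippet above) could look like:

import numpy as np

# each value must be an array of shape (2, num_edges) for that relation:
# row 0 = head node indices, row 1 = tail node indices (hypothetical values)
edge_index_dict = {
    ('drug', 'INTERACTS_WITH', 'gene'): np.array([[0, 1, 2], [5, 7, 9]], dtype=np.int64),
    ('gene', 'ASSOCIATED_WITH', 'disease'): np.array([[3, 4], [0, 2]], dtype=np.int64),
}

graph = dict()
graph['edge_index_dict'] = edge_index_dict
graph['num_nodes_dict'] = num_nodes_dict  # e.g. {'drug': 4642, ..., 'variant': 335975}

# a list with a single heterogeneous graph
saver.save_graph_list([graph])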

sophiakrix commented 2 years ago

@weihua916 Thanks, that was actually the point! Now it works.

sophiakrix commented 2 years ago

I've got a follow-up question:

I am at the 4th step (saving the dataset split) of the tutorial on DatasetSaver, which says:

split_idx = dict()
perm = np.random.permutation(num_data)
split_idx['train'] = perm[:int(0.8*num_data)]
split_idx['valid'] = perm[int(0.8*num_data): int(0.9*num_data)]
split_idx['test'] = perm[int(0.9*num_data):]
saver.save_split(split_idx, split_name = 'random')

Could you explain what exactly num_data should be? I saw that in step 2 (Saving graph list), num_data was set to an integer and that many graphs were created and added to the graph_list. I was wondering whether those graphs are supposed to be the train, validation and test graphs. If so, do the edge_index_dict and the num_nodes_dict then need to correspond to the train, val and test graph respectively?

Sorry, I am confused about the concept of the graph_list and therefore about what the split_idx values should be. It would be great to get some insight into this!

weihua916 commented 2 years ago

Hi! The semantics of split_idx differ across tasks. For a graph-level task (which is what the tutorial uses as an example), it should provide the indices of the graphs in the train/val/test sets. In your case, you are probably doing link prediction on a single graph; for that case, split_idx should split the edges. Please read this part of the test case to understand how to set up split_idx in your scenario!
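
For instance, a minimal sketch of an edge-level split (assuming num_edges is the total number of edges of your single graph, and saver is your DatasetSaver) could be:

import numpy as np

# permute all edge indices of the single graph and cut into 80/10/10
perm = np.random.permutation(num_edges)
split_idx = dict()
split_idx['train'] = perm[:int(0.8 * num_edges)]
split_idx['valid'] = perm[int(0.8 * num_edges):int(0.9 * num_edges)]
split_idx['test'] = perm[int(0.9 * num_edges):]
saver.save_split(split_idx, split_name = 'random')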

sophiakrix commented 2 years ago

All right, I think I got it! I now just have one graph object in the graph_list, and I added the indices of the edges for the corresponding splits:

# Random split over the edges of the (networkx) graph

import random
import numpy as np

num_edges_total = graph.number_of_edges()
train_ratio = 0.7
val_ratio = 0.2
test_ratio = 0.1

all_edges = np.array(list(graph.edges))

# shuffle the edge indices and cut them into train/val/test portions
all_edge_indices = list(range(num_edges_total))
random.shuffle(all_edge_indices)
train_end = int(train_ratio * num_edges_total)
val_end = int((train_ratio + val_ratio) * num_edges_total)
train_indices = np.array(all_edge_indices[:train_end])
val_indices = np.array(all_edge_indices[train_end:val_end])
test_indices = np.array(all_edge_indices[val_end:])

# Save split indices
split_idx = dict()
split_idx['train'] = train_indices
split_idx['valid'] = val_indices
split_idx['test'] = test_indices
saver.save_split(split_idx, split_name = 'random')

At the 8th step (Testing the dataset object), I get an error. I am not sure if this is because I am doing link prediction rather than graph property prediction as in the tutorial. I tried both LinkPropPredDataset and PygLinkPropPredDataset, and both raise an error that get_idx_split() does not exist. In:

from ogb.linkproppred.dataset_pyg import PygLinkPropPredDataset
dataset = PygLinkPropPredDataset(dataset_name, meta_dict = meta_dict)

# see if it is working properly
print(dataset[0])
print(dataset.get_idx_split())

Out:

Processing...
Loading necessary files...
This might take a while.
Processing graphs...
100%|██████████| 1/1 [00:00<00:00, 1367.56it/s]
Converting graphs into PyG objects...
100%|██████████| 1/1 [00:00<00:00, 390.60it/s]
Saving...
Data(
  num_nodes_dict={
    disease=21757,
    drug=2802,
    gene=18692,
    pathway=2441,
    variant=333236
  },
  edge_index_dict={
...
(variant, toxicity, drug)=[2, 3271]
  }
)

Done!

AttributeError: 'PygLinkPropPredDataset' object has no attribute 'get_idx_split'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-114-db1f342409ed> in <module>
      4 # see if it is working properly
      5 print(dataset[0])
----> 6 print(dataset.get_idx_split())

AttributeError: 'PygLinkPropPredDataset' object has no attribute 'get_idx_split'

weihua916 commented 2 years ago

Use get_edge_split() instead of get_idx_split().
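
That is, a minimal sketch of the intended usage (with your dataset_name and meta_dict from above):

from ogb.linkproppred import PygLinkPropPredDataset

dataset = PygLinkPropPredDataset(dataset_name, meta_dict = meta_dict)
split_edge = dataset.get_edge_split()
print(split_edge['train'])  # edge split for training (keys depend on the dataset)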

sophiakrix commented 2 years ago

Thanks, that works!

How can I load the dataset object in another script? I have one notebook where I create the custom graph object, and a separate script where I want to do ML with it. Yet simply calling

dataset = LinkPropPredDataset(dataset_name, meta_dict = meta_dict)

with the meta_dict I saved does not work.