Closed: sophiakrix closed this issue 2 years ago
Your edge_index_dict should be a dictionary that maps each edge type to an edge_index array of shape (2, num_edges). It looks like you are passing an integer value instead (e.g., ('drug', 'INTERACTS_WITH', 'gene'): 0).
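To make the shape requirement concrete, here is a minimal NumPy sketch. The node counts and the three example edges are made up for illustration; only the type names come from the message above. The check function is a hypothetical helper, not part of OGB, and the key names in the final graph dictionary should be verified against the DatasetSaver tutorial.

```python
import numpy as np

# Hypothetical toy sizes; a real graph would use its own node counts.
num_nodes_dict = {"drug": 4, "gene": 5}

# Each value is an integer array of shape (2, num_edges):
# row 0 holds source node indices, row 1 holds target node indices.
edge_index_dict = {
    ("drug", "INTERACTS_WITH", "gene"): np.array(
        [[0, 1, 3],   # drug indices
         [2, 0, 4]],  # gene indices
        dtype=np.int64,
    ),
}


def check_edge_index_dict(edge_index_dict, num_nodes_dict):
    """Hypothetical helper: raise ValueError for malformed entries."""
    for (src, rel, dst), edge_index in edge_index_dict.items():
        arr = np.asarray(edge_index)
        if arr.ndim != 2 or arr.shape[0] != 2:
            raise ValueError(
                f"{(src, rel, dst)}: expected shape (2, num_edges), got {arr.shape}"
            )
        if arr.size == 0:
            continue
        if (
            arr.min() < 0
            or arr[0].max() >= num_nodes_dict[src]
            or arr[1].max() >= num_nodes_dict[dst]
        ):
            raise ValueError(f"{(src, rel, dst)}: node index out of range")


check_edge_index_dict(edge_index_dict, num_nodes_dict)

# A single heterogeneous graph is still passed as a one-element list.
graph = {"edge_index_dict": edge_index_dict, "num_nodes_dict": num_nodes_dict}
graph_list = [graph]
```

Passing an integer such as 0 in place of the array fails the shape check, which mirrors the mistake described in the reply.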
@weihua916 Thanks, that was actually the point! Now it works.
I've got a follow-up question:
I am at the 4th step, saving the data set split, from the tutorial on DatasetSaver. In there it says:
split_idx = dict()
perm = np.random.permutation(num_data)
split_idx['train'] = perm[:int(0.8*num_data)]
split_idx['valid'] = perm[int(0.8*num_data): int(0.9*num_data)]
split_idx['test'] = perm[int(0.9*num_data):]
saver.save_split(split_idx, split_name = 'random')
Could you possibly explain what exactly num_data should be? I saw that in step 2 (Saving graph list), num_data was set to an integer, and that many graphs were created and added to the graph_list. I was wondering whether these graphs should be the train, validation and test graphs. If so, do the edge_index_dict and the num_nodes_dict then need to correspond to the train, val and test graphs respectively?
Sorry, but I am confused about the concept of the graph_list and was wondering what the split_idx values should be. It would be great to get some insight into this!
Hi! The semantics of split_idx differ between tasks. For a graph-level task (which is the example given in the tutorial), it should provide the indices of the graphs in the train/val/test sets. In your case, you are probably doing link prediction on a single graph. In that case, split_idx should split the edges. Please read this part of the test case to understand how you might want to set up your split_idx in your scenario!
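The difference between the two kinds of split can be sketched with plain NumPy. In this sketch the helper name random_split, the fractions, and the toy edge_index are assumptions for illustration. The exact dictionary layout that saver.save_split expects for link prediction is defined by the linked test case and is not reproduced here.

```python
import numpy as np


def random_split(num_items, train_frac=0.8, valid_frac=0.1, seed=0):
    """Hypothetical helper: randomly partition range(num_items) into three index arrays."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_items)
    n_train = int(train_frac * num_items)
    n_valid = int(valid_frac * num_items)
    return {
        "train": perm[:n_train],
        "valid": perm[n_train:n_train + n_valid],
        "test": perm[n_train + n_valid:],
    }


# Graph-level task: the items being split are whole graphs in graph_list.
graph_split = random_split(num_items=100)

# Link prediction on a single graph: the items being split are edges,
# i.e. columns of one edge_index array (toy data with 10 edges).
edge_index = np.array([[0, 0, 1, 2, 3, 4, 4, 5, 6, 7],
                       [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]])
edge_split = random_split(num_items=edge_index.shape[1])
train_edges = edge_index[:, edge_split["train"]]  # shape (2, num_train_edges)
```

The same helper serves both cases; only the meaning of an index changes, from "position in graph_list" to "column of edge_index".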
All right, I think I got it! I now have just one graph object in the graph_list, and I added the indices of the edges for the corresponding splits:
import random
import numpy as np
# Random split
num_edges_total = graph.number_of_edges()
train_ratio = 0.7
val_ratio = 0.2
test_ratio = 0.1
all_edges = np.array(list(graph.edges))
# define indices of edges for train, val, test split
all_edge_indices = list(range(num_edges_total))
random.shuffle(all_edge_indices)
train_indices = np.array(all_edge_indices[:int(train_ratio*num_edges_total)])
val_indices = np.array(all_edge_indices[int(train_ratio*num_edges_total):int((train_ratio+val_ratio)*num_edges_total)])
test_indices = np.array(all_edge_indices[int((train_ratio+val_ratio)*num_edges_total):])
# Save split indices
split_idx = dict()
split_idx['train'] = train_indices
split_idx['valid'] = val_indices
split_idx['test'] = test_indices
saver.save_split(split_idx, split_name = 'random')
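Before saving a split like this, it can be worth checking that the three index arrays are disjoint and together cover every edge. The following is a small sketch with a hypothetical helper, check_split, and toy indices; it is independent of OGB.

```python
import numpy as np


def check_split(split_idx, num_items):
    """Hypothetical helper: verify train/valid/test form a partition of range(num_items)."""
    all_idx = np.concatenate(
        [np.asarray(split_idx[key]) for key in ("train", "valid", "test")]
    )
    if len(all_idx) != num_items:
        raise ValueError(f"expected {num_items} indices in total, got {len(all_idx)}")
    if not np.array_equal(np.sort(all_idx), np.arange(num_items)):
        raise ValueError("indices overlap or fall outside the valid range")


# Toy example with 10 edges split 7 / 2 / 1.
demo_split = {
    "train": np.array([3, 0, 9, 5, 1, 7, 2]),
    "valid": np.array([8, 4]),
    "test": np.array([6]),
}
check_split(demo_split, num_items=10)
```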
When I am at the 8th step (Testing the dataset object), I actually get an error. I am not sure whether this is because I am doing link prediction rather than graph property prediction as in the tutorial. I tried both LinkPropPredDataset and PygLinkPropPredDataset, and both throw an error saying that they have no get_idx_split() function:
In:
from ogb.linkproppred.dataset_pyg import PygLinkPropPredDataset
dataset = PygLinkPropPredDataset(dataset_name, meta_dict = meta_dict)
# see if it is working properly
print(dataset[0])
print(dataset.get_idx_split())
Out:
Processing...
Loading necessary files...
This might take a while.
Processing graphs...
100%|██████████| 1/1 [00:00<00:00, 1367.56it/s]
Converting graphs into PyG objects...
100%|██████████| 1/1 [00:00<00:00, 390.60it/s]
Saving...
Data(
num_nodes_dict={
disease=21757,
drug=2802,
gene=18692,
pathway=2441,
variant=333236
},
edge_index_dict={
...
(variant, toxicity, drug)=[2, 3271]
}
)
Done!
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-114-db1f342409ed> in <module>
4 # see if it is working properly
5 print(dataset[0])
----> 6 print(dataset.get_idx_split())
AttributeError: 'PygLinkPropPredDataset' object has no attribute 'get_idx_split'
Use get_edge_split() instead.
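For code that should work with both the tutorial's graph-level dataset and a link prediction dataset, one option is to look up whichever accessor the object provides. This is only a sketch: get_split is a hypothetical helper, and the stand-in class below replaces a real OGB dataset so the snippet runs without any downloaded data.

```python
def get_split(dataset):
    """Hypothetical helper: call get_edge_split() if present, else get_idx_split()."""
    for name in ("get_edge_split", "get_idx_split"):
        method = getattr(dataset, name, None)
        if callable(method):
            return method()
    raise AttributeError(
        f"{type(dataset).__name__} has neither get_edge_split nor get_idx_split"
    )


class FakeLinkDataset:
    """Stand-in for a link prediction dataset object (not an OGB class)."""

    def get_edge_split(self):
        return {"train": {}, "valid": {}, "test": {}}


split_edge = get_split(FakeLinkDataset())
print(sorted(split_edge.keys()))
```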
Thanks, that works!
How can I load the graph object in another script? I have one notebook where I process the custom graph object, and a separate script where I want to do ML with it. Yet simply calling
dataset = LinkPropPredDataset(dataset_name, meta_dict = meta_dict)
with the meta_dict I saved does not work.
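One thing that may be worth ruling out is a relative directory path stored in the meta_dict, which would resolve differently when the script runs from another working directory. The sketch below persists a dictionary with pickle and makes such a path absolute first. It assumes that meta_dict is a plain picklable dictionary and that the dataset directory is stored under a key named 'dir_path'; both are assumptions, the helper names are hypothetical, and the thread does not confirm that this is the cause.

```python
import os
import pickle
import tempfile


def save_meta_dict(meta_dict, out_path, path_keys=("dir_path",)):
    """Hypothetical helper: pickle a copy of meta_dict with absolute paths.

    path_keys lists the entries assumed to hold directory paths.
    """
    portable = dict(meta_dict)
    for key in path_keys:
        if key in portable:
            portable[key] = os.path.abspath(portable[key])
    with open(out_path, "wb") as f:
        pickle.dump(portable, f)
    return portable


def load_meta_dict(in_path):
    """Hypothetical helper: load a dictionary written by save_meta_dict."""
    with open(in_path, "rb") as f:
        return pickle.load(f)


# Toy round trip; the dictionary contents are made up for illustration.
toy_meta = {"dir_path": "dataset/my_custom_graph", "binary": "True"}
tmp_file = os.path.join(tempfile.mkdtemp(), "meta_dict.pkl")
saved = save_meta_dict(toy_meta, tmp_file)
loaded = load_meta_dict(tmp_file)
```

In the ML script, the loaded dictionary would then be passed as the meta_dict argument in the same way as in the notebook.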
Hi there,
I was experimenting with the DatasetSaver class to reformat my own knowledge graph into the OGB format. I am working with a heterogeneous graph that has multiple edge and node types.
Could you tell me why a graph_list is required (link to code), and not just a single graph instance? Since I have one graph object (which I created from several edge-type-specific graphs), I passed this one to the graph_list.
The error I got is the following:
Could you tell me why this happens and what I can do about it? Thanks in advance!