Closed qhtjrmin closed 2 years ago
Thanks for your question. This is actually the expected behavior, because ogbn-products is a benchmark graph for the task of node classification. The goal for this graph is to predict the category of a product (node) in a multi-class classification setup. Thus the nodes in the graph are split into sets of labeled training, validation, and test nodes, but the edges of the graph are not split in any way. All edges in the graph are contained in the train_edges.bin file.
If you would like to run link prediction on the ogbn-products dataset you will need to treat it as a custom graph for link prediction and split the graph edges into random train/val/test edges manually. This can be done as outlined in this example. In that example, we use the ogbn-arxiv node classification dataset as a custom graph for link prediction.
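As a rough illustration of what "split the graph edges into random train/val/test edges manually" can look like, here is a minimal NumPy sketch. The file names, the int32 dtype, and the 90/5/5 split ratio are assumptions for illustration; follow the linked example for the actual preprocessing pipeline.

```python
import os
import numpy as np

# Hypothetical edge list: one row per edge, columns = (src, dst).
edges = np.random.randint(0, 1000, size=(5000, 2))

# Shuffle and split the edges 90/5/5 into train/valid/test
# (the ratio here is illustrative, not prescribed by Marius).
rng = np.random.default_rng(0)
perm = rng.permutation(len(edges))
n_valid = n_test = len(edges) // 20
valid = edges[perm[:n_valid]]
test = edges[perm[n_valid:n_valid + n_test]]
train = edges[perm[n_valid + n_test:]]

# Write each split as a flat binary file; the dtype must match the
# `dtype: int` setting in the storage config (int32 assumed here).
train.astype(np.int32).tofile("train_edges.bin")
valid.astype(np.int32).tofile("validation_edges.bin")
test.astype(np.int32).tofile("test_edges.bin")
```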
Thank you for your rapid reply!
I wanted to perform node classification with this dataset, but I get the following error:
$ ./marius_train examples/configuration/ogbn_product.yaml
[2022-08-10 02:20:02.722] [info] [marius.cpp:43] Start initialization
file reading error: /home/marius/datasets/ogbn_products/edges/validation_edges.bin
file reading error: /home/marius/datasets/ogbn_products/edges/test_edges.bin
In this case, how could I run marius_train??
Can you share the ogbn_products.yaml file that you are using?
I wanted to try running it first, so I used custom_nc.yaml almost as-is.
The following is the current ogbn_products.yaml:
model:
  learning_task: NODE_CLASSIFICATION
  encoder:
    train_neighbor_sampling:
      - type: ALL
      - type: ALL
      - type: ALL
    layers:
      - type: FEATURE
        output_dim: 100
        bias: true
      - type: GNN
        options:
          type: GRAPH_SAGE
          aggregator: MEAN
        input_dim: 100
        output_dim: 100
        bias: true
      - type: GNN
        options:
          type: GRAPH_SAGE
          aggregator: MEAN
        input_dim: 100
        output_dim: 100
        bias: true
      - type: GNN
        options:
          type: GRAPH_SAGE
          aggregator: MEAN
        input_dim: 100
        output_dim: 47
        bias: true
  decoder:
    type: NODE
  loss:
    type: CROSS_ENTROPY
    options:
      reduction: SUM
  dense_optimizer:
    type: ADAM
    options:
      learning_rate: 0.01
storage:
  device_type: cuda
  dataset:
    dataset_dir: datasets/ogbn_products
  edges:
    type: DEVICE_MEMORY
    options:
      dtype: int
  features:
    type: DEVICE_MEMORY
    options:
      dtype: float
training:
  batch_size: 1000
  num_epochs: 10
  pipeline:
    sync: true
evaluation:
  batch_size: 1000
  pipeline:
    sync: true
Nothing immediately jumps out as a problem. However, I can't tell if the YAML formatting is correct. Can you also share the full_config.yaml file which is generated in the dataset directory so I can take a look at that?
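As an aside, a quick way to check whether a config parses the way you intend is to round-trip it through a YAML parser: mis-indented keys show up nested under the wrong parent. This sketch assumes PyYAML is installed and uses an inline snippet; in practice, point safe_load at your ogbn_products.yaml file.

```python
import yaml  # PyYAML; third-party, assumed installed

# Inline snippet for illustration; replace with
# yaml.safe_load(open("ogbn_products.yaml")) for a real file.
snippet = """
model:
  learning_task: NODE_CLASSIFICATION
  encoder:
    train_neighbor_sampling:
      - type: ALL
"""
cfg = yaml.safe_load(snippet)

# If indentation were off, these lookups would fail or return None.
print(cfg["model"]["learning_task"])                          # NODE_CLASSIFICATION
print(cfg["model"]["encoder"]["train_neighbor_sampling"][0])  # {'type': 'ALL'}
```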
I can't find a full_config.yaml file. Do you mean the dataset.yaml file which is generated in the dataset directory? The following is dataset.yaml:
dataset_dir: /home/marius/datasets/ogbn_products/
num_edges: 61859140
num_nodes: 2449029
num_relations: 1
num_train: 196615
num_valid: 39323
num_test: 2213091
node_feature_dim: 100
rel_feature_dim: -1
num_classes: 47
initialized: false
If you can't find the full_config.yaml file, my guess is that your ogbn_products.yaml has some formatting problems causing the parsing of the config to get messed up. Try this config file (where I have also switched to uniform neighborhood sampling, because sampling all neighbors will likely cause a GPU OOM on the products graph). This config file worked for me on the main branch. You can also see another example config for products here.
model:
  learning_task: NODE_CLASSIFICATION
  encoder:
    train_neighbor_sampling:
      - type: UNIFORM
        options:
          max_neighbors: 15
      - type: UNIFORM
        options:
          max_neighbors: 10
      - type: UNIFORM
        options:
          max_neighbors: 5
    eval_neighbor_sampling:
      - type: UNIFORM
        options:
          max_neighbors: 15
      - type: UNIFORM
        options:
          max_neighbors: 10
      - type: UNIFORM
        options:
          max_neighbors: 5
    layers:
      - - type: FEATURE
          output_dim: 100
          bias: true
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          input_dim: 100
          output_dim: 100
          bias: true
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          input_dim: 100
          output_dim: 100
          bias: true
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          input_dim: 100
          output_dim: 47
          bias: true
  decoder:
    type: NODE
  loss:
    type: CROSS_ENTROPY
    options:
      reduction: SUM
  dense_optimizer:
    type: ADAM
    options:
      learning_rate: 0.01
storage:
  device_type: cuda
  dataset:
    dataset_dir: datasets/ogbn_products
  edges:
    type: DEVICE_MEMORY
    options:
      dtype: int
  features:
    type: DEVICE_MEMORY
    options:
      dtype: float
training:
  batch_size: 1000
  num_epochs: 10
  pipeline:
    sync: true
evaluation:
  batch_size: 1000
  pipeline:
    sync: true
Oh, thanks a lot! It works well now! Additionally, could you share the configuration files used in the experiments of the Marius++ paper (NC - Papers100M, Mag240M-C / LP - Freebase86m, WikiKG90Mv2)?
Yes, those will be shared soon. There will be a separate branch with the Marius++ artifact in the next two weeks, so keep an eye out! Closing this issue.
Hello, could you let me know when the configuration files for the Marius++ paper experiments will be released?
We have released the artifact for the Marius++ paper (which will be renamed to MariusGNN) here. The configs used for the paper can be found there (some configs have not been uploaded yet but will be soon). Note that as Marius/MariusGNN are under active development, the artifact contains an older version of the code base. As such, the configuration file format differs from the one currently used by the main branch, but the older artifact configs should be easy to convert to the new format.
Hello,
My question is related to the edges used in the node classification task. As you mentioned, the training edges include all the edges of the dataset. However, given a set of training nodes N, shouldn't we work on only the edges whose endpoints belong to N? Otherwise it is possible to sample a node v which is not included in the training node set N. Therefore, my question is: why do you use all the edges when you train on only a subset of the nodes?
This is a good question!
Fundamentally, the node classification learning task takes as input a graph (a set of nodes and their edges) and the base features of each node in this graph. Then the goal is to predict labels of the nodes in the graph using the base features and the edges.
To train a model to perform this task, we manually label a subset of the nodes in the graph (N, as you call it). Let's focus on a single node u in this set N. While training a GNN model, we use the base features of the node u, as well as the base features of its neighbors in the input graph to predict the label (which we know from the manual labeling).
Consider a node v in the neighborhood of u. Your concern is: what happens if v is not a training node? The reason this is okay is that when node v is used as a neighbor of node u, the label of node v is not used. For example, even if node v is a "test node" with a hidden label, this information is never used when training on node u. Only the base features of node v and the fact that nodes v and u are connected are used when training the model. This information is assumed to be valid to use as part of the general node classification task above. I.e., to predict the label of u, we assume we are allowed to use the edges in the graph and the features of the nodes in the graph.
In fact in the general case of node classification over large graphs, it is most often the case that node v has no label at all (i.e., it is not part of the train set or the validation set or the test set).
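The point above can be sketched in a few lines: a GraphSAGE-style mean aggregator reads only the features of neighbors, never their labels, and a label is consulted only for the loss on the training node itself. This is a toy NumPy illustration, not Marius internals; the graph, features, and labels are made up.

```python
import numpy as np

# Toy graph: 4 nodes; node 0 is a train node, node 3 is an unlabeled/test node.
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0],
                     [0.5, 0.5]])   # base features, known for every node
labels = {0: 1}                      # labels exist only for the train node(s)
neighbors = {0: [1, 2, 3]}           # node 3 is a neighbor of train node 0

# GraphSAGE-style MEAN aggregation for node 0: only features are read,
# so it makes no difference that node 3 carries no (or a hidden) label.
agg = features[neighbors[0]].mean(axis=0)
h0 = np.concatenate([features[0], agg])  # self features + aggregated neighbors

# The label enters only here, as the loss target for the train node itself.
loss_target = labels[0]
```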
Hi, I want to run marius with ogbn-products dataset.
I executed the following command:
marius_preprocess --dataset ogbn_products --output_dir datasets/ogbn_products
It ran without problems, but only 'train_edges.bin' was created in ogbn_products/edges. There are no 'test_edges.bin' or 'validation_edges.bin' files. How can I get them?
Thanks a lot.