marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

'test_edges.bin' and 'validation_edges.bin' are not created when preprocessing ogbn_products #110

Closed qhtjrmin closed 2 years ago

qhtjrmin commented 2 years ago

Hi, I want to run Marius with the ogbn-products dataset.

I executed the following command: marius_preprocess --dataset ogbn_products --output_dir datasets/ogbn_products

There was no problem running it, but only 'train_edges.bin' was created in ogbn_products/edges; there are no 'test_edges.bin' or 'validation_edges.bin' files. How can I get them?

Thanks a lot.

rogerwaleffe commented 2 years ago

Thanks for your question. This is actually the expected behavior, because ogbn-products is a benchmark graph for the task of node classification. The goal for this graph is to predict the category of a product (node) in a multi-class classification setup. Thus the nodes in the graph are split into sets of labeled training, validation, and test nodes, but the edges of the graph are not split in any way. All edges in the graph are contained in the train_edges.bin file.
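
For what it's worth, you can sanity-check this by loading the file directly. Below is a minimal sketch; the flat int32 [source, destination] pair layout is an assumption based on the default "dtype: int" edge storage, so verify the count against num_edges in your generated dataset.yaml before relying on it:

import numpy as np

# Assumed layout: a flat binary array of int32 [source, destination] pairs,
# matching the default "dtype: int" edge storage. Verify before relying on it.
edges = np.fromfile("datasets/ogbn_products/edges/train_edges.bin", dtype=np.int32)
edges = edges.reshape(-1, 2)  # one [src, dst] row per edge
print(edges.shape[0])  # should equal num_edges reported by marius_preprocess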

If you would like to run link prediction on the ogbn-products dataset you will need to treat it as a custom graph for link prediction and split the graph edges into random train/val/test edges manually. This can be done as outlined in this example. In that example, we use the ogbn-arxiv node classification dataset as a custom graph for link prediction.
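
For concreteness, the manual split might look something like the sketch below. This is not the code from the linked example; the edges.csv input path, the two-column format, and the 90/5/5 split are all arbitrary illustrative choices:

import numpy as np

# Hypothetical edge split for link prediction: shuffle the full edge list and
# carve off validation/test fractions. File names and split sizes are
# arbitrary; pass the resulting files to marius_preprocess as a custom dataset.
rng = np.random.default_rng(seed=42)

edges = np.loadtxt("edges.csv", delimiter=",", dtype=np.int64)  # [src, dst] per row
rng.shuffle(edges)  # shuffles rows in place

n = edges.shape[0]
n_valid = int(0.05 * n)
n_test = int(0.05 * n)

np.savetxt("valid_edges.csv", edges[:n_valid], fmt="%d", delimiter=",")
np.savetxt("test_edges.csv", edges[n_valid:n_valid + n_test], fmt="%d", delimiter=",")
np.savetxt("train_edges.csv", edges[n_valid + n_test:], fmt="%d", delimiter=",")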

qhtjrmin commented 2 years ago

Thank you for your rapid reply!

I wanted to perform node classification with this dataset, but I get the following error:

$ ./marius_train examples/configuration/ogbn_product.yaml
[2022-08-10 02:20:02.722] [info] [marius.cpp:43] Start initialization
file reading error: /home/marius/datasets/ogbn_products/edges/validation_edges.bin
file reading error: /home/marius/datasets/ogbn_products/edges/test_edges.bin

In this case, how can I run marius_train?

rogerwaleffe commented 2 years ago

Can you share the ogbn_products.yaml file that you are using?

qhtjrmin commented 2 years ago

I wanted to try running it first, so I used custom_nc.yaml almost as-is.

The following is the current ogbn_products.yaml:

model:
  learning_task: NODE_CLASSIFICATION
  encoder:
    train_neighbor_sampling:
      - type: ALL
      - type: ALL
      - type: ALL
    layers:
      - - type: FEATURE
          output_dim: 100
          bias: true
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          input_dim: 100
          output_dim: 100
          bias: true
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          input_dim: 100
          output_dim: 100
          bias: true
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          input_dim: 100
          output_dim: 47
          bias: true
  decoder:
    type: NODE
  loss:
    type: CROSS_ENTROPY
    options:
      reduction: SUM
  dense_optimizer:
    type: ADAM
    options:
      learning_rate: 0.01
storage:
  device_type: cuda
  dataset:
    dataset_dir: datasets/ogbn_products
  edges:
    type: DEVICE_MEMORY
    options:
      dtype: int
  features:
    type: DEVICE_MEMORY
    options:
      dtype: float
training:
  batch_size: 1000
  num_epochs: 10
  pipeline:
    sync: true
evaluation:
  batch_size: 1000
  pipeline:
    sync: true

rogerwaleffe commented 2 years ago

Nothing immediately jumps out as a problem. However, I can't tell whether the YAML formatting is correct. Can you also share the full_config.yaml file that is generated in the dataset directory so I can take a look at that?

qhtjrmin commented 2 years ago

I can't find a full_config.yaml file. Do you mean the dataset.yaml that is generated in the dataset directory? The following is dataset.yaml:

dataset_dir: /home/marius/datasets/ogbn_products/
num_edges: 61859140
num_nodes: 2449029
num_relations: 1
num_train: 196615
num_valid: 39323
num_test: 2213091
node_feature_dim: 100
rel_feature_dim: -1
num_classes: 47
initialized: false

rogerwaleffe commented 2 years ago

If you can't find the full_config.yaml file, my guess is that your ogbn_products.yaml has some formatting problems that cause the config parsing to get messed up. Try the config file below, where I have also switched to uniform neighborhood sampling, because sampling all neighbors will likely cause a GPU out-of-memory error on the products graph (a short sketch of what uniform sampling does follows the config). This config file worked for me on the main branch. You can also see another example config for products here.

model:
  learning_task: NODE_CLASSIFICATION
  encoder:
    train_neighbor_sampling:
      - type: UNIFORM
        options:
          max_neighbors: 15
      - type: UNIFORM
        options:
          max_neighbors: 10
      - type: UNIFORM
        options:
          max_neighbors: 5
    eval_neighbor_sampling:
      - type: UNIFORM
        options:
          max_neighbors: 15
      - type: UNIFORM
        options:
          max_neighbors: 10
      - type: UNIFORM
        options:
          max_neighbors: 5
    layers:
      - - type: FEATURE
          output_dim: 100
          bias: true
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          input_dim: 100
          output_dim: 100
          bias: true
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          input_dim: 100
          output_dim: 100
          bias: true
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          input_dim: 100
          output_dim: 47
          bias: true
  decoder:
    type: NODE
  loss:
    type: CROSS_ENTROPY
    options:
      reduction: SUM
  dense_optimizer:
    type: ADAM
    options:
      learning_rate: 0.01
storage:
  device_type: cuda
  dataset:
    dataset_dir: datasets/ogbn_products
  edges:
    type: DEVICE_MEMORY
    options:
      dtype: int
  features:
    type: DEVICE_MEMORY
    options:
      dtype: float
training:
  batch_size: 1000
  num_epochs: 10
  pipeline:
    sync: true
evaluation:
  batch_size: 1000
  pipeline:
    sync: true
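
As mentioned above, here is a toy sketch of what per-layer UNIFORM sampling with the max_neighbors caps from this config (15, 10, 5) does. It is only meant to illustrate the semantics of the option, not Marius's actual implementation:

import numpy as np

rng = np.random.default_rng(0)

def sample_neighbors(neighbors, max_neighbors):
    # Keep at most max_neighbors neighbors, chosen uniformly at random;
    # this bounds the per-node fan-out (and hence GPU memory) at each layer.
    if len(neighbors) <= max_neighbors:
        return np.asarray(neighbors)
    return rng.choice(neighbors, size=max_neighbors, replace=False)

# A node with 40 neighbors: the outermost GNN layer aggregates over at most
# 15 of them instead of all 40, unlike the ALL sampler.
print(len(sample_neighbors(np.arange(40), 15)))  # -> 15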

qhtjrmin commented 2 years ago

Oh, thanks a lot! Now it works well! Additionally, could you share the configuration files used in the experiments of the Marius++ paper (NC: Papers100M, MAG240M-C; LP: Freebase86m, WikiKG90Mv2)?

rogerwaleffe commented 2 years ago

Yes, those will be shared soon. There will be a separate branch with the Marius++ artifact in the next 2 weeks, so keep an eye out! Closing this issue.

qhtjrmin commented 2 years ago

Hello, could I know when the configuration files for the experiments of the Marius++ paper will be released?

rogerwaleffe commented 2 years ago

We have released the artifact for the Marius++ paper (which will be renamed to MariusGNN) here. The configs used for the paper can be found there (some configs have not been uploaded yet but will be soon). Note that, as Marius/MariusGNN is under active development, the artifact contains an older version of the code base. As such, its configuration file format differs from the one currently used by the main branch, but the older artifact configs should be easy to convert to the new format.

Krith-man commented 1 year ago

Hello,

My question is related to the edges used for the node classification task. As you mentioned, the training edges include all the edges of the dataset. However, given a set of training nodes N, shouldn't we work only on the edges whose endpoints belong to N? Otherwise it is possible to sample a node v that is not included in the training node set N. So my question is: why do you keep using all the edges when you train on only a subset of the nodes?

rogerwaleffe commented 1 year ago

This is a good question!

Fundamentally, the node classification learning task takes as input a graph (a set of nodes and their edges) and the base features of each node in this graph. Then the goal is to predict labels of the nodes in the graph using the base features and the edges.

To train a model to perform this task, we manually label a subset of the nodes in the graph (N, as you call it). Let's focus on a single node u in this set N. While training a GNN model, we use the base features of node u, as well as the base features of its neighbors in the input graph, to predict its label (which we know from the manual labeling).

Consider a node v in the neighborhood of u; your concern is what happens if v is not a training node. This is okay because when node v is used as a neighbor of node u, the label of node v is not used. For example, even if node v is a "test node" with a hidden label, that label is never used when training on node u. Only the base features of node v and the fact that nodes v and u are connected are used when training the model. This information is assumed to be valid to use as part of the general node classification task above, i.e., to predict the label of u we assume we are allowed to use the edges in the graph and the features of the nodes in the graph.

In fact, in the general case of node classification over large graphs, it is most often the case that node v has no label at all (i.e., it is not part of the train, validation, or test set).
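
To make the last point concrete, here is a generic (framework-agnostic, not Marius-specific) sketch of a GNN training step: the forward pass aggregates over all edges and all base features, but labels are only ever read for the training nodes:

import torch
import torch.nn.functional as F

def train_step(model, optimizer, features, edge_index, labels, train_mask):
    # model is any GNN mapping (node features, edges) to per-node logits;
    # train_mask is a boolean tensor marking the labeled training nodes N.
    model.train()
    optimizer.zero_grad()

    # Message passing uses ALL edges and ALL base features, including those
    # of validation/test/unlabeled neighbors such as node v ...
    logits = model(features, edge_index)

    # ... but the loss (the only place labels are read) is restricted to the
    # training nodes, so hidden labels never influence training.
    loss = F.cross_entropy(logits[train_mask], labels[train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()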