marius-team / marius

Large scale graph learning on a single machine.
https://marius-project.org
Apache License 2.0

Running the disk version of MariusGNN fails #143

Open JIESUN233 opened 12 months ago

JIESUN233 commented 12 months ago

Hi, I'd like to run the disk version of MariusGNN. I found that when I set the features' storage type to PARTITION_BUFFER, I hit a segmentation fault:

[screenshot of the segmentation fault error]

(If I set the storage type to HOST_MEMORY instead, training runs successfully.)
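For reference, the setting in question is the features entry of the storage section, roughly like this (a minimal sketch, not the full config, which is in the screenshot below):

features:
  type: PARTITION_BUFFER # segfaults; with type: HOST_MEMORY, training succeeds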

Specifically, I downloaded the master branch of Marius, built a Docker image, and ran my experiments inside the container. I followed these instructions to install Marius:

[screenshot of the installation instructions]

Here is my training config file:

[screenshot of the training config file]

And this is how I generated the dataset:

[screenshot of the dataset generation commands]
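(In case the screenshot is not visible: the command was presumably along the lines of marius_preprocess --dataset ogbn_products --output_dir products_example/ --num_partitions 32. That is a hedged reconstruction, assuming the built-in ogbn_products dataset behind the products_example/ directory and a partition count matching num_partitions in the storage config; the exact flags are only in the screenshot.)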

This is the dataset directory:

[screenshot of the dataset directory]

I would really appreciate it if you could help me solve this issue!

rogerwaleffe commented 11 months ago

Hi there. Thanks for your question. It's not immediately obvious to me why this isn't working, but it may be because you are trying to put the edges in HOST_MEMORY. Can you try the following for your storage config:

  device_type: cuda
  dataset_dir: products_example/
  edges:
    type: FLAT_FILE # keep edges on disk in a flat file rather than in HOST_MEMORY
  nodes:
    type: HOST_MEMORY
  features:
    type: PARTITION_BUFFER # page node feature partitions from disk through an in-memory buffer
    options:
      num_partitions: 32 # should match the partition count used during preprocessing
      buffer_capacity: 5 # number of partitions held in the buffer at once
      prefetching: true
      fine_to_coarse_ratio: 1
      num_cache_partitions: 0
      node_partition_ordering: DISPERSED
rogerwaleffe commented 10 months ago

I looked into this issue a bit more, and it seems there were some bugs in the code that surfaced only rarely in general, but more often when running disk-based training. I have fixed those issues in PR #147 and merged the changes into main.

Can you try running your config again?

With the updates, I did not have any issues running the following.

Preprocessing command: marius_preprocess --dataset ogbn_arxiv --output_dir datasets/ogbn_arxiv --num_partitions 32
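(Note that --num_partitions 32 here matches num_partitions: 32 under storage.features.options in the config below; the two values should agree so that the partition buffer reads the feature partitions produced by preprocessing.)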

Config:

model:
  learning_task: NODE_CLASSIFICATION
  encoder:
    use_incoming_nbrs: true
    use_outgoing_nbrs: true
    train_neighbor_sampling:
      - type: UNIFORM
        options:
          max_neighbors: 15
        use_hashmap_sets: true
      - type: UNIFORM
        options:
          max_neighbors: 10
      - type: UNIFORM
        options:
          max_neighbors: 5
    eval_neighbor_sampling:
      - type: UNIFORM
        options:
          max_neighbors: 15
        use_hashmap_sets: true
      - type: UNIFORM
        options:
          max_neighbors: 10
      - type: UNIFORM
        options:
          max_neighbors: 5
    layers:
      - - type: FEATURE
          output_dim: 128
          bias: false
          activation: NONE
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          init:
            type: GLOROT_NORMAL
          input_dim: 128
          output_dim: 128
          bias: true
          bias_init:
            type: ZEROS
          activation: RELU
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          init:
            type: GLOROT_NORMAL
          input_dim: 128
          output_dim: 128
          bias: true
          bias_init:
            type: ZEROS
          activation: RELU
      - - type: GNN
          options:
            type: GRAPH_SAGE
            aggregator: MEAN
          init:
            type: GLOROT_NORMAL
          input_dim: 128
          output_dim: 40
          bias: true
          bias_init:
            type: ZEROS
          activation: NONE
  decoder:
    type: NODE
  loss:
    type: CROSS_ENTROPY
    options:
      reduction: MEAN
  dense_optimizer:
    type: ADAM
    options:
      learning_rate: 0.003
storage:
  device_type: cuda
  dataset:
    dataset_dir: datasets/ogbn_arxiv/
    num_edges: 1166243
    num_nodes: 169343
    num_relations: 1
    num_train: 90941
    num_valid: 29799
    num_test: 48603
    feature_dim: 128
    num_classes: 40
  edges:
    type: FLAT_FILE
  nodes:
    type: HOST_MEMORY
  features:
    type: PARTITION_BUFFER
    options:
      num_partitions: 32
      buffer_capacity: 3
      prefetching: true
      fine_to_coarse_ratio: 1
      num_cache_partitions: 0
      node_partition_ordering: DISPERSED
  prefetch: true
  shuffle_input: true
  full_graph_evaluation: true
  train_edges_pre_sorted: false
training:
  batch_size: 1000
  num_epochs: 5
  pipeline:
    sync: true
  epochs_per_shuffle: 1
  logs_per_epoch: 10
evaluation:
  batch_size: 1000
  pipeline:
    sync: true
  epochs_per_eval: 1
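For completeness: training with a config like this is launched by passing the file to marius_train, e.g. marius_train path/to/config.yaml.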