mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

[Dataset] Clarification on Dataset processing #767

Open BowenYao18 opened 1 month ago

BowenYao18 commented 1 month ago

https://github.com/mlcommons/training/blob/cdd928d4596c142c15a7d86b2eeadbac718c8da2/graph_neural_network/dataset.py#L137-L139

Let me use an example.

  1. Assume we have an edge file like this:
    [0, 1, 2]  # cites_edge[0, :]
    [1, 2, 3]  # cites_edge[1, :]
  2. Then, we first apply add_self_loops(remove_self_loops(paper_paper_edges)[0])[0], which gives us this:
    [0, 1, 2, 0, 1, 2, 3]  # cites_edge[0, :]
    [1, 2, 3, 0, 1, 2, 3]  # cites_edge[1, :]
  3. Then, we have its reverse edges:
    [1, 2, 3, 0, 1, 2, 3]  # cites_edge[1, :]
    [0, 1, 2, 0, 1, 2, 3]  # cites_edge[0, :]
  4. If we follow this code, (torch.cat([cites_edge[1, :], cites_edge[0, :]]), torch.cat([cites_edge[0, :], cites_edge[1, :]])), we should have this:
    [1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3]  # cites_edge[1, :] + cites_edge[0, :]
    [0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 0, 1, 2, 3]  # cites_edge[0, :] + cites_edge[1, :]

Since we must follow the MLPerf reference exactly, we cannot swap the two and use (torch.cat([cites_edge[0, :], cites_edge[1, :]]), torch.cat([cites_edge[1, :], cites_edge[0, :]])), which would give this instead:

[0, 1, 2, 0, 1, 2, 3, 1, 2, 3, 0, 1, 2, 3]  # cites_edge[0, :] + cites_edge[1, :]
[1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3]  # cites_edge[1, :] + cites_edge[0, :]

Am I understanding this correctly? Does the order matter here? Thank you!

Elnifio commented 1 month ago

In GNN training, we only care that the graph topology is the same: your graph should contain exactly the same edges (a->b, for every pair of nodes a, b), with the same edge count, as the graph built by the reference implementation.
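One way to check this requirement is to compare the two edge lists as multisets of (source, destination) pairs. The sketch below is only an illustration in plain Python, not the reference code: `with_self_loops` and `edge_multiset` are hypothetical helpers, the first standing in for the `add_self_loops(remove_self_loops(...))` step, and the 4-node graph is the example from the question.

```python
from collections import Counter

def with_self_loops(src, dst, num_nodes):
    """Drop existing self-loops, then append one self-loop per node.

    Hypothetical plain-Python stand-in for
    add_self_loops(remove_self_loops(edges)[0])[0].
    """
    kept = [(s, d) for s, d in zip(src, dst) if s != d]
    kept += [(n, n) for n in range(num_nodes)]
    return [s for s, _ in kept], [d for _, d in kept]

def edge_multiset(src, dst):
    """Count each (source, destination) pair, keeping duplicates."""
    return Counter(zip(src, dst))

# Example edge file from the question: 0->1, 1->2, 2->3.
e0, e1 = with_self_loops([0, 1, 2], [1, 2, 3], num_nodes=4)

# Ordering quoted in the question: reversed edges first, then the originals.
ref_src, ref_dst = e1 + e0, e0 + e1
# Swapped ordering: originals first, then the reversed edges.
alt_src, alt_dst = e0 + e1, e1 + e0

# Same pairs with the same multiplicities means the same topology.
same_topology = edge_multiset(ref_src, ref_dst) == edge_multiset(alt_src, alt_dst)
print(same_topology)
```

Because a `Counter` keeps multiplicities, the self-loops that appear twice after concatenation are counted in both orderings.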

However, the order does not matter here: the edges are rearranged during the COO -> CSC conversion anyway, so you can also use cites_edge[0, :] + cites_edge[1, :] for the source and cites_edge[1, :] + cites_edge[0, :] for the destination.
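To see why the conversion hides the input order, here is a minimal COO -> CSC sketch in plain Python. It assumes entries are ordered by destination and then by source; `coo_to_csc` is a hypothetical helper, and a real conversion routine may order neighbors within a column differently, which would still leave each node's set of incoming edges unchanged.

```python
def coo_to_csc(src, dst, num_nodes):
    """Convert COO edge lists to CSC form (indptr, indices).

    Column j of the result lists the sources of all edges pointing to node j.
    """
    # Sort edge positions by destination, breaking ties by source.
    order = sorted(range(len(dst)), key=lambda i: (dst[i], src[i]))
    indices = [src[i] for i in order]

    # Count incoming edges per node, then prefix-sum into column offsets.
    indptr = [0] * (num_nodes + 1)
    for d in dst:
        indptr[d + 1] += 1
    for j in range(num_nodes):
        indptr[j + 1] += indptr[j]
    return indptr, indices

# Edge rows from the question, after the self-loop step.
e0 = [0, 1, 2, 0, 1, 2, 3]
e1 = [1, 2, 3, 0, 1, 2, 3]

# Ordering quoted in the question vs. the swapped ordering.
csc_ref = coo_to_csc(e1 + e0, e0 + e1, num_nodes=4)
csc_alt = coo_to_csc(e0 + e1, e1 + e0, num_nodes=4)
print(csc_ref == csc_alt)
```

Both orderings produce the same `indptr` and `indices` here because the concatenated edge list is symmetric: every a->b edge comes with its b->a counterpart.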