pyg-team / pytorch_geometric


Bag-of-words mapping for datasets #3728

Open Gori-LV opened 2 years ago

Gori-LV commented 2 years ago

📚 Documentation

I'm not sure this is the right tag, but I would like to ask whether the bag-of-words mappings, and more generally information on the semantic meaning of the features, could be provided for the embedded datasets TORCH_GEOMETRIC.DATASETS.DBLP and TORCH_GEOMETRIC.DATASETS.IMDB, e.g. as a mapping file or README in the downloaded raw/ folder, or as a link in the source code/documentation. It would be super helpful for anyone who wants to analyse the actual information that the models take in.

For example, author nodes in the DBLP dataset carry 334-dimensional bag-of-words features, while paper nodes and term nodes use 4231-dimensional and 50-dimensional features, respectively. I'm curious what these dimensions stand for.
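To make concrete what kind of "mapping" I mean: purely as an illustration (this vocabulary is made up and is not the one used to build the DBLP features), a bag-of-words mapping is simply the index-to-word table that accompanies the count vectors:

from sklearn.feature_extraction.text import CountVectorizer

# Illustrative only: toy documents, not the DBLP preprocessing.
docs = ["graph neural networks", "heterogeneous graph embedding"]
vec = CountVectorizer()
X = vec.fit_transform(docs)            # shape [num_docs, vocab_size]
print(vec.get_feature_names_out())     # index -> word mapping
print(X.toarray())                     # the bag-of-words feature matrix

A file with exactly this index-to-word information for the 334-dim author and 4231-dim paper features is what I'm asking for.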

Many thanks!!
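For reference, the printout below can be reproduced with a minimal snippet along these lines (the './data/DBLP' root is just a placeholder path):

from torch_geometric.datasets import DBLP

# Downloads the pre-processed DBLP files on first use and caches them
# under root/raw and root/processed ('./data/DBLP' is a placeholder).
dataset = DBLP(root='./data/DBLP')
data = dataset[0]  # a single HeteroData object
print(data)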

HeteroData(
  author={
    x=[4057, 334],
    y=[4057],
    train_mask=[4057],
    val_mask=[4057],
    test_mask=[4057]
  },
  paper={ x=[14328, 4231] },
  term={ x=[7723, 50] },
  conference={ num_nodes=20 },
  (author, to, paper)={ edge_index=[2, 19645] },
  (paper, to, author)={ edge_index=[2, 19645] },
  (paper, to, term)={ edge_index=[2, 85810] },
  (paper, to, conference)={ edge_index=[2, 14328] },
  (term, to, paper)={ edge_index=[2, 85810] },
  (conference, to, paper)={ edge_index=[2, 14328] }
)
rusty1s commented 2 years ago

Yes, this is a great suggestion, but I doubt that this is possible, as most datasets are simply downloaded from the official code repository that introduced the respective dataset. In some cases a README.md file exists, but this does not hold for others. As such, to fully understand a given dataset, it's best to look at the papers/code repositories linked in the documentation. For example, for DBLP, this is:

In that code, there also exists a pre-processing script for the DBLP dataset; see here.
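As a small sketch (using the same placeholder root as above), one way to check what the download actually ships, and whether any mapping or README file is included, is to list the dataset's raw directory via the standard raw_dir attribute:

import os

from torch_geometric.datasets import DBLP

dataset = DBLP(root='./data/DBLP')   # placeholder root path
print(dataset.raw_dir)               # folder holding the downloaded raw files
print(os.listdir(dataset.raw_dir))   # check for any README or mapping files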