snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License
1.89k stars 397 forks

How are text feature representations chosen? #228

Closed tsafavi closed 3 years ago

tsafavi commented 3 years ago

Hi, thank you for your contribution with these datasets. I was wondering: for the graphs that have text features on the nodes (e.g., ogbn-products, ogbn-arxiv), how are the text representation methods chosen? They seem to differ by dataset, e.g., bag-of-words + PCA for ogbn-products versus averaged word2vec for ogbn-arxiv, but I didn't see any justification for these choices.

Tara

weihua916 commented 3 years ago

Good question! word2vec averaging is used for most of the datasets (ogbn-arxiv, ogbn-papers100M, ogbn-mag, ogbl-collab, ogbl-citation2). For ogbn-products, we used the PCA features since we directly adopted the dataset from [1], where we only changed the dataset splitting.

[1] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp. 257–266, 2019.
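The word2vec-averaging scheme described above can be sketched as follows. This is a minimal illustration, not the actual OGB preprocessing: the embedding table, its dimensionality, and the tokenization here are toy placeholders.

```python
import numpy as np

# Toy pretrained word embeddings. The real OGB features come from a word2vec
# model trained on the dataset's corpus (e.g., 128-dim for ogbn-arxiv);
# the 3-dim vectors here are purely illustrative.
emb = {
    "graph":   np.array([0.1, 0.2, 0.3]),
    "neural":  np.array([0.4, 0.0, 0.1]),
    "network": np.array([0.2, 0.5, 0.0]),
}
DIM = 3

def text_to_feature(text):
    """Average the embeddings of all in-vocabulary tokens in the text."""
    vecs = [emb[w] for w in text.lower().split() if w in emb]
    if not vecs:
        return np.zeros(DIM)  # fallback for fully out-of-vocabulary text
    return np.mean(vecs, axis=0)

x = text_to_feature("Graph neural network")
```

Every node (e.g., a paper's title and abstract) gets one fixed-size vector regardless of text length, which is what makes this a convenient default node feature.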

tsafavi commented 3 years ago

Thank you for your response. As a follow-up, are leaderboard submissions with text feature representations not provided by the benchmark allowed? For example, bag of words rather than word2vec for ogbn-arxiv.

weihua916 commented 3 years ago

It is allowed. You should explicitly mention that in the method title to make it clear.
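For reference, an alternative bag-of-words pipeline like the one asked about (and similar in spirit to the BoW + PCA features of ogbn-products) could be sketched with scikit-learn. The corpus and the output dimension here are toy choices; the released ogbn-products features are 100-dimensional.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

# Toy node texts; in practice these would be the dataset's raw text.
texts = [
    "graph neural network for node classification",
    "scalable training of graph convolutional networks",
    "word embeddings for text classification",
    "efficient graph sampling and training",
]

# Bag-of-words counts, then PCA down to a small feature dimension.
counts = CountVectorizer().fit_transform(texts).toarray()
features = PCA(n_components=2).fit_transform(counts)  # one row per node
```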

weihua916 commented 3 years ago

FYI: you can get the raw text of ogbn-arxiv here. For the other datasets, you will need to retrieve the text yourself using the mapping information stored in the mapping/ directory of the dataset folder.
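Reading such a mapping file might look like the following sketch. The path and filename are assumptions based on the usual OGB dataset layout (gzipped CSVs such as nodeidx2paperid.csv.gz for ogbn-arxiv; exact filenames vary per dataset), and the column names are read from the file rather than hard-coded.

```python
import pandas as pd

# Assumed on-disk layout after downloading the dataset with the ogb package;
# the mapping/ folder holds gzipped CSVs mapping OGB node indices to the
# original entity IDs (e.g., MAG paper IDs for ogbn-arxiv).
path = "dataset/ogbn_arxiv/mapping/nodeidx2paperid.csv.gz"

def load_mapping(path):
    """Return a dict from OGB node index to the original entity ID."""
    df = pd.read_csv(path)  # pandas decompresses .gz automatically
    idx_col, id_col = df.columns[0], df.columns[1]
    return dict(zip(df[idx_col], df[id_col]))
```

With the original IDs in hand, the text itself can then be fetched from the upstream source (e.g., the paper metadata) and encoded however the submission requires.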

tsafavi commented 3 years ago

Thank you very much!!