snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Negative samples don't have a type #211

Closed. MikeDoes closed this issue 3 years ago

MikeDoes commented 3 years ago

As we discussed in #210, entities are identified by the pair (type, index), not by the index alone.

However, the negative samples in the validation set do not specify which type each entity belongs to.

If we assume each negative has the same type as the corresponding entity in the positive sample, we end up with more entities than the specified count (9377).


Notebook: https://colab.research.google.com/drive/1pUWrZVLve4Ohc3w3ZmsPYAIy55_T4NIC?usp=sharing

MikeDoes commented 3 years ago

Kindly let us know if you want us to edit the validation and training sets so that they are consistent with the (index, type) data format.

(The earlier the better.)

weihua916 commented 3 years ago

The answer is in the dataset documentation (https://ogb.stanford.edu/docs/linkprop/#ogbl-biokg): "we only consider ranking against entities of the same type. For instance, when corrupting head entities of the protein type, we only consider negative protein entities."
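
To make the quoted rule concrete, here is a minimal sketch of type-constrained negative sampling. It is not OGB's implementation: the helper `sample_typed_negatives`, the toy per-type counts, and the choice to exclude the true index are all assumptions made for illustration.

```python
import random


def sample_typed_negatives(true_index, num_entities_of_type, k, rng):
    """Draw k distinct corrupted indices from the same type's local index range.

    Hypothetical helper: the true index is excluded here, which is a choice
    made for this sketch and not necessarily what OGB does.
    """
    candidates = [i for i in range(num_entities_of_type) if i != true_index]
    return rng.sample(candidates, k)


# Toy per-type entity counts (illustrative, not the real ogbl-biokg sizes).
num_nodes = {"protein": 5, "disease": 3}

rng = random.Random(0)

# Corrupt the tail of a (protein, relation, disease) triple.
# The negatives inherit the tail's type, so only disease indices are drawn.
tail_type, tail_index = "disease", 1
negatives = sample_typed_negatives(tail_index, num_nodes[tail_type], 2, rng)
```

Under this reading, a negative index is a local index within the type of the entity it replaces, so the type never has to be stored alongside it.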

MikeDoes commented 3 years ago

That makes sense, but the current format felt misleading. Maybe we could push a version that changes the indices, as implemented in this notebook: https://github.com/BrianPulfer/GDL-FinalProject/blob/main/notebooks/ComplEX_best_performances.ipynb

weihua916 commented 3 years ago

I think the current format is the best way to represent the heterogeneous KG, and users should feel free to convert it to a homogeneous KG if they want (which would inevitably drop node type information). Therefore, we will keep the format as it is.
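
For readers who do want a homogeneous view, one possible conversion is to give each node type a contiguous block of global IDs. The sketch below assumes a dictionary of per-type node counts (a toy one here) and uses hypothetical helpers `build_offsets` and `to_global`; it is not taken from the OGB code or from the linked notebook.

```python
def build_offsets(num_nodes_dict):
    """Assign each node type a starting offset in a single global index space."""
    offsets, total = {}, 0
    for node_type in sorted(num_nodes_dict):  # sorted for a deterministic layout
        offsets[node_type] = total
        total += num_nodes_dict[node_type]
    return offsets, total


def to_global(node_type, local_index, offsets):
    """Map a (type, local index) pair to a single global index."""
    return offsets[node_type] + local_index


# Toy per-type counts (illustrative, not the real dataset sizes).
num_nodes_dict = {"disease": 3, "protein": 5}
offsets, num_entities = build_offsets(num_nodes_dict)

# A positive triple with a protein head and a disease tail.
head_global = to_global("protein", 2, offsets)
tail_global = to_global("disease", 1, offsets)

# Negative tails are local indices of the tail's type, so they share its offset.
neg_tail_global = [to_global("disease", i, offsets) for i in [0, 2]]
```

After this mapping the type is no longer stored with each index; it can only be recovered by keeping the offsets table around.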

Also, from a modeling perspective, it does not make much sense to draw negative samples from invalid node types, e.g., training against protein tail nodes when we know the tail should be a disease node.