Closed MikeDoes closed 3 years ago
Please let us know if you would like us to edit the validation and training datasets so that they are consistent with the (index, type) data format.
(The earlier, the better.)
Here is the answer, from https://ogb.stanford.edu/docs/linkprop/#ogbl-biokg: "we only consider ranking against entities of the same type. For instance, when corrupting head entities of the protein type, we only consider negative protein entities."
That makes sense, but it felt misleading. Maybe we can push a version that changes the indices, as implemented in this notebook: https://github.com/BrianPulfer/GDL-FinalProject/blob/main/notebooks/ComplEX_best_performances.ipynb
I think the current format is the best way to represent the heterogeneous KG, and users should feel free to convert it to a homogeneous KG if they want (which would inevitably drop node type information). Therefore, we will keep the format as it is.
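For anyone who does want the homogeneous view, a minimal sketch of the conversion is below. It assumes each node type gets a contiguous block of global indices; the per-type counts and the helper names `build_offsets` and `to_global` are made up for illustration and are not part of the dataset or the OGB API.

```python
# Hypothetical per-type entity counts; the real values come from the dataset.
num_nodes = {"disease": 3, "drug": 4, "protein": 5}


def build_offsets(num_nodes):
    """Map each node type to the first global index of its block."""
    offsets, total = {}, 0
    # Sorting the type names keeps the mapping reproducible across runs.
    for node_type in sorted(num_nodes):
        offsets[node_type] = total
        total += num_nodes[node_type]
    return offsets, total


def to_global(node_type, local_index, offsets):
    """Convert a (type, index) pair into a single global index."""
    return offsets[node_type] + local_index


offsets, total = build_offsets(num_nodes)
protein_zero = to_global("protein", 0, offsets)
disease_zero = to_global("disease", 0, offsets)
```

With this mapping, `("protein", 0)` and `("disease", 0)` land on different global indices, but the type itself is no longer recoverable from the index unless the offsets are kept around.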
Also, from a modeling standpoint, it does not make much sense to draw negative samples from invalid node types, e.g., training against protein tail nodes when we already know the tail should be a disease node.
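To make the point concrete, here is a rough sketch of type-constrained negative sampling. It assumes negatives are drawn uniformly, with replacement, from the indices of the same type as the true entity; the function name `sample_typed_negatives` and the counts are hypothetical and do not describe how the released negatives were actually generated.

```python
import random


def sample_typed_negatives(true_index, num_entities_of_type, k, rng):
    """Draw k corrupted indices of the same type, excluding the true one."""
    candidates = [i for i in range(num_entities_of_type) if i != true_index]
    if not candidates:
        raise ValueError("need at least two entities of this type")
    # Sampling with replacement; every draw stays inside the type's own index range.
    return [rng.choice(candidates) for _ in range(k)]


rng = random.Random(0)
# Hypothetical example: the true tail is disease 2, and there are 10 disease nodes.
negatives = sample_typed_negatives(true_index=2, num_entities_of_type=10, k=5, rng=rng)
```

Because the candidates come from a single type, the returned values are local indices for that type and only make sense when paired with it.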
As we discussed in #210, entities are identified by the pair (type, index), not by index alone.
However, the negative samples in the validation set do not specify which type they belong to.
If we assume each negative has the same type as its positive sample, we end up with more entities than the number specified (9377).
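A small sketch of the counting involved, on toy data rather than the actual validation split: each negative is assumed to inherit the type of its positive, and distinct (type, index) pairs are compared against distinct raw indices. The function name `count_typed_entities` and the toy values are made up for illustration.

```python
def count_typed_entities(types, positives, negatives):
    """Count distinct (type, index) pairs, giving each negative its positive's type."""
    seen = set()
    for node_type, pos, negs in zip(types, positives, negatives):
        seen.add((node_type, pos))
        seen.update((node_type, n) for n in negs)
    return len(seen)


# Toy data: two positives of different types that reuse the same local indices.
types = ["protein", "disease"]
positives = [0, 0]
negatives = [[1, 2], [1, 2]]

typed_count = count_typed_entities(types, positives, negatives)
untyped_count = len({i for pos, negs in zip(positives, negatives) for i in [pos, *negs]})
```

Here the same three raw indices give six distinct entities once the type is attached, which is the kind of gap between the two counts described above.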
Notebook: https://colab.research.google.com/drive/1pUWrZVLve4Ohc3w3ZmsPYAIy55_T4NIC?usp=sharing