snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Fix ogbg code #107

weihua916 closed this 3 years ago

weihua916 commented 3 years ago

This pull request fixes an issue with ogbg-code, where the prediction target (i.e., the method name of a Python function) is not properly masked out in the input AST, resulting in label leakage.

We therefore deprecate ogbg-code and introduce ogbg-code2, which fixes the issue by properly masking out the prediction target (both the original function definition and its recursive definition) in the input AST. With the fix, baseline performance drops sharply (32% F1 --> 16% F1), indicating that the label leakage was indeed a serious issue.
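
To illustrate the kind of masking described above, here is a minimal sketch using Python's standard `ast` module. It is not the preprocessing code used to build ogbg-code2; the `_mask_` placeholder, the `MaskTarget` class, and the `mask_method_name` helper are hypothetical names chosen for this example. It assumes the snippet's first top-level statement is the function whose name is the prediction target.

```python
import ast

MASK = "_mask_"  # hypothetical placeholder token, chosen for this sketch


class MaskTarget(ast.NodeTransformer):
    """Replace every occurrence of the target method name in the AST."""

    def __init__(self, target):
        self.target = target

    def visit_FunctionDef(self, node):
        # Visit the body first so recursive calls inside it are masked too.
        self.generic_visit(node)
        if node.name == self.target:
            node.name = MASK
        return node

    def visit_Name(self, node):
        # Covers plain recursive calls such as `factorial(n - 1)`.
        if node.id == self.target:
            node.id = MASK
        return node

    def visit_Attribute(self, node):
        # Covers method-style recursion such as `self.factorial(n - 1)`.
        self.generic_visit(node)
        if node.attr == self.target:
            node.attr = MASK
        return node


def mask_method_name(source):
    """Return (label, masked_tree) for a snippet defining one function."""
    tree = ast.parse(source)
    func = tree.body[0]
    if not isinstance(func, ast.FunctionDef):
        raise ValueError("expected a function definition as first statement")
    label = func.name
    return label, MaskTarget(label).visit(tree)


source = """
def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n - 1)
"""

label, masked_tree = mask_method_name(source)
dumped = ast.dump(masked_tree)
print(label)
print("factorial" in dumped)
```

If only the `FunctionDef` name were masked and the recursive call left in place, the label would still be readable from the function body, which is the kind of leakage this pull request addresses.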

This bug was found thanks to our communication with Charles Sutton (@casutton). We gratefully acknowledge Charles for his critical insight.