[Discussion] Relevance of GreaseLM results in light of 'GNN is a Counter?..' paper + dataset discussion

Very interesting work on the combination of LM + KG! This is something I am looking into myself as a research project (https://github.com/apoorvumang/transformer-kgc), and I thought this would be a good place to discuss which datasets such models should be evaluated on.

In the very recently released paper 'GNN is a Counter? Revisiting GNN for Question Answering' (code at https://github.com/anonymousGSC/graph-soft-counter), the authors show that a 1-dim GNN + LM achieves almost SOTA results on both OpenBookQA and CommonsenseQA. In fact, according to their numbers, it even outperforms GreaseLM on both of these datasets.

I would like to discuss a few things regarding the dataset situation:

1. The CommonsenseQA leaderboard no longer accepts ConceptNet-based submissions, which is quite a bummer, and OpenBookQA is extremely small (only 500 test and 500 dev questions, and around 5k train). Is it worth it (for me and others) to work with these datasets, given the findings of the 'GNN is a Counter?' paper?
2. If not, could GreaseLM (and similar methods) be applied to regular KGQA datasets such as WebQuestionsSP, ComplexWebQuestions or GrailQA? This would of course be harder, since it's no longer MCQ reasoning, but it might be more interesting and could give real evidence of LM + KG based reasoning.
3. Are there any other datasets, apart from the ones I mentioned, that could be relevant in this area? (MedQA-USMLE is of course one, but I feel it is quite new, and having another older/more established dataset would be an advantage.)
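
To put the OpenBookQA size concern in rough numbers, here is a minimal sketch of how much sampling noise a 500-question test set carries. It assumes test accuracy can be treated as a binomial proportion with a normal-approximation 95% interval; the helper `accuracy_margin` and the 0.80 accuracy are hypothetical, picked only for illustration and not taken from any paper.

```python
import math


def accuracy_margin(accuracy: float, n_questions: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval for an accuracy.

    Assumes each question is an independent Bernoulli trial (a simplification).
    """
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return z * standard_error


# 0.80 is a hypothetical accuracy chosen for illustration, not a reported result.
margin = accuracy_margin(accuracy=0.80, n_questions=500)
print(f"+/- {100 * margin:.1f} accuracy points")  # prints "+/- 3.5 accuracy points"
```

Under these assumptions, a gap of one or two points between two systems on a 500-question test set would be hard to separate from sampling noise, which is part of why I am unsure how much weight to put on such comparisons.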

Looking forward to a healthy discussion! 😊
