Please ignore my second question; I didn't read the dataset composition process carefully enough.
Regarding the split: I agree that it's done in a randomized fashion. We use it because the same split has also been used in several other papers, e.g. Gong et al., 2017 and Tomar et al., 2017, so performance numbers are comparable.
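For what it's worth, a randomized split can still be shared and reproduced exactly by fixing the seed. A minimal sketch, assuming the QQP pairs live in a local TSV; the file name and split ratios here are hypothetical illustrations, not the actual Wang et al., 2017 recipe:

```python
# Minimal sketch of a reproducible randomized train/dev/test split.
# Assumes a TSV with columns like question1, question2, is_duplicate;
# the file name and 80/10/10 ratios are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("quora_duplicate_questions.tsv", sep="\t")

# Fixing random_state makes the "randomized" split deterministic,
# so later papers can reuse the exact same partition.
train, rest = train_test_split(df, test_size=0.2, random_state=42)
dev, test = train_test_split(rest, test_size=0.5, random_state=42)
```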
thanks so much!
I read the PAWS paper, and it includes the performance of models trained, validated, and tested on the QQP dataset (not PAWS-QQP). I'm just wondering if there is a standard split into these three groups. You guys said that "for the experiments in our paper, we used the train/dev/test split of the original QQP from Wang et al., 2017," but as far as I know, they split the data in a randomized fashion.
Furthermore, I noticed in the PAWS paper a significant drop in score going from QQP (train) -> PAWS-QQP (dev) for multiple models. Are those numbers definitely correct? It's just quite counterintuitive for me to see a model perform so badly on a dev set when the dev set is just a subset of the training set.
Thanks in advance!