google-research-datasets / paws

This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification.

Regarding Quora train/dev/test split #2

Closed AladarMiao closed 5 years ago

AladarMiao commented 5 years ago

I read the PAWS paper, and it reports the performance of models trained, validated, and tested on the QQP dataset (NOT PAWS-QQP). I'm just wondering if there is a standard split into these three groups? You guys said that "for the experiments in our paper, we used the train/dev/test split of the original QQP from Wang et al, 2017," but as far as I know, they split the data in a randomized fashion.

Furthermore, I noticed in the PAWS paper a significant drop in score going from QQP (train) -> PAWS-QQP (dev) for multiple models. Are those numbers for sure correct? It's just quite counterintuitive for me to see a model perform so badly on a dev set when the dev set is just a subset of the training set.

Thanks in advance!

AladarMiao commented 5 years ago

Please ignore my second question, I didn't read the dataset composition process carefully enough.

yuanzh commented 5 years ago

Regarding the split: I agree that it's done in a randomized fashion. We are using it because the same split has also been used in several other papers, e.g. Gong et al, 2017 and Tomar et al, 2017, so the performance numbers are comparable.
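For what it's worth, a randomized split is still reproducible and comparable across papers as long as the shuffle is seeded. A minimal sketch of that idea (the fractions, seed, and data fields below are placeholder assumptions, not the actual split from Wang et al, 2017):

```python
import random

def split_pairs(pairs, dev_frac=0.1, test_frac=0.1, seed=0):
    """Deterministically shuffle and split a list of question pairs.

    Illustrative sketch only: the fractions and seed are placeholder
    assumptions, not the actual QQP split used in the paper.
    """
    pairs = list(pairs)
    # A fixed seed makes the "randomized" split reproducible,
    # so everyone who runs this gets the same three subsets.
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_dev = int(n * dev_frac)
    n_test = int(n * test_frac)
    dev = pairs[:n_dev]
    test = pairs[n_dev:n_dev + n_test]
    train = pairs[n_dev + n_test:]
    return train, dev, test

# Placeholder data: (question1, question2, is_paraphrase)
data = [(f"q1_{i}", f"q2_{i}", i % 2) for i in range(10)]
train, dev, test = split_pairs(data)
print(len(train), len(dev), len(test))  # 8 1 1
```

Running the function twice with the same seed yields identical subsets, which is what makes performance numbers across papers comparable.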

AladarMiao commented 5 years ago

thanks so much!