google-research-datasets / paws

This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that highlight the importance of modeling structure, context, and word order information for the problem of paraphrase identification.

Do you use QQP_dev to decide the model for evaluation on PAWS? #12

Closed Punchwes closed 3 years ago

Punchwes commented 3 years ago

Hi, given that PAWS_QQP does not have a separate dev or test set, I wonder how you select the model in your original training setup for the scenarios QQP -> PAWS and QQP+PAWS_train -> PAWS. Do you still use QQP_dev for things like early stopping, or do you directly evaluate models on PAWS during training and pick the maximum acc/AUC?

Many thanks.

yuanzh commented 3 years ago

Hi, sorry for the late reply. I think we used QQP_dev for early stopping. If I remember correctly, we also tried using PAWS_dev to pick the best checkpoint. That risks overfitting slightly to PAWS_dev, but it didn't make a big difference. The QQP-only model was pretty bad on PAWS_dev regardless of which dev set you use to pick the best checkpoint. QQP+PAWS_train converged to the best performance on both QQP_dev and PAWS_dev.
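For what it's worth, the checkpoint-selection strategy discussed above can be sketched roughly as follows. This is only an illustrative sketch, not code from this repo; the names `checkpoints`, `dev_set`, and `evaluate` are hypothetical placeholders for whatever training loop and metric (accuracy or AUC) you actually use.

```python
def select_best_checkpoint(checkpoints, dev_set, evaluate):
    """Pick the checkpoint with the highest dev-set score.

    checkpoints: iterable of saved model checkpoints (any representation).
    dev_set: the held-out set used for selection (e.g. QQP_dev or PAWS_dev).
    evaluate: callable (checkpoint, dev_set) -> float score (acc or AUC).
    """
    best_ckpt, best_score = None, float("-inf")
    for ckpt in checkpoints:
        score = evaluate(ckpt, dev_set)
        if score > best_score:
            best_ckpt, best_score = ckpt, score
    return best_ckpt, best_score
```

The choice of `dev_set` is exactly the question here: passing QQP_dev selects for general QQP performance, while passing PAWS_dev selects for PAWS performance at some risk of overfitting to it.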