google-research-datasets / paws

This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification.

[Question] Is the labeled dataset contained in the unlabeled dataset as well? #1

Closed: f-lng closed this issue 5 years ago

f-lng commented 5 years ago

Hello,

Thank you very much for sharing this dataset, great work.

I have one question: Are the datapoints from "PAWS-Wiki Labeled (Final)" contained in "PAWS-Wiki Unlabeled (Final)" as well?

Thanks!

yuanzh commented 5 years ago

Thanks for your interest! The datapoints from "Labeled" are not contained in "Unlabeled".

f-lng commented 5 years ago

I found an overlap of 507 pairs, which is minimal, but I figured you would like to know anyway :-)

yuanzh commented 5 years ago

Nice catch! Thanks for letting us know the issue!
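
The thread does not say how the 507 overlapping pairs were counted. A minimal sketch of one way to run such a check is below. It assumes both releases are tab-separated files with `sentence1` and `sentence2` columns and that "overlap" means an exact string match on both sentences. The column names, helper functions, and sample rows are illustrative assumptions, not taken from the thread.

```python
import csv
import io


def load_pairs(tsv_file):
    """Read (sentence1, sentence2) pairs from an open TSV file object.

    Assumes a header row with 'sentence1' and 'sentence2' columns.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {(row["sentence1"], row["sentence2"]) for row in reader}


def overlapping_pairs(labeled, unlabeled):
    """Return the pairs present in both sets (exact string match)."""
    return labeled & unlabeled


# Tiny in-memory stand-ins for the real files (made-up rows).
labeled_tsv = io.StringIO(
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tA flew to B .\tB flew to A .\t0\n"
    "2\tC met D .\tD met C .\t1\n"
)
unlabeled_tsv = io.StringIO(
    "id\tsentence1\tsentence2\tnoisy_label\n"
    "7\tC met D .\tD met C .\t1\n"
    "8\tE saw F .\tF saw E .\t1\n"
)

shared = overlapping_pairs(load_pairs(labeled_tsv), load_pairs(unlabeled_tsv))
print(len(shared))  # number of pairs that appear in both files
```

For the real data, the `io.StringIO` objects would be replaced with files opened via `open(path, newline="", encoding="utf-8")`. Exact matching is the strictest definition of overlap. Normalizing case or whitespace, or treating a pair with its sentences swapped as the same pair, would give a different count.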