google-research-datasets / paws

This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification.

Tokens split by space in English text #13

Open PhilipMay opened 3 years ago

PhilipMay commented 3 years ago

Hi, it looks like the English sentences have been tokenized, with the tokens separated by spaces. For example:

[...] Preserve , known as Palos Verdes Peninsula of California .

The German texts, by contrast, do not have these extra spaces:

[...] können, sind die Ergebnisse hoch.

Can you provide the English texts without those spaces, i.e. the untokenized originals?

yuanzh commented 3 years ago

Hi,

Thanks for reporting the issue. Unfortunately, we no longer have the texts from before tokenization. I believe the tokenization was done with nltk.word_tokenize, the same tokenizer used for QQP (https://github.com/google-research-datasets/paws/blob/master/qqp_generate_data.py).
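
If approximately readable text is enough, one workaround is to reverse most of the spacing with a detokenizer. The sketch below uses NLTK's TreebankWordDetokenizer; this is not something the dataset or the PAWS scripts provide, and it is a lossy heuristic rather than an exact inverse of the original tokenization.

```python
# Rough sketch: approximately undo the space-separated tokenization of the
# English PAWS sentences. Not an official tool; the result may differ from
# the original untokenized text in edge cases (quotes, hyphens, etc.).
from nltk.tokenize.treebank import TreebankWordDetokenizer

detok = TreebankWordDetokenizer()

tokenized = "Preserve , known as Palos Verdes Peninsula of California ."
tokens = tokenized.split(" ")

print(detok.detokenize(tokens))
# e.g. "Preserve, known as Palos Verdes Peninsula of California."
```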