google-research-datasets / seahorse

Seahorse is a dataset for multilingual, multi-faceted summarization evaluation. It consists of 96K summaries with human ratings along 6 quality dimensions: comprehensibility, repetition, grammar, attribution, main idea(s), and conciseness, covering 6 languages, 9 systems and 4 datasets.

Encoding issues with non-ASCII characters #2

Closed Rexhaif closed 8 months ago

Rexhaif commented 9 months ago

Hi folks,

Thanks for releasing the dataset!

One thing that is bothering me is the encoding of non-ASCII characters. For some reason all dataset files appear to have been produced with an ASCII encoding, so every non-ASCII character is written as an escaped Unicode code point. For instance, a Russian summary looks like this (the gem-id for this example is xlsum_russian-validation-3133):

\u0412 \u041b\u043e\u043d\u0434\u043e\u043d\u0435 \u0432\u043f\u0435\u0440\u0432\u044b\u0435 \u0432 \u0438\u0441\u0442\u043e\u0440\u0438\u0438 \u043f\u043e\u044f\u0432\u0438\u043b\u0438\u0441\u044c \u0444\u043e\u0442\u043e\u0433\u0440\u0430\u0444\u0438\u0438 \u043f\u043e\u043c\u043e\u043b\u0432\u043a\u0438 \u043f\u0440\u0438\u043d\u0446\u0430 \u0413\u0430\u0440\u0440\u0438 \u0438 \u041c\u0435\u0433\u0430\u043d \u041c\u0430\u0440\u043a\u043b.

While it should look like this:

В Лондоне впервые в истории появились фотографии помолвки принца Гарри и Меган Маркл.
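For reference, the escaped form is just the same text with literal \uXXXX sequences; in Python it can be decoded back as a quick check (a minimal sketch, using only the first two words of the example above):

escaped = r"\u0412 \u041b\u043e\u043d\u0434\u043e\u043d\u0435"  # literal backslash escapes, as stored in the TSV
decoded = escaped.encode("latin-1").decode("unicode_escape")  # interpret the \uXXXX escapes
print(decoded)  # В Лондоне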

Of course, this can easily be fixed by the end user. They could, for instance, load the dataset like this:

import pandas as pd

# "unicode-escape" turns the escaped \uXXXX sequences back into real characters while loading
data = pd.read_csv("./seahorse_data/train.tsv", sep='\t', encoding='unicode-escape')
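If someone would rather fix the files once instead of decoding on every load, the corrected frame could also be written back out as plain UTF-8 (just a sketch; the output path is hypothetical):

data.to_csv("./seahorse_data/train_utf8.tsv", sep='\t', index=False, encoding='utf-8')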

Though I'm worried that some people might not notice this and end up stuck debugging models that fail on hand-crafted examples containing non-ASCII characters.

This encoding error has already propagated to some community-released versions of your dataset, like this one on the Hugging Face Hub, where it is no longer easily fixable.

Could you please update the dataset with fixed encoding?

eaclark07 commented 9 months ago

Thank you for pointing this out! We are fixing it; in the meantime, we've added a note about correcting the encoding when using the metrics.

eaclark07 commented 8 months ago

Updating to confirm that the dataset has the correct encoding now. Thanks again for pointing this out!