Closed Rexhaif closed 8 months ago
Thank you for pointing this out! We are fixing it and, in the meantime, have added a note about correcting the encoding when using the metrics.
Updating to confirm that the dataset now has the correct encoding. Thanks again for pointing this out!
Hi folks,
Thanks for releasing the dataset!
One thing is bothering me: the encoding of non-ASCII characters. For some reason, all dataset files seem to have been produced with an ASCII encoding, and all non-ASCII characters are written as escaped Unicode code points. For instance, a Russian summary looks like this (the gem-id for this example is xlsum_russian-validation-3133):
While it should look like this:
Of course, this can easily be fixed by the end user. They could, for instance, load the dataset like this:
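A minimal sketch of one such user-side fix, assuming the loaded strings still contain literal \uXXXX sequences (a backslash, the letter u, and four hex digits) instead of the real characters. The helper name decode_unicode_escapes is hypothetical and is not part of the dataset or of any library; it only illustrates the idea.

```python
import re

# Matches a literal backslash, "u", and four hex digits, e.g. \u041f.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_unicode_escapes(text: str) -> str:
    """Replace literal \\uXXXX sequences with the characters they name.

    Hypothetical helper: assumes the text holds un-decoded escapes.
    Characters that are already correct are left untouched.
    """
    decoded = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Escapes above U+FFFF arrive as UTF-16 surrogate pairs; a round trip
    # through UTF-16 merges each pair into a single code point.
    return decoded.encode("utf-16", "surrogatepass").decode("utf-16", "surrogatepass")


escaped = "\\u041f\\u0440\\u0438\\u0432\\u0435\\u0442"
print(decode_unicode_escapes(escaped))
```

A regular expression is used here rather than the unicode_escape codec because that codec assumes Latin-1 input and would corrupt any text that is already correctly decoded. Note that this sketch would also rewrite an intentionally escaped backslash followed by uXXXX, so it should only be applied to fields known to be affected.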
Still, I'm worried that some people might not notice this and end up stuck debugging models that fail on hand-crafted examples containing non-ASCII characters.
This encoding error has already propagated to some community-released versions of your dataset, like this one on the HF Hub, where it is no longer easily fixable.
Could you please update the dataset with the encoding fixed?