glottolog / glottolog-legacy

DEPRECATED. See https://github.com/clld/glottolog

Miscodings in languoid.csv #79

Closed jrpool closed 8 years ago

jrpool commented 8 years ago

In several places, the capital letter E with acute accent is miscoded as \u00c3\u2030 instead of \u00c3\u0089.

xflr6 commented 8 years ago

Are you sure you refer to the right code points?

>>> import unicodedata
>>> unicodedata.name(u'\u00c3')
'LATIN CAPITAL LETTER A WITH TILDE'
>>> unicodedata.name(u'\u2030')
'PER MILLE SIGN'
>>> unicodedata.name(u'\u0089', None)
>>> unicodedata.normalize('NFC', u'\N{LATIN CAPITAL LETTER E WITH ACUTE}')
u'\xc9'
>>> unicodedata.normalize('NFD', u'\N{LATIN CAPITAL LETTER E WITH ACUTE}')
u'E\u0301'

Maybe you did not use the right encoding when reading the file.

>>> import io, collections
>>> chars = collections.Counter(io.open('languoid.csv', encoding='utf-8').read())
>>> [chars[c] for c in u'\u00c3\u2030\u0089']
[0, 0, 0]
>>> [chars[c] for c in u'E\u00c9\u0301']
[9829, 0, 0]

jrpool commented 8 years ago

I am having difficulty understanding this reply. Here is one example of the pattern that I reported:

assiniboine (\u00c3\u2030tats-Unis d'Am\u00c3\u00a9rique)

The character before "tats" is, presumably, intended to be É.

xflr6 commented 8 years ago

Thanks for the extended example. This is mojibake inside the (unicode-escaped) JSON fields of unesco in the jsondata column (mostly, but maybe not entirely, from UTF-8 being interpreted as cp1252).
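
A minimal sketch of that diagnosis, assuming Python 3 and that the damage comes purely from UTF-8 bytes being read as cp1252. The helper `fix_mojibake` and the sample string are illustrative only and are not part of the Glottolog loader. It also shows why counting characters in the file found nothing: the file stores ASCII-only `\uXXXX` escapes, which become the mojibake characters only after JSON decoding.

```python
import json


def fix_mojibake(text):
    """Undo 'UTF-8 read as cp1252' damage (hypothetical helper).

    Returns the text unchanged if it does not fit that pattern.
    """
    try:
        return text.encode('cp1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# How the damage arises: the UTF-8 bytes of U+00C9 are C3 89, and cp1252
# maps those two bytes to U+00C3 and U+2030.
garbled_e = '\u00c9'.encode('utf-8').decode('cp1252')
assert garbled_e == '\u00c3\u2030'

# The value as it sits in the file: a raw string, so each \uXXXX is a
# literal backslash sequence made of ASCII characters only.
escaped = r"assiniboine (\u00c3\u2030tats-Unis d'Am\u00c3\u00a9rique)"
assert all(ord(c) < 128 for c in escaped)

# JSON decoding turns the escapes into the mojibake characters ...
decoded = json.loads('"' + escaped + '"')
assert '\u00c3\u2030' in decoded

# ... and re-encoding as cp1252, then decoding as UTF-8, restores the text.
repaired = fix_mojibake(decoded)
assert repaired == "assiniboine (\u00c9tats-Unis d'Am\u00e9rique)"
```

Falling back to the unchanged input keeps correctly encoded values intact, but it would also silently skip values garbled in some other way, so those would still need a manual check.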

Looks as if this is already present in the source XLS file, so we might report upstream.

@xrotwang: Maybe loader/unesco can use the XML file instead of the broken XLS?