[Closed] jrpool closed this issue 8 years ago.
Are you sure you refer to the right code points?
>>> import unicodedata
>>> unicodedata.name(u'\u00c3')
'LATIN CAPITAL LETTER A WITH TILDE'
>>> unicodedata.name(u'\u2030')
'PER MILLE SIGN'
>>> unicodedata.name(u'\u0089', None)
>>> unicodedata.normalize('NFC', u'\N{LATIN CAPITAL LETTER E WITH ACUTE}')
u'\xc9'
>>> unicodedata.normalize('NFD', u'\N{LATIN CAPITAL LETTER E WITH ACUTE}')
u'E\u0301'
Maybe you did not use the right encoding when reading the file.
>>> import io, collections
>>> chars = collections.Counter(io.open('languoid.csv', encoding='utf-8').read())
>>> [chars[c] for c in u'\u00c3\u2030\u0089']
[0, 0, 0]
>>> [chars[c] for c in u'E\u00c9\u0301']
[9829, 0, 0]
I am having difficulty understanding this reply. Here is one example of the pattern that I reported:
assiniboine (\u00c3\u2030tats-Unis d'Am\u00c3\u00a9rique)
The character before "tats" is, presumably, intended to be É.
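For what it's worth, that exact two-character pattern is what you get when the UTF-8 bytes of É are decoded as cp1252. A quick reproduction (Python 3):

```python
# 'É' (U+00C9) encodes to the two UTF-8 bytes C3 89; decoding those
# bytes as cp1252 yields U+00C3 (Ã) followed by U+2030 (‰) -- the
# pattern seen before "tats" in the data.
s = '\N{LATIN CAPITAL LETTER E WITH ACUTE}'
raw = s.encode('utf-8')          # b'\xc3\x89'
mojibake = raw.decode('cp1252')
print(repr(mojibake))            # '\xc3\u2030', i.e. 'Ã‰'
```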
Thanks for the extended example. This is mojibake inside the (unicode-escaped) JSON fields of `unesco` in the `jsondata` column (mostly, but maybe not all, from interpreting UTF-8 as cp1252).
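Assuming the UTF-8-as-cp1252 diagnosis is right, the damage is reversible for affected fields: re-encode the mojibake as cp1252 to recover the original UTF-8 bytes, then decode them correctly. Using the string from the example above:

```python
# Undo a UTF-8 -> cp1252 mojibake round trip for the reported string.
mojibake = "\u00c3\u2030tats-Unis d'Am\u00c3\u00a9rique"
repaired = mojibake.encode('cp1252').decode('utf-8')
print(repaired)  # États-Unis d'Amérique
```

This only works for fields that went through the cp1252 path; strings containing characters with no cp1252 encoding would raise `UnicodeEncodeError` and need separate handling.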
Looks as if this is already present in the source XLS file, so we might report upstream.
@xrotwang: Maybe loader/unesco can use the XML file instead of the broken XLS?
In several places, the capital letter E with acute accent is miscoded as \u00c3\u2030 instead of \u00c3\u0089.
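The \u2030 rather than \u0089 is itself the clue pointing at cp1252: byte 0x89 (the second UTF-8 byte of É) has no printable character in latin-1, but cp1252 maps it to the per mille sign.

```python
# The same byte, decoded under the two candidate single-byte encodings:
assert b'\x89'.decode('cp1252') == '\u2030'   # PER MILLE SIGN
assert b'\x89'.decode('latin-1') == '\u0089'  # C1 control character
```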