nspcc-dev / locode-db

Source of UN/LOCODE database generated by NeoFS CLI.
MIT License

Utf 8 SubdivisionCodes #23

Closed AliceInHunterland closed 10 months ago

AliceInHunterland commented 10 months ago

ref. https://github.com/nspcc-dev/neofs-node/issues/2637 Can we rely on the data, which @carpawell proposed? Also, I found https://www.ip2location.com/free/iso3166-2, but it contains less information.

So the problem is in the "long" description? That is strange. It seems that SubdivisionCodes.csv is the only file with encoding problems, but there is a UTF-8 version of it: https://github.com/datasets/un-locode.

carpawell commented 10 months ago

Can we rely on the data, which @carpawell proposed?

If you want to do it, double-check it, please. The only thing I wanted to say was that it is possible to format it and somebody has done it successfully. We should either find a good, available source of UTF-8 files or make it possible to format the file ourselves automatically, because the database is updated at least once per year.

BTW, the encoding of the SubdivisionCodes.csv file is still unknown to me.

AliceInHunterland commented 10 months ago

@carpawell @roman-khimov i've converted our csv to valid utf-8 strings. Some symbol mismatches remain (fewer than 100 records have "?" inside) and can be overwritten manually, but it won't break IR as it does now.

carpawell commented 10 months ago

i've converted our csv to valid utf-8 strings

@AliceInHunterland, how did you do this? Manually?

AliceInHunterland commented 10 months ago

i've converted our csv to valid utf-8 strings

@AliceInHunterland, how did you do this? Manually?

with charmap.ISO8859_1 and charmap.Windows1256

carpawell commented 10 months ago

with charmap.ISO8859_1 and charmap.Windows1256

Oh, sorry, I had not looked at the code yet when I wrote the comment.