sherpya / geolite2legacy

MaxMind GeoLite2 (CSV) to Legacy format converter
MIT License

Question to encoding #14

Open HansMeiser234 opened 5 years ago

HansMeiser234 commented 5 years ago

Hello,

thanks for your converter. I want to use it, but I currently have an issue with encoding. I tested this IP: 91.38.193.110. The city is called Füssen (German umlaut). When using the converted db on the console, geoiplookup shows the city as Füssen. In my UTF-8 PuTTY session I would expect to see a correct umlaut when using this encoding. What's wrong here?

Source data: https://geolite.maxmind.com/download/geoip/database/GeoLite2-City-CSV.zip

Command:

geolite2legacy.py -i GeoLite2-City-CSV.zip -f geoname2fips.csv -e utf-8 -o GeoLiteCity.dat

In the CSV file itself the umlaut seems to be correct: I can see a gorgeous ü when grepping in GeoLite2-City-Locations-de.csv.

What do you think?

Thanks, Hans

sherpya commented 5 years ago

Hi, the GeoLite legacy database was not made for UTF-8, so clients may or may not handle it correctly. The fact that the ü is displayed as two characters makes me think that it is UTF-8 encoded in the dat file and that the client is garbling the output because it does not decode it as UTF-8.

What version of python are you using? 2 or 3?

sherpya commented 5 years ago

For example, pygeoip is not able to handle UTF-8 data. With python3 (or python2 with a working UTF-8 locale) you can use a trick to correctly pick up the UTF-8 string (I tried to play with the internals instead, but it is not easy to fix):

import pygeoip
m = pygeoip.GeoIP('x.dat')
city = m.record_by_addr('91.38.193.110')['city']
city = city.encode(pygeoip.ENCODING).decode('utf-8')  # iso-8859-1
print(city)

result: Füssen
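The same round trip can be reproduced without pygeoip or a database file. The following is only a minimal sketch of the mechanism described above, assuming the client decodes the stored bytes as ISO-8859-1; the variable names are illustrative and not part of geolite2legacy or pygeoip.

```python
# The city name as it appears in the GeoLite2 CSV (a Unicode string).
city = "Füssen"

# Bytes as they would be stored when converting with -e utf-8.
stored = city.encode("utf-8")

# Assumption: the client treats every byte as one ISO-8859-1 character,
# so the two-byte sequence for "ü" turns into two separate characters.
misread = stored.decode("iso-8859-1")

# Undoing the wrong decode and decoding as UTF-8 recovers the name.
repaired = misread.encode("iso-8859-1").decode("utf-8")

print(repr(misread))
print(repr(repaired))
```

This only works as long as the client passes the bytes through unchanged; a client that replaces or drops bytes it cannot display would lose the information needed for the repair.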

HansMeiser234 commented 5 years ago

Hello,

thanks for your answer.

"but it is the client screwing the output, because it does not decoded it as utf-8."

I think so too. /usr/bin/geoiplookup is part of the bundled geoip-bin package and may expect data in iso-8859-1.

"What version of python are you using? 2 or 3?"

This is python2. In python3 (3.6.7) there is no module ipaddr (https://pypi.org/project/ipaddr/); the alternative is called ipaddress, which provides different functions than ipaddr. So I think the current geolite2legacy.py cannot be used with python3.

If I explicitly use -e iso-8859-1 for encoding, I receive a lot of errors like this:

Warning cannot encode u'Hachi\u014dji' using iso-8859-1

I took a closer look at the data in the CSV file. I see a lot of city and region names (for example Hachiōji in Japan) containing characters which I think cannot be converted to ISO-8859-1. This may be the reason why you use UTF-8 as the default encoding. Is this a change in the new db format? Did they exclude such cities in former versions of the legacy db? Unfortunately I don't know specific IPs to test the former output of geoiplookup.

Thanks, Hans
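The warning quoted above can be reproduced in isolation. This is a small sketch, not code from geolite2legacy; the helper name latin1_failures is hypothetical and exists only for illustration. It lists the characters of a name that ISO-8859-1 cannot represent and shows what a lossy encode produces instead.

```python
def latin1_failures(name):
    """Return the characters in name that ISO-8859-1 cannot encode."""
    bad = []
    for ch in name:
        try:
            ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


for name in ("Füssen", "Hachi\u014dji"):
    # errors="replace" substitutes "?" for every unencodable character.
    lossy = name.encode("iso-8859-1", errors="replace")
    print(name, latin1_failures(name), lossy)
```

"Füssen" encodes cleanly because ü is part of ISO-8859-1, while the ō (U+014D) in "Hachiōji" is not, so any single-byte Latin-1 database has to drop, replace, or transliterate it.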

HansMeiser234 commented 5 years ago

Sorry, inline commenting failed in the above text. Also, my text is marked as a comment. Hans

HansMeiser234 commented 5 years ago

Hello,

while testing I discovered another issue: a lot of city data for Japanese cities is missing. For example, I tested with 117.55.223.153 and get this result with the old db:

GeoIP Country Edition: JP, Japan
GeoIP City Edition, Rev 1: JP, 19, Kanagawa, Kawasaki, 210-0835, 35.520599, 139.717194, 0, 0
GeoIP ASNum Edition: AS10021 KVH Co.,Ltd

The converted version shows:

GeoIP Country Edition: JP, Japan
GeoIP City Edition, Rev 1: JP, 00, N/A, N/A, N/A, 35.689999, 139.690002, 0, 0
GeoIP ASNum Edition: AS10021 KVH Co.,Ltd

In the zipped CSV files GeoLite2-City-Locations-de.csv and GeoLite2-City-Locations-en.csv I can find these cities. Is it possible that there was a loss in conversion? I do it this way:

geolite2legacy.py -i GeoLite2-City-CSV.zip -f geoname2fips.csv -o GeoLiteCity.dat

Thanks, Hans

HansMeiser234 commented 5 years ago

Gianluigi, still alive? I thought you were interested in these things?

sherpya commented 5 years ago

Are you sure? With that IP I get:

{
    "area_code": 0,
    "city": "Toshima",
    "continent": "AS",
    "country_code": "JP",
    "country_code3": "JPN",
    "country_name": "Japan",
    "dma_code": 0,
    "latitude": 35.72630000000001,
    "longitude": 139.6859,
    "metro_code": null,
    "postal_code": "171-0052",
    "region_code": "00",
    "time_zone": "Asia/Tokyo"
}

HansMeiser234 commented 5 years ago

Did you change something? I downloaded the latest geolite2legacy.py and the db data again. Now I get:

GeoIP City Edition, Rev 1: JP, 00, N/A, Toshima, 171-0052, 35.726299, 139.685898, 0, 0

sherpya commented 5 years ago

python3 using GeoLite2-City-CSV_20190610.zip

hege-li commented 4 years ago

Check my patch to convert names to plain ASCII with unidecode; I have been using it for a while.

https://github.com/sherpya/geolite2legacy/pull/21

$ ./geolite2legacy.py -i GeoLite2-City-CSV.zip -o GeoIPCity.dat
$ geoiplookup -f GeoIPCity.dat 178.17.166.99
GeoIP City Edition, Rev 1: MD, 00, N/A, FÃ ÂleÃÂti, MD-5901, 47.573601, 27.709200, 0, 0

$ ./geolite2legacy.py -i GeoLite2-City-CSV.zip -e latin-1 -o GeoIPCity_latin1.dat
$ geoiplookup -f GeoIPCity_latin1.dat 178.17.166.99
GeoIP City Edition, Rev 1: MD, 00, N/A, F?le?ti, MD-5901, 47.573601, 27.709200, 0, 0

$ ./geolite2legacy.py -i GeoLite2-City-CSV.zip -e latin-1 -u -o GeoIPCity_latin1_unidecode.dat
$ geoiplookup -f GeoIPCity_latin1_unidecode.dat 178.17.166.99
GeoIP City Edition, Rev 1: MD, 00, N/A, Falesti, MD-5901, 47.573601, 27.709200, 0, 0
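The idea behind the third run, turning accented names into plain ASCII before writing them, can be approximated with the Python standard library alone. This is a rough sketch and not the code from the pull request: it uses unicodedata instead of unidecode, the function name to_ascii is hypothetical, and the spelling "Fălești" for the city in the example is an assumption based on the output above.

```python
import unicodedata


def to_ascii(name):
    """Approximate an ASCII transliteration by stripping combining marks."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII (e.g. "ß") becomes "?".
    return stripped.encode("ascii", errors="replace").decode("ascii")


for name in ("F\u0103le\u0219ti", "Hachi\u014dji", "Füssen"):
    print(name, "->", to_ascii(name))
```

Unlike this sketch, unidecode also maps letters that have no decomposition (such as ß or non-Latin scripts) to ASCII approximations, so the two approaches only agree on names made of base letters with diacritics.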

HansM-200 commented 3 years ago

Hello,

for the sake of completeness I have to tell you that I can no longer test this. Two years ago I changed companies and now work on completely different topics. I forwarded this to my old colleagues back then and I think they still use it. Just in case you wonder about missing comments ;)

Hans