ushahidi / geograpy

Extract countries, regions and cities from a URL or text

UnicodeDecodeError: 'charmap' codec can't decode... #29

Open VanessaVanG opened 6 years ago

VanessaVanG commented 6 years ago

I followed @PandaWhoCodes's install steps:

pip install git+https://github.com/reach2ashish/geograpy.git

plus

nltk.downloader.download('maxent_ne_chunker')
nltk.downloader.download('words')
nltk.downloader.download('treebank')
nltk.downloader.download('maxent_treebank_pos_tagger')
nltk.downloader.download('punkt')
nltk.download('averaged_perceptron_tagger')

and it seemed to be going well until I tried the example:

url = 'http://www.bbc.com/news/world-europe-26919928'
places = geograpy.get_place_context(url=url)

I get UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 274: character maps to <undefined>

Python 3.6, Windows.

Any thoughts? (Or alternatives? I need to pull out city names. I've used GeoText for the country names (not positive it's working right yet), but GeoText's city extraction doesn't work very well.)

sergeiGKS commented 5 years ago

Same issue.

sergeiGKS commented 5 years ago

@VanessaVanG,

in line 25 of the places.py file, instead of

with open(cur_dir+"/data/GeoLite2-City-Locations.csv", "rb") as info:

put this:

with open(cur_dir+"/data/GeoLite2-City-Locations.csv", "rt", encoding="utf-8") as info:
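A minimal sketch of why an explicit encoding can matter here. It uses a temporary CSV with a made-up row instead of geograpy's data file; the file name, the row contents, and the `read_rows` helper are all assumptions for illustration. The character "č" is stored in UTF-8 as the bytes C4 8D, and Python's cp1252 codec (the usual Windows default for Western locales) has no mapping for 0x8D, which matches the byte named in the error above.

```python
import csv
import os
import tempfile

# Hypothetical sample row; "č" (U+010D) is encoded as bytes C4 8D in UTF-8.
row = ["hypothetical-id", "Vinča"]

path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerow(row)


def read_rows(csv_path, encoding):
    """Read all CSV rows from csv_path using the given text encoding."""
    with open(csv_path, "rt", encoding=encoding, newline="") as f:
        return list(csv.reader(f))


# Decoding the UTF-8 bytes as cp1252 fails on byte 0x8d.
try:
    read_rows(path, "cp1252")
    cp1252_failed = False
except UnicodeDecodeError as exc:
    cp1252_failed = True
    print("cp1252:", exc)

# Decoding with the encoding the file was written in succeeds.
rows = read_rows(path, "utf-8")
print("utf-8:", rows)
```

This only demonstrates the general mechanism. Whether it is the sole cause in geograpy depends on how the installed copy of places.py opens the file.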

srinisc commented 5 years ago

Will this issue be fixed in an upcoming release?

ghost commented 5 years ago

Unfortunately that fix still doesn't work for me @sergeiGKS

yougha54 commented 5 years ago

@VanessaVanG @sergeiGKS if you delete the 4th line of data/GeoLite2-City-Locations.csv, it should work.

SamDean332 commented 4 years ago

I am still getting this even after trying both fixes. I know it is because the CSV file contains quite a few odd characters, but specifying the encoding does not seem to help. I can remove characters one by one, which changes the error position, but I do not know how to catch them all.

Python 3.7 on Windows 10

urls = hits['link'].values
for url in urls:
    place = geograpy.get_place_context(url=url)
    print(place)
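Rather than removing characters one at a time, one option is to scan the file in binary mode and report every line that fails to decode. The sketch below is an illustration, not part of geograpy: `find_undecodable_lines` is a hypothetical helper, and the demo runs on a small temporary file with a deliberately bad byte rather than on the real GeoLite2 CSV.

```python
import os
import tempfile


def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, error_message) for each line that fails to decode."""
    bad = []
    with open(path, "rb") as f:  # binary mode, so reading itself never raises
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError as exc:
                bad.append((lineno, str(exc)))
    return bad


# Demo on a throwaway file: line 2 contains a stray 0x8d byte.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "wb") as f:
    f.write(b"1,ok\n")
    f.write(b"2,bad \x8d\n")
    f.write("3,Vin\u010da\n".encode("utf-8"))

bad_lines = find_undecodable_lines(demo_path)
for lineno, message in bad_lines:
    print(f"line {lineno}: {message}")
```

To check the real file, pass the path of data/GeoLite2-City-Locations.csv inside the installed package. An empty result would suggest the file is valid UTF-8 and that the error comes from how it is opened, not from its contents.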

SamDean332 commented 4 years ago

After some investigation, this is a Windows vs. Linux error in some cases. Even using

with open(cur_dir + "/data/GeoLite2-City-Locations.csv", encoding="utf-8") as info:

I could not resolve the error on my Windows computer. However, the exact same code ran fine on a Linux computer I also use. I looked in the City-Locations.csv file on Linux, and it appeared LibreOffice had automatically encoded and/or resolved all the characters, whereas looking at the same file in Excel, I still saw all the funky characters causing the error. Excel for some reason insists on keeping the odd characters.
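One known source of Windows vs. Linux differences of this kind is the default text encoding. On the Python 3.6 and 3.7 versions mentioned in this thread, `open()` without an `encoding=` argument uses `locale.getpreferredencoding(False)`, which is typically a code page such as cp1252 on Windows and UTF-8 on most Linux systems. The sketch below only prints that default. It does not explain a failure when `encoding="utf-8"` is passed explicitly, so in that case it may be worth checking whether the edited file is the copy that is actually imported, or whether another `open()` call still relies on the default.

```python
import codecs
import locale

# Encoding that open() falls back to when no encoding= argument is given
# (true for the Python 3.6/3.7 versions discussed in this thread).
default_name = locale.getpreferredencoding(False)

# Normalize aliases such as "UTF-8", "utf8" or "cp65001" to a canonical name.
canonical = codecs.lookup(default_name).name
is_utf8 = canonical == "utf-8"

print(f"default text encoding: {default_name} (UTF-8: {is_utf8})")
```

Running this on both machines shows whether they disagree about the default.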