grangier / python-goose

HTML Content / Article Extractor, web scraping lib in Python
Apache License 2.0
3.98k stars 788 forks

Unicode encoding problems while checking stop words #136

Open vladimir-shmidt opened 10 years ago

vladimir-shmidt commented 10 years ago

I tried to extract a Russian article, but goose produced an empty result. While debugging, I found that the extracted content (text from a p tag) cannot be found in the loaded stop list, even though it is 100% in the stop list. So I suppose it is a string equality problem in Python or something similar. In the bottom-right corner of the screenshot I've added watch items: the current word, the result of the equality check against the set, and the position of the current word in the stop-word list.

vladimir-shmidt commented 10 years ago

I suppose the following change in class StopWords(object) will solve the issue: self._cached_stop_words[language] = set(FileHelper.loadResourceFile(path).encode('utf-8').splitlines())
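A minimal sketch of the kind of mismatch suspected here, written in Python 3 so it runs as-is (goose targets Python 2, where the analogous mismatch is between str and unicode). It assumes the extracted words and the stop-list entries end up as different string types; STOP_WORDS_TEXT, load_stop_words and count_stop_words are hypothetical names for illustration, not goose's API.

```python
# Hypothetical stand-in for the contents of a stop-word resource file.
STOP_WORDS_TEXT = "и\nв\nне\nна\n"


def load_stop_words(text, as_bytes=False):
    """Build a stop-word set, optionally as UTF-8 encoded byte strings."""
    if as_bytes:
        return set(text.encode("utf-8").splitlines())
    return set(text.splitlines())


def count_stop_words(words, stop_words):
    """Count how many of the given words are members of the stop-word set."""
    return sum(1 for word in words if word in stop_words)


# Assumption: the extractor yields UTF-8 byte strings for each word.
extracted = [w.encode("utf-8") for w in "я не был в москве".split()]

text_set = load_stop_words(STOP_WORDS_TEXT)                 # set of text strings
byte_set = load_stop_words(STOP_WORDS_TEXT, as_bytes=True)  # set of byte strings

mismatched = count_stop_words(extracted, text_set)  # bytes vs. text: never equal
matched = count_stop_words(extracted, byte_set)     # bytes vs. bytes: compares equal
```

With mixed types every lookup misses, so the stop-word count stays at zero even though the words look identical in a debugger. Encoding both sides the same way, as the proposed change does for the stop list, makes the lookups succeed.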

grangier commented 10 years ago

I suppose your stopword file is not correctly encoded.
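One way to test that hypothesis is to check whether the file's raw bytes decode as UTF-8 and whether they start with a byte-order mark. This is only a sketch: check_utf8 is a hypothetical helper, and the sample byte strings stand in for what open(path, "rb").read() would return for the actual stop-word file.

```python
import codecs


def check_utf8(raw):
    """Return (is_valid_utf8, has_bom) for a bytes object."""
    has_bom = raw.startswith(codecs.BOM_UTF8)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False, has_bom
    return True, has_bom


# Sample inputs standing in for file contents read in binary mode.
good = "и\nв\n".encode("utf-8")    # correctly encoded stop words
bad = "и\nв\n".encode("cp1251")    # same words in a legacy Cyrillic encoding

good_result = check_utf8(good)
bad_result = check_utf8(bad)
```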

vladimir-shmidt commented 10 years ago

I haven't changed anything in it.