skillachie / news-corpus-builder

Automatic News Corpus Builder
MIT License

encoding issue for example #3

Open jdxyw opened 7 years ago

jdxyw commented 7 years ago

Hi

I am trying example.py, but I got the error below. Is this a library issue?

Traceback (most recent call last):
  File "example.py", line 94, in <module>
    ex.generate_corpus(article_links)
  File "/home/work/.jumbo/lib/python2.7/site-packages/news_corpus_builder/news_corpus_generator.py", line 101, in generate_corpus
    'category':category})
  File "/home/work/.jumbo/lib/python2.7/site-packages/news_corpus_builder/news_corpus_generator.py", line 105, in _save_article
    print "Saving article %s..." %(clean_article['title'])
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2018' in position 39: ordinal not in range(256)

rscarson commented 7 years ago

Also encountering this, but with charmap codec, not latin-1

I fixed it by replacing line 105 with:

print "Saving article %s..." %(clean_article['title'].encode("utf-8"))

in news_corpus_generator.py
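A variation on the same idea, as a minimal sketch: instead of always encoding to UTF-8, replace only the characters that the output encoding cannot represent. The helper name `safe_text` is hypothetical and not part of news_corpus_builder; the sample title is made up for illustration.

```python
import sys


def safe_text(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'.

    Hypothetical helper, not part of news_corpus_builder.
    """
    # Fall back to UTF-8 when stdout has no declared encoding (e.g. when piped).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    # Round-trip through the target encoding; unencodable characters become '?'.
    return text.encode(encoding, errors="replace").decode(encoding)


# Made-up title containing the curly quote (u'\u2018') from the traceback.
title = u"\u2018Quoted\u2019 headline"
print("Saving article %s..." % safe_text(title, "latin-1"))
```

With this approach the log line stays readable on a latin-1 or charmap console, at the cost of showing `?` in place of characters such as curly quotes.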

shaffiqkhan commented 7 years ago

First of all, thank you for such an awesome tool. I need to know three things:

  1. Can we query Google News for a sentence instead of words?
  2. How can I extract the creation time of a news article from the web page, along with the body and title?
  3. How can I extract news from a specific time period (e.g. a specific date)?

Thanks in advance.

skillachie commented 7 years ago

Will take a look and push an update soon. This was not meant for Py3; I will look at making it compatible with both Python 2.x and 3.x.

  1. Can we query Google News for a sentence instead of words? You can do that currently by putting the sentence in quotes: "This is the sentence you would like to search".

  2. If I want to extract the creation time of news from the web page along with body and title: it might not be possible to obtain the creation time for all news articles, but an update could be made to extract and save the article date.

Will do it once I have the extra time. Feel free to submit a pull request.

  3. If I want to extract news from a specific time period (e.g. a specific date): searching for a specific date range might not be possible with news.google.com.

Will probably have to switch to Google Web Search or another endpoint if news.google.com does not offer a date parameter.
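For the exact-phrase search mentioned in the first answer, here is a rough sketch of building a quoted query string. The helper `exact_phrase_query` is hypothetical and not part of news_corpus_builder; how the resulting string is passed to the library is not shown here.

```python
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2


def exact_phrase_query(sentence):
    """Wrap a sentence in double quotes so it is searched as one phrase.

    Hypothetical helper, not part of news_corpus_builder.
    """
    # Drop surrounding whitespace and any quotes the caller already added.
    cleaned = sentence.strip().strip('"').strip()
    return '"%s"' % cleaned


query = exact_phrase_query("This is the sentence you would like to search ")
print(query)              # the phrase wrapped in double quotes
print(quote_plus(query))  # URL-encoded form, in case the query goes into a URL
```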