samwize / python-email-crawler

Searches Google and crawls the results for related emails

URL with UTF-8 char triggers crash #17

Closed freetom closed 6 years ago

freetom commented 6 years ago

The crawler crashes when it encounters a URL containing a non-ASCII character (e.g. 'ß').

Crash log:

...
[17:11:08] INFO::email_crawler - Crawling https://www.paginegialle.it/valle-aurina-bz/enti-turistici/alpinwellt-weißenbach
[17:11:09] ERROR::email_crawler - EXCEPTION: (pysqlite2.dbapi2.ProgrammingError) You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. [SQL: u'SELECT website.id, website.url, website.has_crawled, website.emails \nFROM website \nWHERE website.url = ?'
Traceback (most recent call last):
  File "email_crawler.py", line 217, in <module>
    crawl(arg)
  File "email_crawler.py", line 81, in crawl
    email_set = find_emails_2_level_deep(uncrawled.url)
  File "email_crawler.py", line 143, in find_emails_2_level_deep
    db.enqueue(link, list(email_set))
  File "/home/tomas/python-email-crawler2/database.py", line 35, in enqueue
    res = self.connection.execute(s)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 945, in execute
    return meth(self, multiparams, params)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1189, in _execute_context
    context)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1402, in _handle_dbapi_exception
    exc_info
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1182, in _execute_context
    context)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 470, in do_execute
    cursor.execute(statement, parameters)
ProgrammingError: (pysqlite2.dbapi2.ProgrammingError) You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings. [SQL: u'SELECT website.id, website.url, website.has_crawled, website.emails \nFROM website \nWHERE website.url = ?'] [parameters: ('https://www.paginegialle.it/valle-aurina-bz/enti-turistici/alpinwellt-wei\xc3\x9fenbach',)]

The issue may lie in the SQLAlchemy library rather than in the crawler itself, but I don't know for sure.
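For illustration, the error message itself suggests switching to Unicode strings, so one possible workaround is to decode byte-string URLs before they reach the database layer. The following is only a minimal sketch under that assumption, not necessarily how the repository was fixed: `to_text` is a hypothetical helper, and the in-memory `website` table is a stand-in for the crawler's real schema, queried through the standard `sqlite3` module rather than SQLAlchemy.

```python
import sqlite3


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return value as text, decoding byte strings."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Stand-in for the crawler's `website` table (assumed, simplified schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE website (id INTEGER PRIMARY KEY, url TEXT)")

# The UTF-8 encoded URL from the crash log, as a byte string.
raw_url = (
    b"https://www.paginegialle.it/valle-aurina-bz/enti-turistici/"
    b"alpinwellt-wei\xc3\x9fenbach"
)

# Decode once at the boundary, then pass only text to the database driver.
url = to_text(raw_url)

conn.execute("INSERT INTO website (url) VALUES (?)", (url,))
rows = conn.execute(
    "SELECT url FROM website WHERE url = ?", (url,)
).fetchall()
```

Decoding at the point where links are extracted, rather than inside the database code, would keep every later comparison and query working on the same string type.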

kopevgale commented 6 years ago

ok, maybe we can fix it

freetom commented 6 years ago

Furthermore, running the crawler with non-ASCII characters in the search term (such as 'à') also causes a crash:

[17:48:26] INFO::email_crawler - ----------------------------------------
[17:48:26] ERROR::email_crawler - EXCEPTION: 'ascii' codec can't decode byte 0xc3 in position 24: ordinal not in range(128) 
Traceback (most recent call last):
  File "email_crawler.py", line 217, in <module>
    crawl(arg)
  File "email_crawler.py", line 57, in crawl
    logger.info("Keywords to Google for: %s" % keywords)
  File "/usr/lib/python2.7/logging/__init__.py", line 1167, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python2.7/logging/__init__.py", line 1286, in _log
    self.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1296, in handle
    self.callHandlers(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 1336, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python2.7/logging/__init__.py", line 759, in handle
    self.emit(record)
  File "/home/tomas/python-email-crawler2/ColorStreamHandler.py", line 38, in emit
    record.msg = record.msg.encode('utf-8', 'ignore')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 24: ordinal not in range(128)
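In Python 2, calling `.encode('utf-8')` on a byte string first decodes it implicitly with the ASCII codec, which is what raises the `UnicodeDecodeError` above when the message contains the byte 0xc3. A minimal sketch of one way to avoid this, assuming the handler can simply decode byte-string messages explicitly before formatting them: `TextSafeHandler` and the logger name are hypothetical, and this is not the project's actual `ColorStreamHandler`.

```python
import io
import logging


class TextSafeHandler(logging.StreamHandler):
    """Hypothetical handler: decode byte-string messages before emitting."""

    def emit(self, record):
        if isinstance(record.msg, bytes):
            # Decode explicitly as UTF-8 so no implicit ASCII decode occurs.
            record.msg = record.msg.decode("utf-8", "replace")
        super().emit(record)


stream = io.StringIO()
logger = logging.getLogger("email_crawler_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(TextSafeHandler(stream))

# Simulate a search term that arrives as UTF-8 bytes.
keywords = "pizzeria città".encode("utf-8")
logger.info(b"Keywords to Google for: " + keywords)

output = stream.getvalue()
```

An alternative under the same assumption would be to decode the command-line arguments once at startup, so that the logger only ever sees text.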

freetom commented 6 years ago

Fixed