scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

configure_logging unable to handle GBK #3057

Closed NewUserHa closed 6 years ago

NewUserHa commented 6 years ago

I added the following, taken from the Scrapy documentation, below my class definition:

import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)
logging.basicConfig(
    filename='log.txt',
    format='%(levelname)s: %(message)s',
    level=logging.INFO
)

Then I get:

--- Logging error ---
Traceback (most recent call last):
  File "C:\Program Files\Python35\lib\logging\__init__.py", line 982, in emit
    stream.write(msg)
UnicodeEncodeError: 'gbk' codec can't encode character '\ufe0f' in position 190: illegal multibyte sequence
Call stack:
  ...
  File "C:\Program Files\Python35\lib\site-packages\scrapy\core\scraper.py", line 237, in _itemproc_finished
    logger.log(*logformatter_adapter(logkws), extra={'spider': spider})
Message: 'Scraped from %(src)s\r\n%(item)s'
Arguments: {'src': <200 http://...>, 'item': {'date': '12-02', 'floor': '...\n 电\ufe0f', 'pics': ['...', ...]}}

I also have 'LOG_LEVEL': 'INFO' in the spider.

I googled and have no idea how to fix this. Any help, please? Thanks!

jschnurr commented 6 years ago

It looks like the logger is throwing UnicodeEncodeError when trying to print your scraped data to the console log. There are characters in your scraped data that cannot be encoded with the gbk codec, which I presume is your system default.
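The failure is easy to reproduce in isolation, without Scrapy or the logging machinery at all; this sketch just encodes the offending character with the gbk codec from the reported traceback:

```python
# '\ufe0f' (VARIATION SELECTOR-16, commonly attached to emoji) exists
# in Unicode but has no mapping in the gbk codec, so encoding fails
# with the same error message seen in the traceback above.
text = '电\ufe0f'

try:
    text.encode('gbk')
except UnicodeEncodeError as err:
    print(err.reason)  # illegal multibyte sequence
```

The logger hits exactly this exception inside `emit()` when the stream it writes to is opened with the system codec.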

I get a similar result (even without the file logging) - but I have to force the encoding, since my system uses UTF-8:

import sys, io, logging
from scrapy.utils.log import configure_logging
from scrapy.spiders import Spider

# override the encoding for the logging streamhandler
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='gbk')

class PythonSpider(Spider):
    name = 'myspider'
    start_urls = ['http://www.python.org']

    def parse(self, response):
        return {'name': '\ufe0f'}
scrapy runspider test.py
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.5/logging/__init__.py", line 983, in emit
    stream.write(msg)
UnicodeEncodeError: 'gbk' codec can't encode character '\ufe0f' in position 104: illegal multibyte sequence

In this case, scrapy is trying to log the item {'name': '\ufe0f'}, which cannot be encoded by the default streamhandler STDERR, which is set to gbk.

If you run python -c 'import locale; print(locale.getpreferredencoding())', you can see the encoding the logger will try and use.

You can workaround the problem by encoding with utf-8 in the spider itself:

return {'name': '\ufe0f'.encode('utf-8')}

Perhaps others have thoughts on the best place for a fix.

NewUserHa commented 6 years ago

My console displays fine; the error happens when writing to the file. Note the filename='log.txt' in my code above.

I have many items extracted from HTML; I don't think I can add .encode() to every field. Everything works fine except the new file logging at a different log level in addition to the console/std output. (And I don't know their encoding; maybe they are just Unicode code points?)

jschnurr commented 6 years ago

Is the error displaying on the console or in the log.txt file? What is the output of python -c 'import locale; print(locale.getpreferredencoding())' on your system?

NewUserHa commented 6 years ago

It displays on the console, of course. The encoding is cp936.

jschnurr commented 6 years ago

Ok, same issue, but with the file. Try this:

import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)
stream = logging.FileHandler(filename='log.txt', mode='a', encoding='utf-8').stream
logging.basicConfig(
    stream=stream,
    format='%(levelname)s: %(message)s',
    level=logging.INFO
)

I would be very surprised if you don't then have the same issue on the console, however; if the issue persists, try disabling console logging.

NewUserHa commented 6 years ago

import logging
from scrapy.utils.log import configure_logging
stream = logging.FileHandler(filename='log.txt', mode='a', encoding='utf-8').stream
configure_logging(install_root_handler=False)
logging.basicConfig(
    level=logging.DEBUG
)

The code you posted above works incorrectly: it overrides Scrapy's logging configuration, and output goes only to the file.

I want Scrapy to output INFO to the console and DEBUG to the file.

I copied the lines in the question description from the Scrapy docs. How can they be wrong?
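What this comment asks for (INFO on the console, DEBUG in a UTF-8 file) needs two handlers rather than basicConfig alone. A minimal stdlib sketch, assuming a Scrapy project would first disable Scrapy's own root handler as shown earlier in the thread:

```python
import logging

# In the spider module, Scrapy's handler would be disabled first:
#   from scrapy.utils.log import configure_logging
#   configure_logging(install_root_handler=False)

root = logging.getLogger()
root.setLevel(logging.DEBUG)       # let the handlers do the filtering

console = logging.StreamHandler()  # stderr
console.setLevel(logging.INFO)     # INFO and above on the console

# Force UTF-8 on the file so characters outside the system codec
# (such as '\ufe0f') are written instead of raising UnicodeEncodeError.
logfile = logging.FileHandler('log.txt', mode='a', encoding='utf-8')
logfile.setLevel(logging.DEBUG)    # full DEBUG detail in the file

formatter = logging.Formatter('%(levelname)s: %(message)s')
console.setFormatter(formatter)
logfile.setFormatter(formatter)

root.addHandler(console)
root.addHandler(logfile)
```

With this setup, DEBUG records reach only the file, while INFO and above go to both destinations.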

jschnurr commented 6 years ago

Please post a full example that demonstrates the problem.

NewUserHa commented 6 years ago

sample:

import scrapy
import os
import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)
logging.basicConfig(
    filename='log.log',
    filemode='a',
    level=logging.DEBUG
    # format='%(levelname)s: %(message)s',
)

class example(scrapy.Spider):
    name = 'example'

    custom_settings = {
        'LOG_LEVEL': 'DEBUG',
    }

    start_urls = ['https://www.baidu.com']

    def parse(self, response):
        return {'date': '12-02', 'floor': '...\n 电\ufe0f'}

if __name__ == "__main__":
    os.environ["https_proxy"] = ''
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess()
    process.crawl(example)
    process.start()

As you can see, when the item is processed by Scrapy, the console shows the error

--- Logging error ---
Traceback (most recent call last):
  File "C:\Program Files\Python35\lib\logging\__init__.py", line 982, in emit
    stream.write(msg)
UnicodeEncodeError: 'gbk' codec can't encode character '\ufe0f' in position 87: illegal multibyte sequence
Call stack:
...
...
  File "C:\Program Files\Python35\lib\site-packages\scrapy\core\scraper.py", line 237, in _itemproc_finished
    logger.log(*logformatter_adapter(logkws), extra={'spider': spider})
Message: 'Scraped from %(src)s\r\n%(item)s'
Arguments: {'item': {'floor': '...\n 电\ufe0f', 'date': '12-02'}, 'src': <200 https://www.b...

in the std output, but the file has everything else except the DEBUG line for the item.

-- My cmd chcp is 936. I'm not familiar with logging or with how Scrapy implements it internally, but I think Scrapy's configure_logging settings may be at fault.

jschnurr commented 6 years ago

This code works for me, because my system is UTF-8. Go back through this thread again. The answers are there.

Your system locale does not support some of the scraped content, which the logger is attempting to log. Whether you write it to a file or to the console (or both), it will result in a UnicodeEncodeError. You have to tell the logger what encoding to use.

For the file, you can manually provide a stream to the logger with the encoding set. Make sure you define the stream with stream = logging.FileHandler(filename='log.txt', mode='a', encoding='utf-8').stream, and then pass it to the logger with stream=stream in the logging.basicConfig parameters. You missed the second step above.

For the console, if you get the same error, adjust your default character encoding on the system, or override it with io.TextIOWrapper as described above.
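The io.TextIOWrapper override for the console can be sketched like this; a BytesIO stands in for sys.stderr.buffer so the effect is visible (in a real spider you would wrap the actual stream, as shown earlier in the thread):

```python
import io
import logging

# Stand-in for sys.stderr.buffer; in a spider you would instead do:
#   sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding='utf-8')

handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))

log = logging.getLogger('demo')
log.propagate = False       # keep the demo output in our stream only
log.addHandler(handler)
log.warning('Scraped 电\ufe0f')  # would crash a gbk-encoded stream

stream.flush()
print(raw.getvalue().decode('utf-8'))
```

Because the wrapper encodes with UTF-8 instead of the locale's codec, the emit() call succeeds and the full item text reaches the stream.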