Closed: NewUserHa closed this issue 6 years ago.
It looks like the logger is throwing UnicodeEncodeError when trying to print your scraped data to the console log. There are some bytes in your scraped data that cannot be encoded with the gbk codec, which I presume is your system default.
I get a similar result (even without the file logging), but I have to force the encoding, since my system uses UTF-8:
import sys, io, logging
from scrapy.utils.log import configure_logging
from scrapy.spiders import Spider

# override the encoding for the logging streamhandler
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='gbk')

class PythonSpider(Spider):
    name = 'myspider'
    start_urls = ['http://www.python.org']

    def parse(self, response):
        return {'name': '\ufe0f'}
scrapy runspider test.py
--- Logging error ---
Traceback (most recent call last):
  File "/usr/lib/python3.5/logging/__init__.py", line 983, in emit
    stream.write(msg)
UnicodeEncodeError: 'gbk' codec can't encode character '\ufe0f' in position 104: illegal multibyte sequence
In this case, scrapy is trying to log the item {'name': '\ufe0f'}, which cannot be encoded by the default STDERR streamhandler, which is set to gbk.
If you run python -c 'import locale; print(locale.getpreferredencoding())', you can see the encoding the logger will try to use.
You can work around the problem by encoding with utf-8 in the spider itself:
return {'name': '\ufe0f'.encode('utf-8')}
Perhaps others have thoughts on the best place for a fix.
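As an aside, the mismatch can be reproduced without scrapy at all. A minimal sketch showing that the character from the traceback simply has no gbk representation, while utf-8 handles it:

```python
# U+FE0F (VARIATION SELECTOR-16) is the character from the traceback above.
# It has no mapping in the gbk codec, so encoding it raises UnicodeEncodeError;
# utf-8 can represent any codepoint, so the same text encodes fine.
text = '\ufe0f'

try:
    text.encode('gbk')
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

print(gbk_ok)                 # False: gbk cannot encode it
print(text.encode('utf-8'))   # b'\xef\xb8\x8f'
```

This is why the error appears on any gbk/cp936-encoded sink (console or file) that the logger writes the item to.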
My console displays fine; the error happens when writing to the file. There is filename='log.txt' in my code above.
I have many items extracted from HTML; I don't think I can add .encode() to every line. Everything works fine except the new logging to a file at a different log level in addition to the console/std output. (And I don't know their encoding. Maybe they are just Unicode codepoints?)
Is the error displaying on the console or in the log.txt file? What is the output of python -c 'import locale; print(locale.getpreferredencoding())' on your system?
Displaying on the console, of course. cp936.
Ok, same issue, but with the file. Try this:
import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)
stream = logging.FileHandler(filename='log.txt', mode='a', encoding='utf-8').stream
logging.basicConfig(
    stream=stream,
    format='%(levelname)s: %(message)s',
    level=logging.INFO
)
I would be very surprised if you don't have the same issue on the console then, however - so if the issue persists, try disabling console logging.
import logging
from scrapy.utils.log import configure_logging

stream = logging.FileHandler(filename='log.txt', mode='a', encoding='utf-8').stream
configure_logging(install_root_handler=False)
logging.basicConfig(
    level=logging.DEBUG
)
The code you posted above works incorrectly: it overrides scrapy's logging configuration, and it only outputs to the file.
I want scrapy to output INFO to std and DEBUG to the file.
I copied the lines in the question description from the scrapy docs; how can they be wrong?
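For reference, the split being asked for here (INFO to the console, DEBUG to a file) can be sketched with two stdlib logging handlers at different levels. This is plain Python logging rather than a scrapy-specific recipe; the 'log.txt' name and format string are just the ones used elsewhere in this thread:

```python
import logging
import sys

root = logging.getLogger()
root.setLevel(logging.DEBUG)  # root must allow the lowest level any handler needs

# File handler: DEBUG and up, written as utf-8 so any scraped text is encodable.
file_handler = logging.FileHandler('log.txt', mode='a', encoding='utf-8')
file_handler.setLevel(logging.DEBUG)
file_handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))
root.addHandler(file_handler)

# Console handler: INFO and up only.
console_handler = logging.StreamHandler(sys.stderr)
console_handler.setLevel(logging.INFO)
console_handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))
root.addHandler(console_handler)

root.debug('goes to the file only')
root.info('goes to both')
```

With scrapy, this would be combined with configure_logging(install_root_handler=False), as in the snippets above, so scrapy's own root handler does not also attach.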
Please post a full example that demonstrates the problem.
sample:
import scrapy
import os
import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)
logging.basicConfig(
    filename='log.log',
    filemode='a',
    level=logging.DEBUG
    # format='%(levelname)s: %(message)s',
)

class example(scrapy.Spider):
    name = 'example'
    custom_settings = {
        'LOG_LEVEL': 'DEBUG',
    }
    start_urls = ['https://www.baidu.com']

    def parse(self, response):
        return {'date': '12-02', 'floor': '...\n 电\ufe0f'}

if __name__ == "__main__":
    os.environ["https_proxy"] = ''
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess()
    process.crawl(example)
    process.start()
As you can see, when the item is processed by scrapy, the console shows the error:
--- Logging error ---
Traceback (most recent call last):
  File "C:\Program Files\Python35\lib\logging\__init__.py", line 982, in emit
    stream.write(msg)
UnicodeEncodeError: 'gbk' codec can't encode character '\ufe0f' in position 87: illegal multibyte sequence
Call stack:
  ...
  ...
  File "C:\Program Files\Python35\lib\site-packages\scrapy\core\scraper.py", line 237, in _itemproc_finished
    logger.log(*logformatter_adapter(logkws), extra={'spider': spider})
Message: 'Scraped from %(src)s\r\n%(item)s'
Arguments: {'item': {'floor': '...\n 电\ufe0f', 'date': '12-02'}, 'src': <200 https://www.b...
in the std output, while the file has everything else except the DEBUG info of the item.
-- My cmd chcp is 936. I'm not familiar with logging or with how scrapy implements logging internally, but I think scrapy's configure_logging settings may be at fault.
This code works for me, because my system is UTF-8. Go back through this thread again; the answers are there.
Your system locale does not support some of the scraped content, which the logger is attempting to log. Whether you write it to a file or to the console (or both), it will result in a UnicodeEncodeError. You have to tell the logger what encoding to use.
For the file, you can manually provide a stream to the logger with the encoding set. Make sure you define the stream with stream = FileHandler(filename='log.txt', mode='a', encoding='utf-8').stream, and then provide it to the logger with stream=stream in the logging.basicConfig parameters. You missed the second step above.
For the console, if you get the same error, adjust your default character encoding on the system, or override it with io.TextIOWrapper as described above.
I added the following from the scrapy documents below my class definition:

import logging
from scrapy.utils.log import configure_logging

configure_logging(install_root_handler=False)
logging.basicConfig(
    filename='log.txt',
    format='%(levelname)s: %(message)s',
    level=logging.INFO
)

Then:

--- Logging error ---
Traceback (most recent call last):
  File "C:\Program Files\Python35\lib\logging\__init__.py", line 982, in emit
    stream.write(msg)
UnicodeEncodeError: 'gbk' codec can't encode character '\ufe0f' in position 190: illegal multibyte sequence
Call stack:
  ....
  File "C:\Program Files\Python35\lib\site-packages\scrapy\core\scraper.py", line 237, in _itemproc_finished
    logger.log(*logformatter_adapter(logkws), extra={'spider': spider})
Message: 'Scraped from %(src)s\r\n%(item)s'
Arguments: {'src': <200 http://...>, 'item': {'date': '12-02', 'floor': '...\n 电\ufe0f', 'pics': ['...', ...]}}
I also have 'LOG_LEVEL': 'INFO' in the spider.
I googled and have no idea how to fix this. Any help, please? Thanks!