howie6879 / ruia

Async Python 3.6+ web scraping micro-framework based on asyncio
https://www.howie6879.com/ruia/
Apache License 2.0
1.75k stars 181 forks

RegexField.extract() garbles Chinese characters #109

Closed fengdongfa1995 closed 4 years ago

fengdongfa1995 commented 4 years ago

When `RegexField.extract()` receives an `etree._Element` object, it converts it to a string. The current conversion does not handle Chinese characters correctly and turns them into mojibake.

The following snippet appears to work correctly:

if isinstance(html, etree._Element):
    html = etree.tostring(html, encoding='utf-8', pretty_print=True, method='html').decode()

source: http://blog.sina.com.cn/s/blog_9e103b930102x1jx.html
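The root cause is easy to reproduce outside Ruia: lxml's `etree.tostring` serializes to ASCII by default, escaping any non-ASCII character as a numeric character reference. A minimal standalone demonstration:

```python
from lxml import etree

root = etree.fromstring('<span class="title">肖申克的救赎</span>')

# Default serialization targets ASCII, so Chinese characters are escaped
# into numeric character references such as &#32918;.
ascii_out = etree.tostring(root).decode()

# Passing an explicit encoding keeps the characters intact.
utf8_out = etree.tostring(root, encoding="utf-8").decode("utf-8")
```

Any regex written against the original Chinese text will therefore fail to match the default serialization.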

Minimal reproduction: change the Item definition in examples/douban_spider.py to:

target_item = TextField(xpath_select='//div[@class="info"]')
title_xpath = TextField(xpath_select='.//span[@class="title"]')
title_regex = RegexField(re_select=r'<span class="title">(.*?)</span>', re_flags=re.S)

title_xpath prints the Chinese text correctly, but title_regex prints mojibake:

[2020-04-19 22:21:45] INFO  DoubanSpider <Item {'title': '肖申克的救赎'}>  
[2020-04-19 22:21:45] INFO  DoubanSpider <Item {'title': '霸王别姬'}> 
[2020-04-19 22:22:57] INFO  DoubanSpider <Item {'title': '&#32918;&#30003;&#20811;&#30340;&#25937;&#36174;'}>
[2020-04-19 22:22:57] INFO  DoubanSpider <Item {'title': '&#38712;&#29579;&#21035;&#23020;'}>
howie6879 commented 4 years ago

Got it. I'll test this when I have time, and merge the PR if everything checks out. Feel free to keep testing in detail on your side as well.

fengdongfa1995 commented 4 years ago

I noticed that `HtmlField` in field.py also converts an `etree._Element` object to a string, so I tested it as well. Its source is:

class HtmlField(_LxmlElementField):
    """
    This field is used to get raw html data.
    """
    def _parse_element(self, element):
        return etree.tostring(element, encoding="utf-8").decode(encoding="utf-8")

Compared with `RegexField`, it passes an extra `encoding` argument.
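For reference, the fix proposed above can be sketched as a standalone helper (the name `element_to_html` is hypothetical; this is not Ruia's actual code):

```python
from lxml import etree

def element_to_html(html):
    """Convert an etree._Element to str without garbling non-ASCII text.

    Hypothetical helper illustrating the proposed RegexField fix.
    """
    if isinstance(html, etree._Element):
        # encoding='utf-8' keeps Chinese characters as-is instead of
        # escaping them; method='html' serializes as HTML rather than XML.
        html = etree.tostring(
            html, encoding="utf-8", pretty_print=True, method="html"
        ).decode()
    return html
```

Non-element inputs pass through unchanged, matching how `RegexField` already accepts plain strings.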

My test code is adapted from douban_spider.py; the full script is:

import re
from ruia import Item, Spider, TextField, RegexField, HtmlField

# Extract the same field in three different ways
class DoubanItem(Item):
    title_xpath = TextField(xpath_select="//div[@class='hd']//span[@class='title']")
    title_html = HtmlField(xpath_select="//div[@class='hd']//span[@class='title']")
    title_regex = RegexField(r'<div class="hd">.*?<span class="title">(.*?)</span>', re_flags=re.S)

class DoubanSpider(Spider):
    name = "DoubanSpider"
    start_urls = ["https://movie.douban.com/top250"]

    async def parse(self, response):
        yield await DoubanItem.get_item(html=response.html)

    async def process_item(self, item: DoubanItem):
        self.logger.info(item)

if __name__ == "__main__":
    DoubanSpider.start()

The output is:

<Item {'title_regex': '&#32918;&#30003;&#20811;&#30340;&#25937;&#36174;', 'title_xpath': ' 
肖申克的救赎', 'title_html': '<span class="title">肖申克的救赎</span>\n

The regex output is clearly garbled. After patching the source to add `encoding="utf-8"` to the `tostring` call inside `RegexField`, the program outputs:

<Item {'title_xpath': '肖申克的救赎', 'title_regex': '肖申克的救赎', 'title_html': '<span class="title">肖申克的救赎</span>\n                                    '}>
[2020-04-20 13:54:01] INFO  DoubanSpider Stopping spider: DoubanSpider
[2020-04-20 13:54:01] INFO  DoubanSpider Total requests: 1
[2020-04-20 13:54:01] INFO  DoubanSpider Time usage: 0:00:00.397010
[2020-04-20 13:54:01] INFO  DoubanSpider Spider finished!
Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x00000161C01F31F0>
Traceback (most recent call last):
  File "d:\Anaconda3\envs\crawlers\lib\asyncio\proactor_events.py", line 116, in __del__
    self.close()
  File "d:\Anaconda3\envs\crawlers\lib\asyncio\proactor_events.py", line 108, in close
    self._loop.call_soon(self._call_connection_lost, None)
  File "d:\Anaconda3\envs\crawlers\lib\asyncio\base_events.py", line 719, in call_soon
    self._check_closed()
  File "d:\Anaconda3\envs\crawlers\lib\asyncio\base_events.py", line 508, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

The program now produces the correct result, but raises the error above; I'm not sure why. Adding `method="html"` to the `tostring` arguments makes the error go away, and removing it brings the error back.

Since I'm on a Windows 10 machine, I suspected the OS might be the cause. On my other machine, which runs Arch Linux, omitting `encoding='utf-8'` still produces garbled output, but omitting `method='html'` does not trigger the error above. So let's keep both parameters, since I mostly use the Windows machine anyway.
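The "Event loop is closed" RuntimeError at shutdown is a known quirk of asyncio's ProactorEventLoop on Windows during interpreter teardown, not something specific to Ruia. Independent of the `method='html'` observation, a commonly used mitigation on Python 3.8+ is to switch to the selector-based loop before starting the spider:

```python
import asyncio
import sys

# On Windows, Python 3.8+ defaults to the ProactorEventLoop, whose pipe
# transports can raise "RuntimeError: Event loop is closed" while being
# garbage-collected at interpreter shutdown. Switching to the selector
# event loop is a common mitigation (this branch is a no-op elsewhere).
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
```

This would go at the top of the spider script, before `DoubanSpider.start()` is called.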

howie6879 commented 4 years ago

Merged.