howie6879 / ruia

Async Python 3.6+ web scraping micro-framework based on asyncio
https://www.howie6879.com/ruia/
Apache License 2.0
1.75k stars 181 forks

RegexField.extract() garbles Chinese characters #109

Closed fengdongfa1995 closed 4 years ago

fengdongfa1995 commented 4 years ago

When `RegexField.extract()` receives an `etree._Element` object, it converts it to a string. The current conversion does not handle Chinese characters correctly and turns them into mojibake.

The following snippet appears to work correctly:

if isinstance(html, etree._Element):
    html = etree.tostring(html, encoding='utf-8', pretty_print=True, method='html').decode()

source: http://blog.sina.com.cn/s/blog_9e103b930102x1jx.html
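The root cause is easy to reproduce outside Ruia: lxml's `etree.tostring` serializes to ASCII by default, escaping any non-ASCII character as a numeric character reference. A minimal standalone demonstration:

```python
from lxml import etree

root = etree.fromstring('<span class="title">肖申克的救赎</span>')

# Default serialization targets ASCII, so Chinese characters are escaped
# into numeric character references such as &#32918;.
ascii_out = etree.tostring(root).decode()

# Passing an explicit encoding keeps the characters intact.
utf8_out = etree.tostring(root, encoding="utf-8").decode("utf-8")
```

Any regex written against the original Chinese text will therefore fail to match the default serialization.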

Minimal reproduction: change the Item definition in examples/douban_spider.py to:

target_item = TextField(xpath_select='//div[@class="info"]')
title_xpath = TextField(xpath_select='.//span[@class="title"]')
title_regex = RegexField(re_select=r'<span class="title">(.*?)</span>', re_flags=re.S)

title_xpath prints the Chinese text correctly, but title_regex prints mojibake:

[2020-04-19 22:21:45] INFO  DoubanSpider <Item {'title': '肖申克的救赎'}>  
[2020-04-19 22:21:45] INFO  DoubanSpider <Item {'title': '霸王别姬'}> 
[2020-04-19 22:22:57] INFO  DoubanSpider <Item {'title': '&#32918;&#30003;&#20811;&#30340;&#25937;&#36174;'}>
[2020-04-19 22:22:57] INFO  DoubanSpider <Item {'title': '&#38712;&#29579;&#21035;&#23020;'}>
howie6879 commented 4 years ago

Got it. I'll test this when I have time, and merge the PR if everything checks out. Feel free to keep testing in detail on your side as well.

fengdongfa1995 commented 4 years ago

I noticed that `HtmlField` in field.py also converts an `etree._Element` object to a string, so I tested it as well. Its source is:

class HtmlField(_LxmlElementField):
    """
    This field is used to get raw html data.
    """
    def _parse_element(self, element):
        return etree.tostring(element, encoding="utf-8").decode(encoding="utf-8")

Compared with `RegexField`, it passes an extra `encoding` argument.
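For reference, the fix proposed above can be sketched as a standalone helper (the name `element_to_html` is hypothetical; this is not Ruia's actual code):

```python
from lxml import etree

def element_to_html(html):
    """Convert an etree._Element to str without garbling non-ASCII text.

    Hypothetical helper illustrating the proposed RegexField fix.
    """
    if isinstance(html, etree._Element):
        # encoding='utf-8' keeps Chinese characters as-is instead of
        # escaping them; method='html' serializes as HTML rather than XML.
        html = etree.tostring(
            html, encoding="utf-8", pretty_print=True, method="html"
        ).decode()
    return html
```

Non-element inputs pass through unchanged, matching how `RegexField` already accepts plain strings.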

My test code is adapted from douban_spider.py; the full script is:

import re
from ruia import Item, Spider, TextField, RegexField, HtmlField

# Extract the same field in three different ways
class DoubanItem(Item):
    title_xpath = TextField(xpath_select="//div[@class='hd']//span[@class='title']")
    title_html = HtmlField(xpath_select="//div[@class='hd']//span[@class='title']")
    title_regex = RegexField(r'<div class="hd">.*?<span class="title">(.*?)</span>', re_flags=re.S)

class DoubanSpider(Spider):
    name = "DoubanSpider"
    start_urls = ["https://movie.douban.com/top250"]

    async def parse(self, response):
        yield await DoubanItem.get_item(html=response.html)

    async def process_item(self, item: DoubanItem):
        self.logger.info(item)

if __name__ == "__main__":
    DoubanSpider.start()

The output is:

<Item {'title_regex': '&#32918;&#30003;&#20811;&#30340;&#25937;&#36174;', 'title_xpath': ' 
肖申克的救赎', 'title_html': '<span class="title">肖申克的救赎</span>\n

The regex output is clearly garbled. After patching the source to add `encoding="utf-8"` to the `tostring` call inside `RegexField`, the program outputs:

<Item {'title_xpath': '肖申克的救赎', 'title_regex': '肖申克的救赎', 'title_html': '<span class="title">肖申克的救赎</span>\n                                    '}>
[2020-04-20 13:54:01] INFO  DoubanSpider Stopping spider: DoubanSpider
[2020-04-20 13:54:01] INFO  DoubanSpider Total requests: 1
[2020-04-20 13:54:01] INFO  DoubanSpider Time usage: 0:00:00.397010
[2020-04-20 13:54:01] INFO  DoubanSpider Spider finished!
Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x00000161C01F31F0>
Traceback (most recent call last):
  File "d:\Anaconda3\envs\crawlers\lib\asyncio\proactor_events.py", line 116, in __del__
    self.close()
  File "d:\Anaconda3\envs\crawlers\lib\asyncio\proactor_events.py", line 108, in close
    self._loop.call_soon(self._call_connection_lost, None)
  File "d:\Anaconda3\envs\crawlers\lib\asyncio\base_events.py", line 719, in call_soon
    self._check_closed()
  File "d:\Anaconda3\envs\crawlers\lib\asyncio\base_events.py", line 508, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

The program now produces the correct result, but raises the error above; I'm not sure why. Adding `method="html"` to the `tostring` arguments makes the error go away, and removing it brings the error back.

Since I'm on a Windows 10 machine, I suspected the OS might be the cause. On my other machine, which runs Arch Linux, omitting `encoding='utf-8'` still produces garbled output, but omitting `method='html'` does not trigger the error above. So let's keep both parameters, since I mostly use the Windows machine anyway.
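The "Event loop is closed" RuntimeError at shutdown is a known quirk of asyncio's ProactorEventLoop on Windows during interpreter teardown, not something specific to Ruia. Independent of the `method='html'` observation, a commonly used mitigation on Python 3.8+ is to switch to the selector-based loop before starting the spider:

```python
import asyncio
import sys

# On Windows, Python 3.8+ defaults to the ProactorEventLoop, whose pipe
# transports can raise "RuntimeError: Event loop is closed" while being
# garbage-collected at interpreter shutdown. Switching to the selector
# event loop is a common mitigation (this branch is a no-op elsewhere).
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
```

This would go at the top of the spider script, before `DoubanSpider.start()` is called.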

howie6879 commented 4 years ago

Merged.