python-ruia / ruia-pyppeteer

A Ruia plugin for loading javascript - pyppeteer
MIT License

AttributeError: 'PyppeteerResponse' object has no attribute 'html' #11

Closed. lucays closed this issue 2 years ago.

lucays commented 2 years ago

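I created a new file and ran the example code unchanged. The response.html step (the part selected in the screenshot) raises: AttributeError: 'PyppeteerResponse' object has no attribute 'html'

[screenshot]

Debugging confirms that the attribute really does not exist. [screenshot]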

howie6879 commented 2 years ago

I probably have not updated the docs. Try await response.text() instead.

lucays commented 2 years ago

> I probably have not updated the docs. Try await response.text() instead.

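Hmm, that does not work either. The item's _get_html() method needs this value to be a str, but neither .text nor .text() is one. Should I pass html=await response.text() inside the parentheses instead? Even that raises an error: pyppeteer.errors.NetworkError: Protocol Error (Network.getResponseBody): Session closed. Most likely the page has been closed. This one may just be an issue in pyppeteer itself. [screenshot]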

howie6879 commented 2 years ago

I'll debug it tomorrow.

howie6879 commented 2 years ago

@lucays Fixed:

pip install ruia-pyppeteer==0.0.8

Code:

from ruia import AttrField, Item, TextField

from ruia_pyppeteer import PyppeteerSpider as Spider

class JianshuItem(Item):
    target_item = TextField(css_select="ul.list>li")
    author_name = TextField(css_select="a.name")
    author_url = AttrField(attr="href", css_select="a.name")

    async def clean_author_name(self, author_name):
        return author_name.strip()

    async def clean_author_url(self, author_url):
        return f"https://www.jianshu.com{author_url}"

class JianshuSpider(Spider):
    start_urls = ["https://www.jianshu.com/"]
    concurrency = 10

    async def parse(self, response):
        html = await response.page.content()
        async for item in JianshuItem.get_items(html=html):
            # Loading js by using PyppeteerRequest
            print(item)
        await response.browser.close()

if __name__ == "__main__":
    JianshuSpider.start()

Output:

[screenshot]
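
The fixed parse() reads the HTML from response.page before it closes the browser. The sketch below illustrates why that order can matter. It uses stand-in classes (FakePage and FakeBrowser are hypothetical, not the pyppeteer API) and assumes that a closed page can no longer serve its content. Whether this ordering was the cause of the NetworkError reported earlier is not confirmed in this thread.

```python
import asyncio


class FakePage:
    """Stand-in for a browser page; NOT the pyppeteer API."""

    def __init__(self, html):
        self._html = html
        self.closed = False

    async def content(self):
        # Assumption: a closed page cannot return its content any more.
        if self.closed:
            raise RuntimeError("Session closed. Most likely the page has been closed.")
        return self._html


class FakeBrowser:
    """Stand-in for a browser that owns a single page."""

    def __init__(self, page):
        self.page = page

    async def close(self):
        self.page.closed = True


async def read_then_close(browser):
    # Same order as the fixed parse(): fetch the HTML, then close.
    html = await browser.page.content()
    await browser.close()
    return html


async def close_then_read(browser):
    # Reversed order: the page is closed before its content is requested.
    await browser.close()
    try:
        return await browser.page.content()
    except RuntimeError as exc:
        return f"error: {exc}"


def demo():
    html = "<ul class='list'></ul>"
    good = asyncio.run(read_then_close(FakeBrowser(FakePage(html))))
    bad = asyncio.run(close_then_read(FakeBrowser(FakePage(html))))
    return good, bad
```

Calling demo() returns the HTML string for the first ordering and an "error: ..." string for the second, which mirrors the shape of the failure without involving a real browser.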

lucays commented 2 years ago

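I tested it and it is indeed fixed, thank you very much! One small issue: 0.0.8 has not been uploaded to PyPI yet, so for now it has to be installed with pip install git+https://github.com/ruia-plugins/ruia-pyppeteer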

Also, would it be possible to avoid calling close manually? Support for with would be even better.

howie6879 commented 2 years ago

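> One small issue: 0.0.8 has not been uploaded to PyPI yet

It has already been uploaded. You are probably using a PyPI mirror in China.

> avoid calling close manually ... support for with

That does not fit real usage conditions.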
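
For readers wondering what the with-style cleanup requested above might look like, here is a minimal sketch. closing_browser is a hypothetical helper and is not part of ruia-pyppeteer; the Stub* classes stand in for the real response so the example runs without a browser. It assumes only that the response exposes browser.close() and page.content(), as in the fixed example, and it does not address the maintainer's objection about real usage.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def closing_browser(response):
    """Yield the response and always close response.browser afterwards.

    Hypothetical helper for illustration; not part of ruia-pyppeteer.
    """
    try:
        yield response
    finally:
        # Runs even if the body of the `async with` block raises.
        await response.browser.close()


class StubBrowser:
    """Stand-in browser that only records whether close() was awaited."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


class StubPage:
    """Stand-in page returning fixed HTML."""

    async def content(self):
        return "<html></html>"


class StubResponse:
    """Stand-in with the two attributes the fixed example relies on."""

    def __init__(self):
        self.browser = StubBrowser()
        self.page = StubPage()


async def parse(response):
    # The browser is closed on exit from the block, with no explicit call.
    async with closing_browser(response) as resp:
        html = await resp.page.content()
    return html


def demo():
    response = StubResponse()
    html = asyncio.run(parse(response))
    return html, response.browser.closed
```

In a real spider, parse() would pass its response to the helper in the same way. Whether closing the browser at the end of every parse() call is desirable depends on how requests share a browser, which this sketch does not model.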