platonai / exotic-amazon

A complete solution to crawl amazon at scale completely and accurately.

cycle crawl product reviews fail #12

Open sskmtm opened 1 year ago

sskmtm commented 1 year ago

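I am crawling the content of the /product-reviews/... pages in a loop.

The first page is crawled normally, but crawling the second page goes wrong. The fetched file content is as follows:

How should this be resolved (no proxy is used)?

[image: screenshot of the fetched page content]
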
platonai commented 1 year ago

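Amazon does not accept direct visits to review pages.

1/ You must visit another page first.
2/ Ideally, first visit the page that contains the link to the review page.
3/ Another option to try: modify the pulsarr source code and add an API to WebDriver for modifying the request headers, then use it to add a referrer (`Referer`) header.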
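
Points 1/ and 2/ above can be sketched as a visit plan in which every review page is paired with the page that should be opened just before it. This is only a minimal sketch: `Visit` and `planReviewVisits` are hypothetical names, the `/dp/<ASIN>` and `?pageNumber=` URL patterns are assumptions about Amazon's URL layout, and none of this is taken from exotic-amazon or pulsarr.

```kotlin
// Hypothetical helper types and names, for illustration only.
// The URL patterns below are assumptions, not part of exotic-amazon or pulsarr.
data class Visit(val url: String, val referrer: String)

// Pair each review page with the page that should be visited right before it:
// the product page for page 1, and the previous review page for every later page.
fun planReviewVisits(
    asin: String,
    pages: Int,
    host: String = "https://www.amazon.com"
): List<Visit> {
    require(pages >= 1) { "pages must be at least 1" }
    val productUrl = "$host/dp/$asin"
    fun reviewUrl(page: Int) = "$host/product-reviews/$asin/?pageNumber=$page"
    return (1..pages).map { page ->
        val referrer = if (page == 1) productUrl else reviewUrl(page - 1)
        Visit(reviewUrl(page), referrer)
    }
}
```

Calling `planReviewVisits("B000000000", 3)` (a placeholder ASIN) yields three visits, where the second and third review pages each name the preceding review page as their referrer.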

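Reference: "Can selenium completely replace requests?" https://www.zhihu.com/question/361685508/answer/2738050570

Key information:

The most complex data collection projects can use the RPA mode:

The most complex data collection projects often require complex interaction with web pages, and for this we provide a simple yet powerful API. The following is a typical RPA code snippet, of the kind required to collect data from top e-commerce websites: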

val options = session.options(args)
val event = options.event.browseEvent
event.onBrowserLaunched.addLast { page, driver ->
    // warm up the browser to avoid being blocked by the website,
    // or choose the global settings, such as your location.
    warnUpBrowser(page, driver)
}
event.onWillFetch.addLast { page, driver ->
    // have to visit a referrer page before we can visit the desired page
    waitForReferrer(page, driver)
    // websites may prevent us from opening too many pages at a time, so we should open links one by one.
    waitForPreviousPage(page, driver)
}
event.onWillCheckDocumentState.addLast { page, driver ->
    // wait for a specific field to appear on the page
    driver.waitForSelector("body h1[itemprop=name]")
    // close the mask layer, it might be promotions, ads, or something else.
    driver.click(".mask-layer-close-button")
}
// visit the URL and trigger events
session.load(url, options)
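
The helpers called in the snippet above (`warnUpBrowser`, `waitForReferrer`, `waitForPreviousPage`) are not shown in this thread. As one possible reading of the "visit a referrer page first" step, the sketch below uses a hypothetical `PageDriver` interface and an in-memory `RecordingDriver`; neither is the pulsarr WebDriver API. In a real browser, the target would more likely be reached by following the link from within the referrer page, so that the browser itself sends the `Referer` header.

```kotlin
// Hypothetical minimal driver abstraction, for illustration only.
// This is NOT the pulsarr WebDriver API.
interface PageDriver {
    fun currentUrl(): String
    fun navigateTo(url: String)
}

// In-memory stand-in that records every navigation so the order can be inspected.
class RecordingDriver : PageDriver {
    val history = mutableListOf<String>()
    override fun currentUrl(): String = history.lastOrNull() ?: "about:blank"
    override fun navigateTo(url: String) {
        history.add(url)
    }
}

// Open the referrer page first (unless the driver is already on it),
// then open the target page.
fun visitWithReferrer(driver: PageDriver, referrer: String, target: String) {
    if (driver.currentUrl() != referrer) {
        driver.navigateTo(referrer)
    }
    driver.navigateTo(target)
}
```

When review pages are visited in order, each page serves as the referrer of the next one, so the extra navigation happens only before the first review page.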