platonai / exotic-amazon

A complete solution to crawl amazon at scale completely and accurately.

Unable to get data when using a proxy #23

Open sskmtm opened 1 year ago

sskmtm commented 1 year ago

Without a proxy, the code on the main branch runs fine.

With a proxy, pages are never fetched correctly (even after a long run, no page is crawled successfully).

The crawl log always shows either ( 💯 🔃 S for RR got 200 2.64 KiB <- 2.64 KiB):

19:23:44.914 [r-worker-1] INFO  a.p.p.c.component.LoadComponent.Task -  99. 💯 🔃 S for RR got 200 2.64 KiB <- 2.64 KiB [💿4.40 KiB] in 8.891s, last fetched 9s ago, fc:31 | 2/3/0/0/756 | nf:3/3/3      | 115.234.228.146 | 1IIdXw62 | file:///var/folders/vr/_8xgwfn14959gb617jpn7gv40000gp/T/ln/1f6ede83881b702b4a3c5ffa9b01ef51.htm | https://www.amazon.com/s?k=sport+shoes -parse -refresh

or (💔 🔃 S for RR got 1601 2.64 KiB [💿4.40 KiB]):

20:16:37.104 [r-worker-4] INFO  a.p.p.c.component.LoadComponent.Task -  39. 💔 🔃 S for RR got 1601 2.64 KiB [💿4.40 KiB] in 1m8.373s, last fetched 1m9s ago, fc:1/42 Retry(1601) rs: Timeout to wait for document ready, rsp: CRAWL | 2/3/0/0/756 | nf:3/3/3      | 183.151.120.172 | 18qo0l66 | file:///var/folders/vr/_8xgwfn14959gb617jpn7gv40000gp/T/ln/1f6ede83881b702b4a3c5ffa9b01ef51.htm | https://www.amazon.com/s?k=sport+shoes -parse -refresh

Here, the crawled URL is https://www.amazon.com/s?k=sport+shoes and the load arguments are -parse -refresh.
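When comparing many such lines across proxies, it can help to pull out the status code, the proxy IP, and the target URL. The following is a minimal sketch, assuming the layout seen in the two log lines above (a `got <code>` token and `|`-separated trailing fields). `FetchLogEntry` and `parseFetchLog` are hypothetical helpers, not part of exotic-amazon or PulsarR, and the sample line is an abbreviated copy of the first log line with the file path field dropped.

```kotlin
// Hypothetical helper types; not part of exotic-amazon or PulsarR.
data class FetchLogEntry(val status: Int, val proxyIp: String?, val url: String?)

private val statusRegex = Regex("""got (\d+)""")
private val ipRegex = Regex("""\d{1,3}(?:\.\d{1,3}){3}""")
private val urlRegex = Regex("""https?://\S+""")

// Returns null when the line has no "got <code>" token.
fun parseFetchLog(line: String): FetchLogEntry? {
    val status = statusRegex.find(line)?.groupValues?.get(1)?.toIntOrNull() ?: return null
    // Assumption: fields after the summary are separated by '|'.
    val fields = line.split("|").map { it.trim() }
    // The proxy IP is taken to be the field that is entirely a dotted quad.
    val proxyIp = fields.firstOrNull { ipRegex.matches(it) }
    // The target URL is taken from the last field, before the load arguments.
    val url = urlRegex.find(fields.last())?.value
    return FetchLogEntry(status, proxyIp, url)
}

fun main() {
    // Abbreviated from the first log line above.
    val line = "19:23:44.914 [r-worker-1] INFO  a.p.p.c.component.LoadComponent.Task -  99. " +
        "💯 🔃 S for RR got 200 2.64 KiB <- 2.64 KiB [💿4.40 KiB] in 8.891s, fc:31 " +
        "| 2/3/0/0/756 | nf:3/3/3 | 115.234.228.146 | 1IIdXw62 " +
        "| https://www.amazon.com/s?k=sport+shoes -parse -refresh"
    println(parseFetchLog(line))
}
```

Grouping the parsed entries by `status` or `proxyIp` would show whether the 1601 retries are tied to particular proxies or occur on all of them.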

The fetched page:

[image]

I tested locally with the same URL, using a proxy in both cases: the old version fetches the page, while the new version shows the behavior above.

platonai commented 1 year ago

My previous comment was inaccurate, so I deleted it.

niudinlp commented 1 year ago

I'm running into the same problem. How can it be solved?

platonai commented 1 year ago

This is probably anti-crawling at work. If Amazon.com detects a brand-new browser that starts searching right away, it treats the visit as a crawler.

Workaround: visit a referer page in the onBrowserLaunched event. Opening it and closing it quickly is enough; the point is to let amazon.com see a plausible browsing history.

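        // Wrap the search URL so that browse events can be attached to it.
        val hyperlink = ListenableHyperlink(url)
        val be = hyperlink.event.browseEvent
        // When the browser has just launched, visit the Amazon home page first.
        be.onBrowserLaunched.addLast { page, driver ->
            val warmUpUrl = "https://www.amazon.com/"
            logger.info("Browser launched, warm up with url | {}", warmUpUrl)
            driver.navigateTo(warmUpUrl)
        }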
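
To make the intended visit order concrete, here is a self-contained toy model of the warm-up idea. It does not use PulsarR: `RecordingDriver` and `BrowserSession` are hypothetical stand-ins that only record which URLs would be opened, under the assumption that launch handlers run once, before the first fetch in a freshly launched browser.

```kotlin
// Hypothetical stand-ins; none of these types come from PulsarR.
class RecordingDriver {
    val visited = mutableListOf<String>()
    fun navigateTo(url: String) { visited += url }
}

class BrowserSession(private val driver: RecordingDriver) {
    private val launchHandlers = mutableListOf<(RecordingDriver) -> Unit>()
    private var launched = false

    fun addLaunchHandler(handler: (RecordingDriver) -> Unit) { launchHandlers += handler }

    fun fetch(url: String) {
        // Assumption: handlers fire once, when the browser is first launched.
        if (!launched) {
            launched = true
            launchHandlers.forEach { it(driver) }
        }
        driver.navigateTo(url)
    }
}

fun main() {
    val driver = RecordingDriver()
    val session = BrowserSession(driver)
    // Warm up with the home page so the search is not the first request.
    session.addLaunchHandler { it.navigateTo("https://www.amazon.com/") }
    session.fetch("https://www.amazon.com/s?k=sport+shoes")
    session.fetch("https://www.amazon.com/s?k=sport+shoes")
    println(driver.visited)
}
```

In this model the home page is visited exactly once, ahead of the first search request, which is the access pattern the workaround aims for.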