Unable to get data when using a proxy

sskmtm commented 1 year ago

Without a proxy, the code on the main branch runs fine.

With a proxy, pages are never fetched correctly (even after a long run, no page is crawled successfully).

The crawl log always shows (💯 🔃 S for RR got 200 2.64 KiB <- 2.64 KiB):

19:23:44.914 [r-worker-1] INFO  a.p.p.c.component.LoadComponent.Task -  99. 💯 🔃 S for RR got 200 2.64 KiB <- 2.64 KiB [💿4.40 KiB] in 8.891s, last fetched 9s ago, fc:31 | 2/3/0/0/756 | nf:3/3/3      | 115.234.228.146 | 1IIdXw62 | file:///var/folders/vr/_8xgwfn14959gb617jpn7gv40000gp/T/ln/1f6ede83881b702b4a3c5ffa9b01ef51.htm | https://www.amazon.com/s?k=sport+shoes -parse -refresh

Or (💔 🔃 S for RR got 1601 2.64 KiB [💿4.40 KiB]):

20:16:37.104 [r-worker-4] INFO  a.p.p.c.component.LoadComponent.Task -  39. 💔 🔃 S for RR got 1601 2.64 KiB [💿4.40 KiB] in 1m8.373s, last fetched 1m9s ago, fc:1/42 Retry(1601) rs: Timeout to wait for document ready, rsp: CRAWL | 2/3/0/0/756 | nf:3/3/3      | 183.151.120.172 | 18qo0l66 | file:///var/folders/vr/_8xgwfn14959gb617jpn7gv40000gp/T/ln/1f6ede83881b702b4a3c5ffa9b01ef51.htm | https://www.amazon.com/s?k=sport+shoes -parse -refresh

Here the crawled link is https://www.amazon.com/s?k=sport+shoes and the load arguments are -parse -refresh

The fetched page:

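I tested this locally with the same link, using a proxy in both cases: the old version fetches the page, while the new version shows the behavior above.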

platonai commented 1 year ago

My previous comment was inaccurate, so I deleted it.

niudinlp commented 1 year ago

I'm running into the same problem. How can it be fixed?

platonai commented 1 year ago

This is probably anti-crawling protection. If Amazon.com detects a brand-new browser that starts searching right away, it treats the visit as a crawler.

Solution: visit a referer page in the onBrowserLaunched event, opening and closing it quickly, so that amazon.com sees a plausible browsing path.

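        // Wrap the target url so that browse events can be attached to it
        val hyperlink = ListenableHyperlink(url)
        val be = hyperlink.event.browseEvent
        // Runs once the browser has launched, before the target page is fetched
        be.onBrowserLaunched.addLast { page, driver ->
            // Open the home page first so the search request is not the browser's first visit
            val warmUpUrl = "https://www.amazon.com/"
            logger.info("Browser launched, warm up with url | {}", warmUpUrl)
            driver.navigateTo(warmUpUrl)
        }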
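
To check whether the warm-up helps, the two log patterns quoted earlier in this thread can be detected mechanically. The following is only a minimal sketch: `FetchSummary`, `summarize`, and `looksBlocked` are hypothetical helpers, not part of PulsarRPA, and the 100 KiB threshold is an assumption (a real search page is presumably much larger than the 2.64 KiB seen in the logs). It only handles sizes reported in KiB.

```kotlin
// Hypothetical helper types, not part of PulsarRPA.
data class FetchSummary(val status: Int, val sizeKiB: Double)

// Matches the "got <status> <size> KiB" fragment of a LoadComponent log line.
private val GOT = Regex("""got (\d+) ([\d.]+) KiB""")

// Extracts the status code and page size, or returns null if the line has no such fragment.
fun summarize(logLine: String): FetchSummary? {
    val m = GOT.find(logLine) ?: return null
    return FetchSummary(m.groupValues[1].toInt(), m.groupValues[2].toDouble())
}

// Assumption: a non-200 status (such as the 1601 retry) or a very small
// page both suggest that the real content was not fetched.
fun looksBlocked(s: FetchSummary, minKiB: Double = 100.0): Boolean =
    s.status != 200 || s.sizeKiB < minKiB

fun main() {
    val line = "99. 💯 🔃 S for RR got 200 2.64 KiB <- 2.64 KiB [💿4.40 KiB] in 8.891s"
    val summary = summarize(line)
    println(summary)
    println(summary?.let { looksBlocked(it) })
}
```

If `looksBlocked` stops firing on the same url after the warm-up is added, that is a hint (not proof) that the anti-crawling check was the cause.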

platonai / exotic-amazon

Unable to get data when using a proxy #23