platonai / exotic-amazon

A complete solution to crawl amazon at scale completely and accurately.

503 errors when crawling amazon.com in single-resource mode #29

Open swlcyx opened 1 year ago

swlcyx commented 1 year ago

14:29:19.835 [r-worker-9] INFO a.p.p.c.component.LoadComponent.Task - 29745. 💔 ⚡ U for N got 1462 0 <- 0 in 1.292s, fc:1/1 Exception(1462) httpCode: 503 | 5ZwMFW35 | search_keywords | https://www.amazon.com/s?k=teething+toy+for+dogs&language=en_US&page=2 -expires PT24H -ignoreFailure -isResource -label search_keywords -parse -refresh -requireImages 50 -requireSize 3000000

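When I crawl through a proxy, the proxy service reports an expiry time half an hour away, but after a few minutes of crawling the proxy is very likely to be detected by Amazon, which then keeps returning 503 pages. When this happens, roughly 60% of the links produce the INFO line above. Two questions: are these 503 pages dropped, or are they retried? And is a proxy only replaced once it expires, or is it replaced after it has returned 503 several times?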

swlcyx commented 1 year ago

org.jsoup.UncheckedIOException: java.net.SocketTimeoutException: Read timeout
    at org.jsoup.helper.HttpConnection$Response.prepareByteData(HttpConnection.java:977)
    at org.jsoup.helper.HttpConnection$Response.body(HttpConnection.java:986)
    at ai.platon.pulsar.protocol.browser.emulator.impl.InteractiveBrowserEmulator.loadResourceWithoutRendering(InteractiveBrowserEmulator.kt:248)
    at ai.platon.pulsar.protocol.browser.emulator.impl.InteractiveBrowserEmulator.access$loadResourceWithoutRendering(InteractiveBrowserEmulator.kt:37)
    at ai.platon.pulsar.protocol.browser.emulator.impl.InteractiveBrowserEmulator$loadResourceWithoutRendering$1.invokeSuspend(InteractiveBrowserEmulator.kt)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:571)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:750)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:678)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:665)
Caused by: java.net.SocketTimeoutException: Read timeout
    at org.jsoup.internal.ConstrainableInputStream.read(ConstrainableInputStream.java:58)
    at java.base/java.io.FilterInputStream.read(FilterInputStream.java:107)
    at org.jsoup.internal.ConstrainableInputStream.readToByteBuffer(ConstrainableInputStream.java:87)
    at org.jsoup.helper.DataUtil.readToByteBuffer(DataUtil.java:250)
    at org.jsoup.helper.HttpConnection$Response.prepareByteData(HttpConnection.java:975)
    ... 10 common frames omitted

This error is thrown, and as a result the proxy cannot be used.

platonai commented 1 year ago

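The -resource parameter activates single-resource mode. This mode is only suitable for fetching a single resource, such as a static web page, a JSON file, or an API.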

For a mature site such as amazon, we do not recommend single-resource mode, so running into all kinds of unexpected problems is to be expected.

For details, see: https://www.zhihu.com/answer/2738050570
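
One way to follow the advice above is to remove the single-resource flag from the load arguments before the URL is submitted. The following is a minimal sketch in plain Java, not part of the project's API: the class `LoadArgs` and the method `withoutResourceMode` are hypothetical names, and it assumes that the load arguments form a whitespace-separated string like the one in the log line above and that `-resource` and `-isResource` are value-less switches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the crawler's API.
final class LoadArgs {
    // Assumption: both spellings switch on single-resource mode and take no value.
    // `-resource` is the name used in the reply; `-isResource` appears in the logged arguments.
    private static final Set<String> RESOURCE_FLAGS = Set.of("-resource", "-isResource");

    private LoadArgs() {}

    /** Returns the argument string with the single-resource flags removed. */
    static String withoutResourceMode(String args) {
        List<String> kept = new ArrayList<>();
        // Split on runs of whitespace and keep every token that is not a resource flag.
        for (String token : args.trim().split("\\s+")) {
            if (!token.isEmpty() && !RESOURCE_FLAGS.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] argv) {
        // Load arguments copied from the log line in the first comment.
        String logged = "-expires PT24H -ignoreFailure -isResource -label search_keywords "
                + "-parse -refresh -requireImages 50 -requireSize 3000000";
        System.out.println(withoutResourceMode(logged));
    }
}
```

The filter works token by token, so it would also drop an option value that happens to be spelled exactly like one of the flags. All other options and their values are kept in their original order.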