platonai / exotic-amazon

A complete solution to crawl Amazon at scale, completely and accurately.

Cannot get data after multiple timeouts #26

Open sskmtm opened 1 year ago

sskmtm commented 1 year ago

I have observed several kinds of logs when data crawling fails. After each of the first two failures, the task is retried a few minutes later; after the third failure there are no further retries, and the task never enters the isRelevant (true) -> onBeforeFilter -> onBeforeExtract -> extract -> onAfterExtract -> onAfterFilter flow.

Questions: 1. Is giving up after three failures a framework mechanism, or can it be changed through some setting? 2. Is there a setting, or something I can do in code, so that a failed task still enters the isRelevant (true) -> onBeforeFilter -> onBeforeExtract -> extract -> onAfterExtract -> onAfterFilter flow? That would allow some post-processing (cleanup) to run.

First failure: Timeout to wait for document ready after 60 round, retry is supposed ⚠ Privacy leak warning U for N got 1601 0 <- 0 in 1m8.826s Trying 2th 5m later

22:19:32.142 [-worker-14] WARN  a.p.p.p.b.emulator.BrowserEmulator - Timeout to wait for document ready after 60 round, retry is supposed | https://www.amazon.com/dp/B00HXGSBXC
22:19:32.294 [-worker-14] INFO  a.p.p.p.b.e.c.MultiPrivacyContextManager - ⚠ Privacy leak warning 1/8 | 15#15GsQSA107 | 2787. Retry(1601) rs: Timeout to wait for document ready, rsp: PRIVACY
22:19:32.301 [5-thread-1] INFO  a.p.p.p.b.e.c.MultiPrivacyContextManager - Privacy context is inactive, closing it | 32m58s | 103wEu5102 | 
22:19:32.303 [5-thread-1] INFO  a.p.p.p.b.e.c.BrowserPrivacyContext - Privacy context #103wEu5102 has lived for 32m58s | success: 70(0.04 pages/s) | small: 0(0.0%) | traffic: 0 B(0 B/s) | tasks: 70 total run: 70 | [106.32.14.101:4283 => 106.32.14.101](0/70/0s)[retired idle] (st, 2), (pg, 70)
22:19:32.303 [5-thread-1] INFO  a.p.p.p.b.e.context.WebDriverContext - All tasks return in 0 seconds | 103wEu51021
22:19:32.304 [5-thread-1] INFO  a.p.p.p.b.d.BrowserAccompaniedDriverPoolCloser - Closing browser & driver pool with HEADLESS mode | {pulsar_chrome, 106.32.14.101:4283 | /var/folders/vr/_8xgwfn14959gb617jpn7gv40000gp/T/pulsar-kust/context/cx.103wEu5102}
22:19:32.326 [-worker-14] INFO  a.p.p.c.component.LoadComponent.Task - 2787. 💔 ⚡ U for N got 1601 0 <- 0 in 1m8.826s, fc:1/1 Retry(1601) rs: Timeout to wait for document ready, rsp: CRAWL | 15GsQSA107 | https://www.amazon.com/dp/B00HXGSBXC -parse -refresh
22:19:32.398 [-worker-14] INFO  a.p.p.c.impl.StreamingCrawler.Task - 2787. 🤺 Trying 2th 5m later |  U for N got 1601 0 <- 0 in 1m8.826s, fc:1/1 Retry(1601) rs: Timeout to wait for document ready, rsp: CRAWL | 15GsQSA107 | https://www.amazon.com/dp/B00HXGSBXC

Second failure: Page is ROBOT_CHECK ⚠ Privacy leak warning U for RT got 1601 0 <- 0 in 10.709s Trying 3th 7m later

22:24:44.338 [-worker-12] WARN  a.p.p.p.b.e.i.BrowserEmulatorImplBase - 2790. Page is ROBOT_CHECK(10.98 KiB) with [122.232.253.12:4245 => 122.232.253.12](0/0/24m18s)[ready] in amazon.com(0) | file:///var/folders/vr/_8xgwfn14959gb617jpn7gv40000gp/T/ln/1d326bbcba3ed428a4a1afd8dcd488fd.htm
22:24:44.446 [-worker-12] INFO  a.p.p.p.b.e.c.MultiPrivacyContextManager - ⚠ Privacy leak warning 1/8 | 16#16n2ckM108 | 2790. Retry(1601) rs: ROBOT_CHECK, rsp: PRIVACY
22:24:44.481 [-worker-12] INFO  a.p.p.c.component.LoadComponent.Task - 2790. 💔 🔃 U for RT got 1601 0 <- 0 in 10.709s, last fetched 5m12s ago, fc:2/2 Retry(1601) rs: ROBOT_CHECK, rsp: CRAWL | 16n2ckM108 | https://www.amazon.com/dp/B00HXGSBXC -parse
22:24:44.483 [-worker-12] INFO  a.p.p.c.impl.StreamingCrawler.Task - 2790. 🤺 Trying 3th 7m later |  U for RT got 1601 0 <- 0 in 10.709s, last fetched 5m12s ago, fc:2/2 Retry(1601) rs: ROBOT_CHECK, rsp: CRAWL | 16n2ckM108 | https://www.amazon.com/dp/B00HXGSBXC

Third failure: Timeout to wait for document ready after 60 round, retry is supposed ⚠ Privacy leak warning U for RT got 1601 0 <- 0 in 1m0.988s Gone (unexpected)

22:32:46.141 [-worker-12] WARN  a.p.p.p.b.emulator.BrowserEmulator - Timeout to wait for document ready after 60 round, retry is supposed | https://www.amazon.com/dp/B00HXGSBXC
22:32:46.265 [-worker-12] INFO  a.p.p.p.b.e.c.MultiPrivacyContextManager - ⚠ Privacy leak warning 2/8 | 15#15GsQSA107 | 2793. Retry(1601) rs: Timeout to wait for document ready, rsp: PRIVACY
22:32:46.266 [-worker-12] INFO  a.p.p.c.component.LoadComponent.Task - 2793. 💔 🔃 U for RT got 1601 0 <- 0 in 1m0.988s, last fetched 8m1s ago, fc:3/3 Retry(1601) rs: Timeout to wait for document ready, rsp: CRAWL | 15GsQSA107 | https://www.amazon.com/dp/B00HXGSBXC -parse
22:32:46.267 [-worker-12] INFO  a.p.p.c.impl.StreamingCrawler.Task - 2793. Gone (unexpected) U for RT got 1601 0 <- 0 in 1m0.988s, last fetched 8m1s ago, fc:3/3 Retry(1601) rs: Timeout to wait for document ready, rsp: CRAWL | 15GsQSA107 | https://www.amazon.com/dp/B00HXGSBXC
platonai commented 1 year ago

1. Is giving up after three failures a framework mechanism, or can it be changed through some setting?

Yes. See LoadOptions.nMaxRetry:

    /**
     * Retry to fetch at most n times, if page.fetchRetries > nMaxRetry,
     * the page is marked as gone and do not fetch it again until -refresh is set to clear page.fetchRetries
     * */
    @Parameter(names = ["-nmr", "-nMaxRetry", "--n-max-retry"],
        description = "Retry to fetch at most n times, if page.fetchRetries > nMaxRetry," +
                " the page is marked as gone and do not fetch it again until -refresh is set to clear page.fetchRetries")
    var nMaxRetry = 3
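
To make the policy in that doc comment concrete, here is a minimal self-contained sketch of the behavior it describes. The types and method names below are hypothetical, not the real PulsarRPA classes: each failed fetch increments `page.fetchRetries`, and once `fetchRetries` exceeds `nMaxRetry` the page is marked as gone and skipped until a `-refresh` clears the counter.

```kotlin
// Hypothetical sketch of the nMaxRetry policy; not the real PulsarRPA classes.
data class SketchPage(var fetchRetries: Int = 0, var isGone: Boolean = false)

class RetryPolicy(private val nMaxRetry: Int = 3) {
    /** Record a failed fetch; mark the page gone once retries are exhausted. */
    fun onFetchFailed(page: SketchPage) {
        page.fetchRetries++
        if (page.fetchRetries > nMaxRetry) page.isGone = true
    }

    /** Gone pages are not fetched again. */
    fun shouldFetch(page: SketchPage): Boolean = !page.isGone

    /** What -refresh does: clear fetchRetries so the page is fetchable again. */
    fun refresh(page: SketchPage) {
        page.fetchRetries = 0
        page.isGone = false
    }
}
```

Raising the limit (e.g. passing `-nMaxRetry 5` in the load arguments, per the `@Parameter` names above) or re-submitting the task with `-refresh` are the two levers this policy exposes.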

2. Is there a setting, or something I can do in code, so that a failed task still enters the isRelevant (true) -> onBeforeFilter -> onBeforeExtract -> extract -> onAfterExtract -> onAfterFilter flow? That would allow some post-processing (cleanup) to run.

  1. A failed page usually needs no further post-processing: either re-crawl it or ignore the task.
  2. The event handling mechanism provides rich handler points for running tasks throughout a web page's lifecycle. See: AdvancedAsinScraper.scrape()
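
As a rough illustration of where such cleanup hooks sit, the extraction flow quoted in this issue can be modeled as a handler chain. Everything below is hypothetical scaffolding, not the PulsarRPA API; see AdvancedAsinScraper.scrape() in the project for the real handler points.

```kotlin
// Hypothetical sketch of the flow quoted in this issue:
// isRelevant -> onBeforeFilter -> onBeforeExtract -> extract -> onAfterExtract -> onAfterFilter.
// The point is only that cleanup logic belongs in a lifecycle hook,
// not inside the extract step itself.
class ExtractFlow(
    val isRelevant: (String) -> Boolean = { true },
    val onBeforeFilter: (String) -> Unit = {},
    val onBeforeExtract: (String) -> Unit = {},
    val extract: (String) -> Map<String, String>,
    val onAfterExtract: (Map<String, String>) -> Unit = {},
    val onAfterFilter: (Map<String, String>) -> Unit = {},
) {
    /** Run the chain; returns null when the page is not relevant. */
    fun run(html: String): Map<String, String>? {
        if (!isRelevant(html)) return null
        onBeforeFilter(html)
        onBeforeExtract(html)
        val fields = extract(html)
        onAfterExtract(fields)
        onAfterFilter(fields)
        return fields
    }
}
```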
sskmtm commented 1 year ago
  1. The event handling mechanism provides rich handler points for running tasks throughout a web page's lifecycle. See: AdvancedAsinScraper.scrape()

A question: in which event handler can I observe page.crawlStatus.isGone == true?

I have tried all kinds of events and none of them caught it, even though it actually happened. Specifically, during the three retries, I never captured the Gone status.