Closed benoit74 closed 3 months ago
Diving a bit deeper into the third execution failure above, I've found that it is probably only a slightly different occurrence of #266
Page crashed
Page crashed
crawl interupted by a limit
(we do not have the return code in the log but we have this zimit message which is displayed only when return code is 11.New Window Timed Out
message (never occurred before in this crawl session) and what looks like the final message Browser disconnected (crashed?), interrupting crawl
{"timestamp":"2024-02-17T04:16:43.909Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://solar.lowtechmagazine.com/2021/10/how-to-build-a-low-tech-solar-panel/","workerid":0}}
{"timestamp":"2024-02-17T04:16:44.358Z","logLevel":"error","context":"worker","message":"Page Crashed","details":{"type":"exception","message":"Page crashed!","stack":"Error: Page crashed!\n at #onTargetCrashed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:284:28)\n at file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:153:41\n at file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:248\n at Array.map (<anonymous>)\n at Object.emit (file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:232)\n at CDPSessionImpl.emit (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/EventEmitter.js:82:22)\n at CDPSessionImpl._onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:425:18)\n at Connection.onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:255:25)\n at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/NodeWebSocketTransport.js:46:32)\n at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)","page":"https://solar.lowtechmagazine.com/2021/10/how-to-build-a-low-tech-solar-panel/","workerid":0}}
{"timestamp":"2024-02-17T04:16:44.371Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed","details":{"loadState":0,"page":"https://solar.lowtechmagazine.com/2021/10/how-to-build-a-low-tech-solar-panel/","workerid":0}}
{"timestamp":"2024-02-17T04:16:44.729Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/"}}
{"timestamp":"2024-02-17T04:16:44.730Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":234,"total":1337,"pending":1,"failed":34,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-02-17T04:16:44.422Z\",\"url\":\"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/\",\"added\":\"2024-02-17T04:06:40.040Z\",\"depth\":2}"]}}
{"timestamp":"2024-02-17T04:16:45.061Z","logLevel":"warn","context":"general","message":"Link Extraction failed in frame","details":{"reason":{"name":"TargetCloseError"},"page":"https://solar.lowtechmagazine.com/2021/10/how-to-build-a-low-tech-solar-panel/","workerid":0}}
{"timestamp":"2024-02-17T04:16:45.151Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:16:45.499Z","logLevel":"error","context":"worker","message":"Page Crashed","details":{"type":"exception","message":"Page crashed!","stack":"Error: Page crashed!\n at #onTargetCrashed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:284:28)\n at file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:153:41\n at file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:248\n at Array.map (<anonymous>)\n at Object.emit (file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:232)\n at CDPSessionImpl.emit (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/EventEmitter.js:82:22)\n at CDPSessionImpl._onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:425:18)\n at Connection.onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:255:25)\n at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/NodeWebSocketTransport.js:46:32)\n at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)","page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:16:45.500Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":2,"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:16:46.046Z","logLevel":"warn","context":"general","message":"Link Extraction failed in frame","details":{"reason":{"name":"TargetCloseError"},"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:17:05.550Z","logLevel":"error","context":"worker","message":"New Window Timed Out","details":{"seconds":20,"workerid":0}}
{"timestamp":"2024-02-17T04:17:05.550Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"timed out","stack":"Error: timed out\n at PageWorker.initPage (file:///app/util/worker.js:135:17)\n at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n at async PageWorker.run (file:///app/util/worker.js:227:7)\n at async Promise.allSettled (index 0)\n at async Crawler.crawl (file:///app/crawler.js:887:5)\n at async Crawler.run (file:///app/crawler.js:323:7)"}}
{"timestamp":"2024-02-17T04:17:06.051Z","logLevel":"warn","context":"worker","message":"Retrying getting new page","details":{"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:17:26.053Z","logLevel":"error","context":"worker","message":"New Window Timed Out","details":{"seconds":20,"workerid":0}}
{"timestamp":"2024-02-17T04:17:26.054Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"timed out","stack":"Error: timed out\n at PageWorker.initPage (file:///app/util/worker.js:135:17)\n at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n at async PageWorker.run (file:///app/util/worker.js:227:7)\n at async Promise.allSettled (index 0)\n at async Crawler.crawl (file:///app/crawler.js:887:5)\n at async Crawler.run (file:///app/crawler.js:323:7)"}}
{"timestamp":"2024-02-17T04:17:26.554Z","logLevel":"warn","context":"worker","message":"Retrying getting new page","details":{"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:17:46.556Z","logLevel":"error","context":"worker","message":"New Window Timed Out","details":{"seconds":20,"workerid":0}}
{"timestamp":"2024-02-17T04:17:46.556Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"timed out","stack":"Error: timed out\n at PageWorker.initPage (file:///app/util/worker.js:135:17)\n at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n at async PageWorker.run (file:///app/util/worker.js:227:7)\n at async Promise.allSettled (index 0)\n at async Crawler.crawl (file:///app/crawler.js:887:5)\n at async Crawler.run (file:///app/crawler.js:323:7)"}}
{"timestamp":"2024-02-17T04:17:47.057Z","logLevel":"warn","context":"worker","message":"Retrying getting new page","details":{"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:18:07.057Z","logLevel":"error","context":"worker","message":"New Window Timed Out","details":{"seconds":20,"workerid":0}}
{"timestamp":"2024-02-17T04:18:07.058Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"timed out","stack":"Error: timed out\n at PageWorker.initPage (file:///app/util/worker.js:135:17)\n at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n at async PageWorker.run (file:///app/util/worker.js:227:7)\n at async Promise.allSettled (index 0)\n at async Crawler.crawl (file:///app/crawler.js:887:5)\n at async Crawler.run (file:///app/crawler.js:323:7)"}}
{"timestamp":"2024-02-17T04:18:07.557Z","logLevel":"warn","context":"worker","message":"Retrying getting new page","details":{"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:18:09.084Z","logLevel":"error","context":"browser","message":"Browser disconnected (crashed?), interrupting crawl","details":{}}
{"timestamp":"2024-02-17T04:18:09.088Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"Protocol error (Network.setUserAgentOverride): Session closed. Most likely the page has been closed.","stack":"TargetCloseError: Protocol error (Network.setUserAgentOverride): Session closed. Most likely the page has been closed.\n at CDPSessionImpl.send (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:405:35)\n at NetworkManager.setUserAgent (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/NetworkManager.js:144:28)\n at CDPPage.setUserAgent (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:451:50)\n at Browser.newWindowPageWithCDP (file:///app/util/browser.js:288:20)\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async PageWorker.initPage (file:///app/util/worker.js:126:24)\n at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n at async PageWorker.run (file:///app/util/worker.js:227:7)\n at async Promise.allSettled (index 0)\n at async Crawler.crawl (file:///app/crawler.js:887:5)"}}
{"timestamp":"2024-02-17T04:18:09.088Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://solar.lowtechmagazine.com/2022/03/how-to-build-a-practical-household-bike-generator/"}}
{"timestamp":"2024-02-17T04:18:09.090Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed","details":{"loadState":0}}
{"timestamp":"2024-02-17T04:18:09.091Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":235,"total":1337,"pending":1,"failed":35,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-02-17T04:16:45.517Z\",\"url\":\"https://solar.lowtechmagazine.com/2022/03/how-to-build-a-practical-household-bike-generator/\",\"added\":\"2024-02-17T04:06:40.040Z\",\"depth\":2}"]}}
{"timestamp":"2024-02-17T04:18:09.112Z","logLevel":"info","context":"worker","message":"Worker exiting, all tasks complete","details":{"workerid":0}}
{"timestamp":"2024-02-17T04:18:09.153Z","logLevel":"info","context":"general","message":"Saving crawl state to: /output/.tmper_31529/collections/crawl-20240217040548965/crawls/crawl-20240217041809-5a6f97c85870.yaml","details":{}}
{"timestamp":"2024-02-17T04:18:09.176Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":236,"total":1337,"pending":0,"failed":35,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2024-02-17T04:18:09.178Z","logLevel":"info","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"timestamp":"2024-02-17T04:18:09.203Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2024-02-17T04:18:09.204Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: interrupted","details":{}}
Remember this server is a solar-powered installation with a backup battery. I noticed that our scrapes always seem to be running when it's around 25% battery (coincidence?). Maybe this is irrelevant, but it's not a fast server. Any way to slow the crawl down so it doesn't overwhelm the server? Not sure if this would help, but it's a guess...
Thank you for the suggestion, I had this in mind as well.
I doubt there is a link because here the logs says that the problem is not a timeout while loading the page (which would indicate an upstream server issue) but a page crash or a new window timeout, meaning an issue with the brave browser on the scraper.
I will keep this in mind anyway.
Looks like Browsertrix crawler 1.0 might have enhanced the situation
I confirm that issue seems to be gone, I ran the recipe many times recently without any problem. Last runs produced quite a smaller ZIM but there is no significant message in the logs about pages which might be missing (143 links are broken on the solar.lowtechmagazine.com domain, but the ones I randomly checked are really not available online, so this is "normal").
In addition to #266 issue which has often been encountered on solar.lowtechmagazine.com recipe, the situation became clear the situation is even worse with last ZIM update I tried to perform which failed in very various manners.
All executions mentioned below ran with
First execution (which ran on athena18) succeeded from Zimfarm perspective, but the ZIM was only 53.61 MB instead of about 300-310 MB.
Second execution (which ran on athena18) failed with a new occurrence of #266 (see https://github.com/openzim/zimit/issues/266#issuecomment-1951858320 for new details I found out).
Third execution (which ran on athena18) succeeded from Zimfarm perspective, but the ZIM was only 63.57 MB instead of about 300-310 MB.
Fourth execution (which ran on ondemand) succeeded to produce a ZIM of about 300-310 MB. We still have 1388 failed pages (out of 2779), but we do not have any "Page crashed" message in the log. Looking at few failed pages, it looks like real issues (invalid href links) in the source website.