openzim / zimit

Make a ZIM file from any Web site and surf offline!
GNU General Public License v3.0
335 stars 24 forks source link

solar.lowtechmagazine.com is very unstable #283

Closed benoit74 closed 3 months ago

benoit74 commented 7 months ago

In addition to #266 issue which has often been encountered on solar.lowtechmagazine.com recipe, the situation became clear the situation is even worse with last ZIM update I tried to perform which failed in very various manners.

All executions mentioned below ran with

First execution (which ran on athena18) succeeded from Zimfarm perspective, but the ZIM was only 53.61 MB instead of about 300-310 MB.

Second execution (which ran on athena18) failed with a new occurrence of #266 (see https://github.com/openzim/zimit/issues/266#issuecomment-1951858320 for new details I found out).

Third execution (which ran on athena18) succeeded from Zimfarm perspective, but the ZIM was only 63.57 MB instead of about 300-310 MB.

Fourth execution (which ran on ondemand) succeeded to produce a ZIM of about 300-310 MB. We still have 1388 failed pages (out of 2779), but we do not have any "Page crashed" message in the log. Looking at few failed pages, it looks like real issues (invalid href links) in the source website.

benoit74 commented 7 months ago

Diving a bit deeper into the third execution failure above, I've found that it is probably only a slightly different occurrence of #266

{"timestamp":"2024-02-17T04:16:43.909Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://solar.lowtechmagazine.com/2021/10/how-to-build-a-low-tech-solar-panel/","workerid":0}}
{"timestamp":"2024-02-17T04:16:44.358Z","logLevel":"error","context":"worker","message":"Page Crashed","details":{"type":"exception","message":"Page crashed!","stack":"Error: Page crashed!\n    at #onTargetCrashed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:284:28)\n    at file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:153:41\n    at file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:248\n    at Array.map (<anonymous>)\n    at Object.emit (file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:232)\n    at CDPSessionImpl.emit (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/EventEmitter.js:82:22)\n    at CDPSessionImpl._onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:425:18)\n    at Connection.onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:255:25)\n    at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/NodeWebSocketTransport.js:46:32)\n    at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)","page":"https://solar.lowtechmagazine.com/2021/10/how-to-build-a-low-tech-solar-panel/","workerid":0}}
{"timestamp":"2024-02-17T04:16:44.371Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed","details":{"loadState":0,"page":"https://solar.lowtechmagazine.com/2021/10/how-to-build-a-low-tech-solar-panel/","workerid":0}}
{"timestamp":"2024-02-17T04:16:44.729Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/"}}
{"timestamp":"2024-02-17T04:16:44.730Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":234,"total":1337,"pending":1,"failed":34,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-02-17T04:16:44.422Z\",\"url\":\"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/\",\"added\":\"2024-02-17T04:06:40.040Z\",\"depth\":2}"]}}
{"timestamp":"2024-02-17T04:16:45.061Z","logLevel":"warn","context":"general","message":"Link Extraction failed in frame","details":{"reason":{"name":"TargetCloseError"},"page":"https://solar.lowtechmagazine.com/2021/10/how-to-build-a-low-tech-solar-panel/","workerid":0}}
{"timestamp":"2024-02-17T04:16:45.151Z","logLevel":"info","context":"general","message":"Awaiting page load","details":{"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:16:45.499Z","logLevel":"error","context":"worker","message":"Page Crashed","details":{"type":"exception","message":"Page crashed!","stack":"Error: Page crashed!\n    at #onTargetCrashed (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:284:28)\n    at file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:153:41\n    at file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:248\n    at Array.map (<anonymous>)\n    at Object.emit (file:///app/node_modules/puppeteer-core/lib/esm/third_party/mitt/index.js:1:232)\n    at CDPSessionImpl.emit (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/EventEmitter.js:82:22)\n    at CDPSessionImpl._onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:425:18)\n    at Connection.onMessage (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:255:25)\n    at WebSocket.<anonymous> (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/NodeWebSocketTransport.js:46:32)\n    at callListener (/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:290:14)","page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:16:45.500Z","logLevel":"info","context":"pageStatus","message":"Page Finished","details":{"loadState":2,"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:16:46.046Z","logLevel":"warn","context":"general","message":"Link Extraction failed in frame","details":{"reason":{"name":"TargetCloseError"},"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:17:05.550Z","logLevel":"error","context":"worker","message":"New Window Timed Out","details":{"seconds":20,"workerid":0}}
{"timestamp":"2024-02-17T04:17:05.550Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"timed out","stack":"Error: timed out\n    at PageWorker.initPage (file:///app/util/worker.js:135:17)\n    at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n    at async PageWorker.run (file:///app/util/worker.js:227:7)\n    at async Promise.allSettled (index 0)\n    at async Crawler.crawl (file:///app/crawler.js:887:5)\n    at async Crawler.run (file:///app/crawler.js:323:7)"}}
{"timestamp":"2024-02-17T04:17:06.051Z","logLevel":"warn","context":"worker","message":"Retrying getting new page","details":{"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:17:26.053Z","logLevel":"error","context":"worker","message":"New Window Timed Out","details":{"seconds":20,"workerid":0}}
{"timestamp":"2024-02-17T04:17:26.054Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"timed out","stack":"Error: timed out\n    at PageWorker.initPage (file:///app/util/worker.js:135:17)\n    at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n    at async PageWorker.run (file:///app/util/worker.js:227:7)\n    at async Promise.allSettled (index 0)\n    at async Crawler.crawl (file:///app/crawler.js:887:5)\n    at async Crawler.run (file:///app/crawler.js:323:7)"}}
{"timestamp":"2024-02-17T04:17:26.554Z","logLevel":"warn","context":"worker","message":"Retrying getting new page","details":{"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:17:46.556Z","logLevel":"error","context":"worker","message":"New Window Timed Out","details":{"seconds":20,"workerid":0}}
{"timestamp":"2024-02-17T04:17:46.556Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"timed out","stack":"Error: timed out\n    at PageWorker.initPage (file:///app/util/worker.js:135:17)\n    at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n    at async PageWorker.run (file:///app/util/worker.js:227:7)\n    at async Promise.allSettled (index 0)\n    at async Crawler.crawl (file:///app/crawler.js:887:5)\n    at async Crawler.run (file:///app/crawler.js:323:7)"}}
{"timestamp":"2024-02-17T04:17:47.057Z","logLevel":"warn","context":"worker","message":"Retrying getting new page","details":{"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:18:07.057Z","logLevel":"error","context":"worker","message":"New Window Timed Out","details":{"seconds":20,"workerid":0}}
{"timestamp":"2024-02-17T04:18:07.058Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"timed out","stack":"Error: timed out\n    at PageWorker.initPage (file:///app/util/worker.js:135:17)\n    at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n    at async PageWorker.run (file:///app/util/worker.js:227:7)\n    at async Promise.allSettled (index 0)\n    at async Crawler.crawl (file:///app/crawler.js:887:5)\n    at async Crawler.run (file:///app/crawler.js:323:7)"}}
{"timestamp":"2024-02-17T04:18:07.557Z","logLevel":"warn","context":"worker","message":"Retrying getting new page","details":{"page":"https://solar.lowtechmagazine.com/2021/12/the-printed-website-volume-iii-and-more/","workerid":0}}
{"timestamp":"2024-02-17T04:18:09.084Z","logLevel":"error","context":"browser","message":"Browser disconnected (crashed?), interrupting crawl","details":{}}
{"timestamp":"2024-02-17T04:18:09.088Z","logLevel":"warn","context":"worker","message":"Error getting new page","details":{"workerid":0,"type":"exception","message":"Protocol error (Network.setUserAgentOverride): Session closed. Most likely the page has been closed.","stack":"TargetCloseError: Protocol error (Network.setUserAgentOverride): Session closed. Most likely the page has been closed.\n    at CDPSessionImpl.send (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Connection.js:405:35)\n    at NetworkManager.setUserAgent (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/NetworkManager.js:144:28)\n    at CDPPage.setUserAgent (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/common/Page.js:451:50)\n    at Browser.newWindowPageWithCDP (file:///app/util/browser.js:288:20)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async PageWorker.initPage (file:///app/util/worker.js:126:24)\n    at async PageWorker.runLoop (file:///app/util/worker.js:254:22)\n    at async PageWorker.run (file:///app/util/worker.js:227:7)\n    at async Promise.allSettled (index 0)\n    at async Crawler.crawl (file:///app/crawler.js:887:5)"}}
{"timestamp":"2024-02-17T04:18:09.088Z","logLevel":"info","context":"worker","message":"Starting page","details":{"workerid":0,"page":"https://solar.lowtechmagazine.com/2022/03/how-to-build-a-practical-household-bike-generator/"}}
{"timestamp":"2024-02-17T04:18:09.090Z","logLevel":"warn","context":"pageStatus","message":"Page Load Failed","details":{"loadState":0}}
{"timestamp":"2024-02-17T04:18:09.091Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":235,"total":1337,"pending":1,"failed":35,"limit":{"max":0,"hit":false},"pendingPages":["{\"seedId\":0,\"started\":\"2024-02-17T04:16:45.517Z\",\"url\":\"https://solar.lowtechmagazine.com/2022/03/how-to-build-a-practical-household-bike-generator/\",\"added\":\"2024-02-17T04:06:40.040Z\",\"depth\":2}"]}}
{"timestamp":"2024-02-17T04:18:09.112Z","logLevel":"info","context":"worker","message":"Worker exiting, all tasks complete","details":{"workerid":0}}
{"timestamp":"2024-02-17T04:18:09.153Z","logLevel":"info","context":"general","message":"Saving crawl state to: /output/.tmper_31529/collections/crawl-20240217040548965/crawls/crawl-20240217041809-5a6f97c85870.yaml","details":{}}
{"timestamp":"2024-02-17T04:18:09.176Z","logLevel":"info","context":"crawlStatus","message":"Crawl statistics","details":{"crawled":236,"total":1337,"pending":0,"failed":35,"limit":{"max":0,"hit":false},"pendingPages":[]}}
{"timestamp":"2024-02-17T04:18:09.178Z","logLevel":"info","context":"general","message":"Waiting to ensure pending data is written to WARCs...","details":{}}
{"timestamp":"2024-02-17T04:18:09.203Z","logLevel":"info","context":"general","message":"Crawling done","details":{}}
{"timestamp":"2024-02-17T04:18:09.204Z","logLevel":"info","context":"general","message":"Exiting, Crawl status: interrupted","details":{}}
Jaifroid commented 7 months ago

Remember this server is a solar-powered installation with a backup battery. I noticed that our scrapes always seem to be running when it's around 25% battery (coincidence?). Maybe this is irrelevant, but it's not a fast server. Any way to slow the crawl down so it doesn't overwhelm the server? Not sure if this would help, but it's a guess...

benoit74 commented 7 months ago

Thank you for the suggestion, I had this in mind as well.

I doubt there is a link because here the logs says that the problem is not a timeout while loading the page (which would indicate an upstream server issue) but a page crash or a new window timeout, meaning an issue with the brave browser on the scraper.

I will keep this in mind anyway.

benoit74 commented 4 months ago

Looks like Browsertrix crawler 1.0 might have enhanced the situation

benoit74 commented 3 months ago

I confirm that issue seems to be gone, I ran the recipe many times recently without any problem. Last runs produced quite a smaller ZIM but there is no significant message in the logs about pages which might be missing (143 links are broken on the solar.lowtechmagazine.com domain, but the ones I randomly checked are really not available online, so this is "normal").