gildas-lormeau / single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
GNU Affero General Public License v3.0

Timeout error on crawl #6

Closed: skanga closed this issue 1 year ago

skanga commented 1 year ago

I started a new crawl using single-file. I fetched 10 pages successfully and then it gave me this error.

Timed out after 60000 ms URL: https://xxx
Stack: ScriptTimeoutError: Timed out after 60000 ms
    at Object.throwDecodedError (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\node_modules\selenium-webdriver\lib\error.js:522:15)
    at parseHttpResponse (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\node_modules\selenium-webdriver\lib\http.js:549:13)
    at Executor.execute (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\node_modules\selenium-webdriver\lib\http.js:475:28)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async thenableWebDriverProxy.execute (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\node_modules\selenium-webdriver\lib\webdriver.js:735:17)
    at async getPageData (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\back-ends\webdriver-gecko.js:141:17)
    at async Object.exports.getPageData (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\back-ends\webdriver-gecko.js:37:10)
    at async capturePage (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\single-file-cli-api.js:253:20)
    at async runNextTask (C:\Users\SKANGA\AppData\Roaming\npm\node_modules\single-file-cli\single-file-cli-api.js:174:20)

Is it from the remote website? Perhaps I am crawling too fast? Is it possible to delay requests by some random time, etc.? Also, if I restart the same crawl, can I get single-file to ignore the pages that it has already downloaded?

gildas-lormeau commented 1 year ago

It's due to webdriver, which is not really designed for that. I would recommend using puppeteer instead (playwright is also supported, but you have to install it via npm).
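
For the other two questions (random delays between requests, and skipping pages that were already saved when a crawl is restarted), one possible workaround is a small wrapper script that drives single-file one URL at a time. The sketch below is not a built-in single-file feature. It assumes the CLI is on the PATH and accepts the basic `single-file <url> <output>` form, and the names `output_path_for`, `run_single_file`, and `crawl` are hypothetical helpers made up for this example. Any extra flags (for example, one that selects the puppeteer back-end) should be checked against `single-file --help` before being passed in `extra_args`.

```python
import hashlib
import random
import shutil
import subprocess
import time
from pathlib import Path
from typing import Callable, Iterable, List, Sequence


def output_path_for(url: str, out_dir: Path) -> Path:
    """Map a URL to a stable file name so a restarted run can detect earlier saves."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return out_dir / f"{digest}.html"


def run_single_file(url: str, target: Path, extra_args: Sequence[str] = ()) -> None:
    """Invoke the CLI as `single-file <url> <output>` (assumed form); raise on failure."""
    # shutil.which also resolves the .cmd shim that npm installs on Windows.
    executable = shutil.which("single-file")
    if executable is None:
        raise FileNotFoundError("single-file was not found on the PATH")
    subprocess.run([executable, url, str(target), *extra_args], check=True)


def crawl(
    urls: Iterable[str],
    out_dir: Path,
    save: Callable[[str, Path], None] = run_single_file,
    min_delay: float = 2.0,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Save each URL not already on disk, pausing a random time after each attempt."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved: List[str] = []
    for url in urls:
        target = output_path_for(url, out_dir)
        if target.exists():
            # Already downloaded in an earlier run: skip without contacting the site.
            continue
        try:
            save(url, target)
        except (subprocess.CalledProcessError, FileNotFoundError) as error:
            # Keep going; the page is retried on the next run because no file exists.
            print(f"failed: {url} ({error})")
        else:
            saved.append(url)
        # Random delay so requests are not sent at a fixed, rapid rate.
        sleep(random.uniform(min_delay, max_delay))
    return saved
```

Calling `crawl(list_of_urls, Path("pages"))` twice with the same list should only fetch, on the second run, the pages that failed or were not reached the first time. The delay bounds are arbitrary defaults, and this wrapper does not follow links itself, so the URL list has to be supplied up front.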

skanga commented 1 year ago

OK, I'll try those two if I get this ...