gildas-lormeau / single-file-cli

CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)

Occasional stack traces from the CLI #37

Open andrewdbate opened 2 years ago

andrewdbate commented 2 years ago

I am running the following command from a Bash shell (MinGW on Windows 10):

docker run --mount "type=bind,src=$PWD/cookiedir,dst=/cookiedir" --mount "type=bind,src=$PWD/sitedir,dst=/sitedir" singlefile --browser-cookies-file=/cookiedir/cookies.txt --urls-file="/sitedir/urls.txt" --output-directory="/sitedir" --dump-content=false --filename-template="{url-pathname-flat}.html"

Note that I am using the Docker image and the --urls-file option.

Sometimes I get the following error:

Execution context was destroyed, most likely because of a navigation. URL: <redacted>
Stack: Error: Execution context was destroyed, most likely because of a navigation.
    at rewriteError (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:265:23)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async ExecutionContext._evaluateInternal (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:219:60)
    at async ExecutionContext.evaluate (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:110:16)
    at async getPageData (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:139:10)
    at async Object.exports.getPageData (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:51:10)
    at async capturePage (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:248:20)
    at async runNextTask (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:169:20)
    at async Promise.all (index 0)

Sometimes I get the following different error:

Navigation failed because browser has disconnected! URL: <redacted>
Stack: Error: Navigation failed because browser has disconnected!
    at /usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/LifecycleWatcher.js:51:147
    at /usr/src/app/node_modules/puppeteer-core/lib/cjs/vendor/mitt/src/index.js:51:62
    at Array.map (<anonymous>)
    at Object.emit (/usr/src/app/node_modules/puppeteer-core/lib/cjs/vendor/mitt/src/index.js:51:43)
    at CDPSession.emit (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/EventEmitter.js:72:22)
    at CDPSession._onClosed (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:256:14)
    at Connection._onMessage (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:99:25)
    at WebSocket.<anonymous> (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/node/NodeWebSocketTransport.js:13:32)
    at WebSocket.onMessage (/usr/src/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:132:16)
    at WebSocket.emit (events.js:315:20)

I can download the pages at the failed URLs by trying again. However, I would usually only expect a stack trace from an internal error, not from a network or connection error (or whatever the underlying cause is here).

One difficulty I have is that there is no option to "resume" downloading pages should some pages fail to download. Utilities such as youtube-dl allow you to run them a second time to continue downloading files that were not downloaded in the previous run.

  1. It would be good if the above errors were more user-friendly (or explained what to do to fix the problem).
  2. It would also be good if downloads from a list of URLs could be resumed when interrupted or incomplete (similar to youtube-dl, for example); a possible workaround is sketched after this list.
  3. Finally, is it guaranteed that if there is an error, then no file will be produced (i.e. HTML files are only created after a successful download)? If partial or zero-byte files can be left behind after an error, then one has to inspect the log to be sure that all pages downloaded correctly. (youtube-dl avoids this by writing .part files that are renamed only once the download is complete, which also allows downloads to be resumed.)
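
One possible stopgap for points 2 and 3, sketched below (this is not something SingleFile CLI provides): filter the URLs file down to the entries whose output file does not exist yet, then re-run on the remainder. The flatten function is a hypothetical placeholder for the {url-pathname-flat} naming rule and would have to match it exactly to be useful.

flatten() {
  # Placeholder logic only: strip the scheme and host, replace "/" with "_".
  # The real {url-pathname-flat} rule may differ.
  echo "$1" | sed -e 's|^[a-z]*://[^/]*/||' -e 's|/|_|g'
}

: > sitedir/urls-remaining.txt
while read -r url; do
  [ -f "sitedir/$(flatten "$url").html" ] || echo "$url" >> sitedir/urls-remaining.txt
done < sitedir/urls.txt
# Then re-run the docker command above with --urls-file=/sitedir/urls-remaining.txt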

Many thanks!

andrewdbate commented 2 years ago

Here is another error that I sometimes get:

Protocol error: Connection closed. Most likely the page has been closed. URL: <redacted>
Stack: Error: Protocol error: Connection closed. Most likely the page has been closed.
    at Object.assert (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/assert.js:26:15)
    at Page.close (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Page.js:2069:21)
    at Object.exports.getPageData (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:54:15)
    at async capturePage (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:248:20)
    at async runNextTask (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:169:20)
    at async Promise.all (index 0)
    at async runNextTask (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:185:3)
    at async Promise.all (index 0)
    at async runNextTask (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:185:3)
    at async Promise.all (index 0)

And another:

net::ERR_ABORTED at <redacted> URL: <redacted>
Stack: Error: net::ERR_ABORTED at <redacted>
    at navigate (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/FrameManager.js:116:23)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async FrameManager.navigateFrame (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/FrameManager.js:91:21)
    at async Frame.goto (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/FrameManager.js:417:16)
    at async Page.goto (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Page.js:1156:16)
    at async pageGoto (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:187:3)
    at async handleJSRedirect (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:164:3)
    at async getPageData (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:148:21)
    at async Object.exports.getPageData (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:51:10)

(I have had to redact the URL from the errors, but nothing else was changed.)

andrewdbate commented 2 years ago

Would it be possible to have the SingleFile CLI automatically retry in case of an error?

All of these errors come from Puppeteer. Would it be more reliable to use --back-end=jsdom instead?

What are the advantages / disadvantages of using jsdom instead of Chrome with SingleFile?

gildas-lormeau commented 2 years ago

Out of curiosity, since you don't mention them: did you try using the options --crawl-save-session, --crawl-load-session, or --crawl-sync-session?

andrewdbate commented 2 years ago

I followed your other suggestion, which was to use the sitemap.xml, so I do not need the crawl options.

(Also, to follow up on my comments about jsdom: it doesn't seem to work as well as Chrome. When I tried using jsdom to download various Wikipedia pages, the images were missing.)

gildas-lormeau commented 2 years ago

I recommend using --crawl-save-session, for example; it will allow you to identify which URLs failed when processing multiple URLs.

The errors you see are related to puppeteer. You could use playwright as an alternative, but you have to install it manually with npm by running npm install playwright in the SingleFile folder. Then you can pass --back-end=playwright to use it.
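
For example (a sketch assuming a local npm install of SingleFile rather than the Docker image, and that the CLI is invoked as single-file; adjust paths and options to your setup):

cd /path/to/single-file   # folder where SingleFile CLI is installed (path assumed)
npm install playwright    # install playwright next to SingleFile
single-file --urls-file="urls.txt" --output-directory="out" --back-end=playwright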

andrewdbate commented 2 years ago

Here is another error I sometimes get:

Protocol error (Target.closeTarget): Target closed. URL: <redacted>
Stack: Error: Protocol error (Target.closeTarget): Target closed.
    at /usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:71:63
    at new Promise (<anonymous>)
    at Connection.send (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:70:16)
    at Page.close (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Page.js:2075:44)
    at Object.exports.getPageData (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:54:15)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async capturePage (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:248:20)
    at async runNextTask (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:169:20)
    at async Promise.all (index 0)

However, the error I posted in an earlier comment, "Execution context was destroyed, most likely because of a navigation", is by far the most frequently occurring.

As mentioned above, I have been able to download the page successfully by retrying; however, this requires manual intervention (although I am trying to use bash scripts where possible). Hence my question about whether SingleFile CLI could automatically retry in case of an error.
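
A simple wrapper along these lines could retry the whole run a few times (a rough sketch reusing the docker command from the first comment; it assumes the run exits with a non-zero status when something fails, which may not hold for per-URL capture errors):

#!/bin/bash
# Retry the whole SingleFile run up to 3 times (sketch only).
for attempt in 1 2 3; do
  if docker run \
      --mount "type=bind,src=$PWD/cookiedir,dst=/cookiedir" \
      --mount "type=bind,src=$PWD/sitedir,dst=/sitedir" \
      singlefile \
      --browser-cookies-file=/cookiedir/cookies.txt \
      --urls-file="/sitedir/urls.txt" \
      --output-directory="/sitedir" \
      --dump-content=false \
      --filename-template="{url-pathname-flat}.html"; then
    break
  fi
  echo "Attempt $attempt failed, retrying..." >&2
done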

andrewdbate commented 2 years ago

(I think we commented at the same time.)

Is playwright more reliable in your experience?

Can I use --crawl-save-session even though I do not want to crawl? (I prefer to use the sitemap.xml now because I know it is accurate, i.e., it contains all pages that need to be downloaded.)

gildas-lormeau commented 2 years ago

I don't know if playwright is more reliable; I have not done any intensive testing. It's a very popular alternative to puppeteer, though.

Crawling in SingleFile CLI means processing multiple URLs in a batch. The option --crawl-save-session should work if you use --urls-file, for example.

Regarding the intermittent errors you're encountering: maybe SingleFile consumes too much CPU. Did you try setting --max-parallel-workers to 2, for example? (It should not be higher than your number of logical CPU cores.)
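
For example, something along these lines (a sketch; --crawl-save-session is assumed to take a session file path, and single-file stands for however the CLI is invoked in your setup):

single-file --urls-file="urls.txt" --output-directory="out" \
  --crawl-save-session="session.json" --max-parallel-workers=2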

andrewdbate commented 2 years ago

I thought I'd try out the --crawl-save-session option with --urls-file to see how it works. I used a URLs file with ~3800 URLs.

I did run out of memory at some point:

<--- Last few GCs --->

[29912:0000024F43C1BF10]  4559331 ms: Scavenge (reduce) 4093.2 (4102.8) -> 4093.2 (4105.0) MB, 4.1 / 0.0 ms  (average mu = 0.625, current mu = 0.579) allocation failure
[29912:0000024F43C1BF10]  4559343 ms: Scavenge (reduce) 4093.9 (4108.0) -> 4093.7 (4108.8) MB, 5.6 / 0.0 ms  (average mu = 0.625, current mu = 0.579) allocation failure
[29912:0000024F43C1BF10]  4559401 ms: Scavenge (reduce) 4094.3 (4104.0) -> 4094.2 (4106.0) MB, 5.1 / 0.0 ms  (average mu = 0.625, current mu = 0.579) allocation failure

<--- JS stacktrace --->

FATAL ERROR: MarkCompactCollector: young object promotion failed Allocation failed - JavaScript heap out of memory
 1: 00007FF7D1F3052F napi_wrap+109311
 2: 00007FF7D1ED5256 v8::internal::OrderedHashTable<v8::internal::OrderedHashSet,1>::NumberOfElementsOffset+33302
<rest of trace elided>

But I was able to resume the downloads with the --crawl-sync-session option and download all the URLs. So that's good!
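
Roughly, the pattern is: capture with --crawl-save-session, then point --crawl-sync-session at the same session file to pick up where the crashed run left off (a sketch; the option argument syntax and the session.json name are assumptions):

# First run: record progress while capturing
single-file --urls-file="urls.txt" --output-directory="out" --crawl-save-session="session.json"
# After a crash or interruption: reuse the session file to skip the completed URLs
single-file --urls-file="urls.txt" --output-directory="out" --crawl-sync-session="session.json"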

Since the crawl session file is modified during the crawl, what happens if we get a crash (like the one above)? Is the file modified in a crash-proof way (i.e., a crash won't leave it in an inconsistent state, e.g. not valid JSON)?

Also, is it guaranteed that if there is an error, then no HTML file will be created (i.e. HTML files are only created after a successful download, with no partial or zero-byte files)?

gildas-lormeau commented 2 years ago

I did run out of memory at some point:

<OOM trace quoted above elided>

Was it the Node or Chrome processes?

But I was able to resume the downloads with the --crawl-sync-session option and download all the URLs. So that's good!

Glad to hear it :)

Since the crawl session file is modified during the crawl, what happens if we get a crash (like the one above)? Is the file modified in a crash-proof way (i.e., a crash won't leave it in an inconsistent state, e.g. not valid JSON)?

I don't know yet; I need to read the doc. Edit: it's not documented, cf. https://nodejs.org/api/fs.html#fswritefilesyncfile-data-options

Also, is it guaranteed that if there is an error, then no HTML file will be created (i.e. HTML files are only created after a successful download, with no partial or zero-byte files)?

Yes. However, I cannot guarantee they will be complete, for the same reason as in the answer to the previous question.

andrewdbate commented 2 years ago

It was the Node process that crashed. The full stack trace was:

<--- Last few GCs --->

[29912:0000024F43C1BF10]  4559331 ms: Scavenge (reduce) 4093.2 (4102.8) -> 4093.2 (4105.0) MB, 4.1 / 0.0 ms  (average mu = 0.625, current mu = 0.579) allocation failure
[29912:0000024F43C1BF10]  4559343 ms: Scavenge (reduce) 4093.9 (4108.0) -> 4093.7 (4108.8) MB, 5.6 / 0.0 ms  (average mu = 0.625, current mu = 0.579) allocation failure
[29912:0000024F43C1BF10]  4559401 ms: Scavenge (reduce) 4094.3 (4104.0) -> 4094.2 (4106.0) MB, 5.1 / 0.0 ms  (average mu = 0.625, current mu = 0.579) allocation failure

<--- JS stacktrace --->

FATAL ERROR: MarkCompactCollector: young object promotion failed Allocation failed - JavaScript heap out of memory
 1: 00007FF7D1F3052F napi_wrap+109311
 2: 00007FF7D1ED5256 v8::internal::OrderedHashTable<v8::internal::OrderedHashSet,1>::NumberOfElementsOffset+33302
 3: 00007FF7D1ED6026 node::OnFatalError+294
 4: 00007FF7D27A163E v8::Isolate::ReportExternalAllocationLimitReached+94
 5: 00007FF7D27864BD v8::SharedArrayBuffer::Externalize+781
 6: 00007FF7D263094C v8::internal::Heap::EphemeronKeyWriteBarrierFromCode+1516
 7: 00007FF7D261B58B v8::internal::NativeContextInferrer::Infer+59243
 8: 00007FF7D2600ABF v8::internal::MarkingWorklists::SwitchToContextSlow+57327
 9: 00007FF7D261470B v8::internal::NativeContextInferrer::Infer+30955
10: 00007FF7D260B82D v8::internal::MarkCompactCollector::EnsureSweepingCompleted+6269
11: 00007FF7D261395E v8::internal::NativeContextInferrer::Infer+27454
12: 00007FF7D26178EB v8::internal::NativeContextInferrer::Infer+43723
13: 00007FF7D2621142 v8::internal::ItemParallelJob::Task::RunInternal+18
14: 00007FF7D26210D1 v8::internal::ItemParallelJob::Run+641
15: 00007FF7D25F49D3 v8::internal::MarkingWorklists::SwitchToContextSlow+7939
16: 00007FF7D260BCDC v8::internal::MarkCompactCollector::EnsureSweepingCompleted+7468
17: 00007FF7D260A524 v8::internal::MarkCompactCollector::EnsureSweepingCompleted+1396
18: 00007FF7D2608088 v8::internal::MarkingWorklists::SwitchToContextSlow+87480
19: 00007FF7D26366D1 v8::internal::Heap::LeftTrimFixedArray+929
20: 00007FF7D26387B5 v8::internal::Heap::PageFlagsAreConsistent+789
21: 00007FF7D262DA61 v8::internal::Heap::CollectGarbage+2033
22: 00007FF7D2634855 v8::internal::Heap::GlobalSizeOfObjects+229
23: 00007FF7D266EC9B v8::internal::StackGuard::HandleInterrupts+891
24: 00007FF7D237DB26 v8::internal::interpreter::JumpTableTargetOffsets::iterator::operator=+8182
25: 00007FF7D2829FED v8::internal::SetupIsolateDelegate::SetupHeap+463949
26: 00007FF7D280B393 v8::internal::SetupIsolateDelegate::SetupHeap+337907
27: 000001CD8DDD17CF

gildas-lormeau commented 2 years ago

Do you know if this memory leak error is more likely to occur when there are capture errors?

andrewdbate commented 2 years ago

I didn't see any errors printed to standard error or output before the stack trace from Node.

gildas-lormeau commented 2 years ago

I'll try to reproduce the issue. Do you use puppeteer or playwright?

andrewdbate commented 2 years ago

I used the default, Puppeteer.