Use WebDriver BiDi for Chrome

timvandermeij commented 2 months ago

I have been looking at the integration test code and noticed the following code: https://github.com/mozilla/pdf.js/blob/master/test/test.mjs#L908-L910

Some time ago I tried using WebDriver BiDi for Chrome too and I recall it failed pretty soon back then. However, I tried again today with the current Puppeteer version and the results are much better now: only one test failure remains! Hence I figured now might be a good time to re-investigate WebDriver BiDi compatibility for Chrome, since it would be great if we could move away from CDP for all browsers we support.

I tried debugging this test, FreeText Editor Freetext must stay focused after having been moved must keep the focus, but I can't figure out why this works with CDP and fails with WebDriver BiDi. The traceback of the failing test is:

Failures:
1) FreeText Editor Freetext must stay focused after having been moved must keep the focus
  Message:
    ProtocolError: Protocol error (script.callFunction): unknown error Runtime.callFunctionOn timed out. Increase the 'protocolTimeout' setting in launch/connect calls for a higher timeout if needed. ProtocolError: Runtime.callFunctionOn timed out. Increase the 'protocolTimeout' setting in launch/connect calls for a higher timeout if needed.
        at <instance_members_initializer> (file:///home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:89:14)
        at new Callback (file:///home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:93:16)
        at CallbackRegistry.create (file:///home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:19:26)
        at Connection._rawSend (file:///home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Connection.js:77:26)
        at CdpCDPSession.send (file:///home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/CDPSession.js:63:33)
        at CDPClientAdapter.sendCommand (file:///home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/BidiOverCdp.js:106:39)
        at WindowRealm.callFunction (/home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/chromium-bidi/lib/cjs/bidiMapper/modules/script/Realm.js:243:58)
        at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
        at async WindowRealm.callFunction (/home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/chromium-bidi/lib/cjs/bidiMapper/modules/script/WindowRealm.js:136:16)
        at async ScriptProcessor.callFunction (/home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/chromium-bidi/lib/cjs/bidiMapper/modules/script/ScriptProcessor.js:59:16)
  Stack:
        at <instance_members_initializer> (file:///home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:89:14)
        at new Callback (file:///home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:93:16)
        at CallbackRegistry.create (file:///home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:19:26)
        at BidiConnection.send (file:///home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/Connection.js:51:32)
        at Session.send (file:///home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/core/Session.js:172:42)
        at Session.<anonymous> (file:///home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/puppeteer-core/lib/esm/puppeteer/util/decorators.js:94:27)
        at WindowRealm.callFunction (file:///home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/core/Realm.js:92:51)
        at WindowRealm.<anonymous> (file:///home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/puppeteer-core/lib/esm/puppeteer/util/decorators.js:94:27)
        at #evaluate (file:///home/timvandermeij/Documenten/Ontwikkeling/pdf.js/Code/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/Realm.js:126:42)
        at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

Interestingly I found that the test passes with WebDriver BiDi if I comment out this: https://github.com/mozilla/pdf.js/blob/7290faf840ab5a3828d12bbbb460bd8ad49f6ffa/test/integration/freetext_editor_spec.mjs#L2362-L2366

I have enabled verbose logging using the Puppeteer debugging instructions from https://github.com/puppeteer/puppeteer/blob/ef8c4c808aa9dddd4d2501c8fc160cfcd0c4b9d1/docs/guides/debugging.md?plain=1#L131 but I can't really interpret the output because I don't know much about the WebDriver BiDi protocol and Puppeteer internals.

To reproduce the issue:

Clone the repository.
Run npm install.
Change cdp to webDriverBiDi in https://github.com/mozilla/pdf.js/blob/7290faf840ab5a3828d12bbbb460bd8ad49f6ffa/test/test.mjs#L881
Change it to fit in https://github.com/mozilla/pdf.js/blob/7290faf840ab5a3828d12bbbb460bd8ad49f6ffa/test/integration/freetext_editor_spec.mjs#L2315 to only run this test.
Run npx gulp integrationtest and notice that the test hangs. If the protocol is changed to CDP the test passes. The verbose debugging logging can be obtained with env DEBUG="puppeteer:*" npx gulp integrationtest, for which it helps to change https://github.com/mozilla/pdf.js/blob/master/test/test.mjs#L961 to only run Chrome (otherwise Firefox also spams the logs).

@whimboo Is this perhaps something you could help out with? It could have surfaced a bug in the WebDriver BiDi implementation, but because I'm stuck on debugging this I also can't create a reduced test case to report to upstream...

calixteman commented 2 months ago

I've a wip locally to use Bidi with chrome. Last time I tried I got this bug: https://github.com/puppeteer/puppeteer/issues/12111 which is now fixed. I'll try to push it in the next days.

timvandermeij commented 2 months ago

Nice; I hope it fixes this issue too then. If not, then https://github.com/puppeteer/puppeteer/issues/12111 is pretty clever because I hadn't thought of using the demo viewer to create a more isolated test case yet, so thanks for that idea! If needed we can then probably report the issue upstream as a reduced standalone script instead, which makes it easier to reproduce and debug.

whimboo commented 2 months ago

@whimboo Is this perhaps something you could help out with? It could have surfaced a bug in the WebDriver BiDi implementation, but because I'm stuck on debugging this I also can't create a reduced test case to report to upstream...

I cannot really help with the Chrome implementation of WebDriver BiDi and what might be the culprit here. @orkon could you may have a look at it? It looks like it's something around mouse clicks but it's interesting that the referenced code works some lines above and then the same click doesn't work when run again.

OrKoN commented 2 months ago

@whimboo thanks for pinging, I can take a look

OrKoN commented 2 months ago

Failures:
1) FreeText Editor Freetext must stay focused after having been moved must keep the focus
  Message:
    Not run because a beforeAll function failed. The beforeAll failure will be reported on the suite that caused it.
  Stack:
        at <Jasmine>

Suite error: FreeText Editor Freetext must stay focused after having been moved
  Message:
    TypeError: Cannot read properties of undefined (reading 'newPage')
  Stack:
        at file:///Users/alexrudenko/src/pdf.js/test/integration/test_utils.mjs:22:42
        at Array.map (<anonymous>)
        at loadAndWait (file:///Users/alexrudenko/src/pdf.js/test/integration/test_utils.mjs:21:32)
        at UserContext.<anonymous> (file:///Users/alexrudenko/src/pdf.js/test/integration/freetext_editor_spec.mjs:2308:21)
        at <Jasmine>
  Message:
    TypeError: Cannot read properties of undefined (reading 'map')
  Stack:
        at closePages (file:///Users/alexrudenko/src/pdf.js/test/integration/test_utils.mjs:80:11)
        at UserContext.<anonymous> (file:///Users/alexrudenko/src/pdf.js/test/integration/freetext_editor_spec.mjs:2312:13)
        at <Jasmine>

Ran 1 of 178 specs
1 spec, 2 failures
Finished in 0.011 seconds
file:///Users/alexrudenko/src/pdf.js/test/webserver.mjs:63
    this.server.close(callback);
                ^

TypeError: Cannot read properties of null (reading 'close')
    at WebServer.stop (file:///Users/alexrudenko/src/pdf.js/test/webserver.mjs:63:17)
    at stopServer (file:///Users/alexrudenko/src/pdf.js/test/test.mjs:1017:10)
    at file:///Users/alexrudenko/src/pdf.js/test/test.mjs:766:5
    at closeSession (file:///Users/alexrudenko/src/pdf.js/test/test.mjs:1040:28)
    at async Promise.all (index 1)
    at async startIntegrationTest (file:///Users/alexrudenko/src/pdf.js/test/test.mjs:821:3)

Node.js v20.10.0
[12:11:09] The following tasks did not complete: integrationtest, runIntegrationTest
[12:11:09] Did you forget to signal async completion?
file:///Users/alexrudenko/src/pdf.js/gulpfile.mjs:731
        throw new Error(`Running ${testsName} tests failed.`);
        ^

Error: Running integration tests failed.
    at ChildProcess.<anonymous> (file:///Users/alexrudenko/src/pdf.js/gulpfile.mjs:731:15)
    at ChildProcess.emit (node:events:514:28)
    at ChildProcess.emit (node:domain:551:15)
    at maybeClose (node:internal/child_process:1105:16)
    at ChildProcess._handle.onexit (node:internal/child_process:305:5)
    at Process.callbackTrampoline (node:internal/async_hooks:130:17)

I get the above error. Could you perhaps share the Puppeteer logs for that test run?

calixteman commented 2 months ago

Normally this test has been fixed in https://github.com/mozilla/pdf.js/pull/17962

OrKoN commented 2 months ago

Right, so the test passes for me on that branch. Is the OOM the only remaining issue?

OrKoN commented 2 months ago

I was able to run all integration tests locally on a Mac and I didn't see any extreme memory usage. I got two test failures but it looked like some wrong selectors? I would recommend setting protocolTimeout to some lower value like 10sec to catch any timeouts earlier if you do not expect to have long running calls.

calixteman commented 2 months ago

Right, so the test passes for me on that branch. Is the OOM the only remaining issue?

Yes the last tries with the PR above gave us this OOM crash on the linux bot only. Yesterday we got:

file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:69
            this._reject(callback, new TargetCloseError('Target closed'));
                                   ^

TargetCloseError: Protocol error (Runtime.callFunctionOn): Target closed
    at CallbackRegistry.clear (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:69:36)
    at CdpCDPSession._onClosed (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/CDPSession.js:98:25)
    at Connection.onMessage (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Connection.js:118:25)
    at WebSocket.<anonymous> (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/node/NodeWebSocketTransport.js:36:32)
    at callListener (C:\pdfjs\botio-files-pdfjs\private\92a94d4274292a5\node_modules\ws\lib\event-target.js:290:14)
    at WebSocket.onMessage (C:\pdfjs\botio-files-pdfjs\private\92a94d4274292a5\node_modules\ws\lib\event-target.js:209:9)
    at WebSocket.emit (node:events:513:28)
    at Receiver.receiverOnMessage (C:\pdfjs\botio-files-pdfjs\private\92a94d4274292a5\node_modules\ws\lib\websocket.js:1209:20)
    at Receiver.emit (node:events:513:28)
    at C:\pdfjs\botio-files-pdfjs\private\92a94d4274292a5\node_modules\ws\lib\receiver.js:608:16 {
  cause: ProtocolError
      at <instance_members_initializer> (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:89:14)
      at new Callback (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:93:16)
      at CallbackRegistry.create (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:19:26)
      at Connection._rawSend (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Connection.js:77:26)
      at CdpCDPSession.send (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/CDPSession.js:63:33)
      at CDPClientAdapter.sendCommand (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/BidiOverCdp.js:106:39)
      at WorkerRealm.serializeCdpObject (C:\pdfjs\botio-files-pdfjs\private\92a94d4274292a5\node_modules\chromium-bidi\lib\cjs\bidiMapper\modules\script\Realm.js:152:47)
      at C:\pdfjs\botio-files-pdfjs\private\92a94d4274292a5\node_modules\chromium-bidi\lib\cjs\bidiMapper\modules\log\LogManager.js:64:38
      at Array.map (<anonymous>)
      at C:\pdfjs\botio-files-pdfjs\private\92a94d4274292a5\node_modules\chromium-bidi\lib\cjs\bidiMapper\modules\log\LogManager.js:63:45
}

in the middle of the unit tests on the Windows 10 bot. For the timeouts, it's probably our fault and they can be fixed later.

OrKoN commented 2 months ago

Could you try it with the canary build (install npx puppeteer browsers install chrome@canary and provide executable path)?

calixteman commented 2 months ago

Done: https://github.com/mozilla/pdf.js/pull/17962#issuecomment-2063836024

calixteman commented 2 months ago

Same OOM crash with canary.

OrKoN commented 2 months ago

Which node version do you use?

calixteman commented 2 months ago

18.16.0

OrKoN commented 2 months ago

weird, I am trying on a Linux with the same version and the node memory usage is low and stable (running both chrome and firefox tests using the chrome_bidi branch). I only get a couple of tests failed same as on Mac:

Failures:
1) FreeText Editor FreeText with popup must not remove an empty annotation
  Message:
    ProtocolError: script.callFunction timed out. Increase the 'protocolTimeout' setting in launch/connect calls for a higher timeout if needed.
  Stack:
        at <instance_members_initializer> (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:89:14)
        at new Callback (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:93:16)
        at CallbackRegistry.create (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:19:26)
        at BidiConnection.send (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/Connection.js:51:32)
        at Session.send (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/core/Session.js:172:42)
        at Session.<anonymous> (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/util/decorators.js:94:27)
        at WindowRealm.callFunction (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/core/Realm.js:92:51)
        at WindowRealm.<anonymous> (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/util/decorators.js:94:27)
        at #evaluate (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/Realm.js:126:42)
        at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

2) FreeText Editor Freetext must stay focused after having been moved must keep the focus
  Message:
    ProtocolError: Waiting for selector `#pdfjs_internal_editor_0` failed: Protocol error (script.callFunction): unknown error Runtime.callFunctionOn timed out. Increase the 'protocolTimeout' setting in launch/connect calls for a higher timeout if needed. ProtocolError: Runtime.callFunctionOn timed out. Increase the 'protocolTimeout' setting in launch/connect calls for a higher timeout if needed.
        at <instance_members_initializer> (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:89:14)
        at new Callback (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:93:16)
        at CallbackRegistry.create (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:19:26)
        at Connection._rawSend (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Connection.js:77:26)
        at CdpCDPSession.send (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/CDPSession.js:63:33)
        at CDPClientAdapter.sendCommand (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/BidiOverCdp.js:106:39)
        at WindowRealm.callFunction (/usr/local/google/home/alexrudenko/pdf.js/node_modules/chromium-bidi/lib/cjs/bidiMapper/modules/script/Realm.js:243:58)
        at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
        at async WindowRealm.callFunction (/usr/local/google/home/alexrudenko/pdf.js/node_modules/chromium-bidi/lib/cjs/bidiMapper/modules/script/WindowRealm.js:136:16)
        at async ScriptProcessor.callFunction (/usr/local/google/home/alexrudenko/pdf.js/node_modules/chromium-bidi/lib/cjs/bidiMapper/modules/script/ScriptProcessor.js:59:16)
  Stack:
        at <instance_members_initializer> (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:89:14)
        at new Callback (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:93:16)
        at CallbackRegistry.create (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:19:26)
        at BidiConnection.send (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/Connection.js:51:32)
        at Session.send (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/core/Session.js:172:42)
        at Session.<anonymous> (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/util/decorators.js:94:27)
        at WindowRealm.callFunction (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/core/Realm.js:92:51)
        at WindowRealm.<anonymous> (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/util/decorators.js:94:27)
        at #evaluate (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/Realm.js:126:42)
        at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

178 specs, 2 failures
Finished in 494.777 seconds

Run 178 tests
OHNOES!  Some integration tests failed!
  2 of 178 failed
integration tests runtime was 503.6 seconds

OrKoN commented 2 months ago

Actually, after I tried to run it one more time and left it unattended without moving the mouse, all tests passed.

OrKoN commented 4 weeks ago

I wanted to try on a Windows 10 machine but for me web.archive.org times out. Is there another way to fetch the test files?

OrKoN commented 4 weeks ago

never mind, it was a network issue that I managed to solve.

OrKoN commented 4 weeks ago

So I ran it on Windows 10 and I do not see excessive memory use. Node peaks at 400 MB and drops to 200 MB during the test run.

188 specs, 0 failures
Finished in 270.978 seconds

Run 188 tests
All integration tests passed.
integration tests runtime was 277.9 seconds

timvandermeij commented 4 weeks ago

I have managed to reproduce the issue locally with Arch Linux, Node 20.14.0 LTS (same version as the bots) and running npx gulp botmakeref. The integration tests work fine, but just as on the bots it's the browser tests where it fails. Interestingly the OOM killer already kicks in for Chrome without WebDriver BiDi, so simply with the current CDP version on master already. The first PDFs work fine, but after around 100 PDFs memory usage suddenly spikes from around 500 MB to 2 GB and it doesn't get freed anymore.

Even more interesting is that Firefox also peaks to around 2 GB, but that actually gets released while in Chrome it doesn't get released anymore, so it does look like there is some sort of memory leak in either Chrome itself or in our test harness.

This is the spike that happens in both browsers ("Geheugen" = "Memory" here): Spike

This is afterwards, where it's visible that Firefox ("Web Content" process) is restored but Chrome stays high in memory until the OOM killer shuts it down eventually: Firefox restored, Chrome not

I'm not really sure what to make of this yet, or how to debug further. It's at least an indication that this problem is not directly to WebDriver BiDi but it does somehow make it worse on the bots (I do recall having had OOM issues with Chrome on Linux before but I never really looked into it until now, so this might not be a new issue but rather one that surfaces now because of WebDriver BiDi somehow).

OrKoN commented 4 weeks ago

@timvandermeij interesting, I am trying to run it now and it looks like it runs in the headless mode: is there a way to make it headful or perhaps double check that tabs are getting closed properly? perhaps some tabs remain open for longer than needed, maybe blocked by some alerts or beforeunload handlers or something like that?

OrKoN commented 4 weeks ago

The comment https://github.com/mozilla/pdf.js/pull/17962#issuecomment-2063780941 looked like the OOM happened in the node process not able to get more than 4 GB of memory.

timvandermeij commented 4 weeks ago

The comment https://github.com/mozilla/pdf.js/pull/17962#issuecomment-2063780941 looked like the OOM happened in the node process not able to get more than 4 GB of memory.

Good catch; I'm not sure what to make of that yet because Node.js itself shouldn't be doing much other than starting browsers through Puppeteer and kicking off the reference testing framework in said browsers. Locally I think it was in fact Chrome being killed by the OOM killer and not the Node.js process given the following dmesg output:

[ 3463.243540] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.kde.konsole-a16b62a0fe444e609eaa56b9e3c418ce.scope,task=chrome,pid=29291,uid=1000
[ 3463.243612] Out of memory: Killed process 29291 (chrome) total-vm:1188349660kB, anon-rss:412752kB, file-rss:0kB, shmem-rss:57804kB, UID:1000 pgtables:3064kB oom_score_adj:300
[ 3712.746608] nouveau 0000:01:00.0: fifo: fault 00 [READ] at 00000000002de000 engine 00 [PGRAPH] client 10 [HUB/] reason 02 [PAGE_NOT_PRESENT] on channel 17 [003f600000 chrome[30321]]
[ 3712.746619] nouveau 0000:01:00.0: fifo:000000:0011:[chrome[30321]] rc scheduled
[ 3712.746622] nouveau 0000:01:00.0: fifo:000000: rc scheduled
[ 3712.746641] nouveau 0000:01:00.0: fifo:000000:0011:0011:[chrome[30321]] errored - disabling channel

However, I'm not sure if either one might be a red herring because I don't think it's guaranteed that the OOM killer necessarily kills the process that is using most memory: that can be configured, so it could be that also on the bots Chrome used most memory but the OOM killer decided to kill another process first.

it looks like it runs in the headless mode: is there a way to make it headful

The reference tests run differently compared to the integration tests. The integration tests work like one would expect from a Puppeteer point of view: we start two browsers and for each test open/close tabs. However for the reference tests we use Puppeteer only to spawn the browsers for us, but they always have only one tab that points to our reference test framework. That framework in turn loads each PDF file, compares/takes snapshots and closes said PDF files again. It doesn't do that by opening new tabs to load the PDF, but instead by creating a canvas, rendering the PDF onto it, writing the canvas to a file and destroying the canvas. That explains why, unlike the integration tests, we don't have to open/close tabs and there is therefore pretty much no usage of e.g. Puppeteer APIs.

Now that I'm writing this it makes it even more interesting that the reference tests are where it fails on the bots and locally, so that could be an indication it's somehow caused by the reference testing framework. We should be cleaning up immediately after snapshotting, and a quick look at the code here and there seems to confirm that, but we should probably look into that more to make sure there is not a defect in there.

Note that for the reference tests we only run Chrome on Linux now already because quite some time ago we had to disable Chrome on Windows for the reference tests only in #14509 because of excessive run times. I have seen that before the OOM killer kicks in that Chrome locally also slows down quite significantly, so maybe that slowdown was also a sign of the issue we experience here.

I can't link any of these events for sure and therefore can't rule anything out yet, but it at least it looks like there may be a relation here. I have some ideas already looking at the code, and I'll try to think of a good way to debug this further so we can hopefully at least figure out where the memory leak is coming from (our own code, Node.js, Chrome, Puppeteer, etc.). You did give me some new ideas/pointers so far, so thank you for that!

timvandermeij commented 3 weeks ago

I have created a reduced version of the test framework/manifest to reproduce the issue much more easily. Steps to reproduce:

Clone https://github.com/timvandermeij/pdf.js/tree/reduce-test.
Run npx gulp botmakeref.
Notice that memory grows with about 1 GB from start to end of the process. (On my system it went from 4.1 GB to 5.2 GB.)
Change webDriverBiDi to cdp at https://github.com/mozilla/pdf.js/blob/master/test/test.mjs#L886.
Run npx gulp botmakeref.
Notice that memory stays fairly consistent from start to end of the process. (On my system it stayed at around 4.2 GB.)

I'll check if I can reduce it even further, but in the meantime could you check if you can reproduce the issue at least with these steps? The only variable here is the change from CDP to WebDriver BiDi, so that at least shows an unexpected behavior change (even though the root cause might be elsewhere, such as a leak in PDF.js or the test framework, but even then CDP is able to somehow cope with it...).

OrKoN commented 3 weeks ago

@timvandermeij is the high memory usage coming from the browser or from Puppeteer process? Chrome, the WebDriver BiDi implementation runs in the Puppeteer process.

timvandermeij commented 3 weeks ago

I think it's the Puppeteer process: the chrome process remains relatively stable in terms of memory usage, but the node process constantly increases in terms of memory usage with 100s of MBs since the start of the process. This would confirm what we spotted in https://github.com/mozilla/pdf.js/issues/17961#issuecomment-2152662126.

OrKoN commented 3 weeks ago

I seem to be able to reproduce the increases in the node process.

OrKoN commented 3 weeks ago

It peaked around 2.5 GB for me (intel mac).

OrKoN commented 3 weeks ago

I am gonna capture some memory snapshots.

timvandermeij commented 3 weeks ago

Awesome; thank you! In the meantime I'll try to further reduce the reproducer, but fortunately we now at least have a somewhat reduced reproducer and we know it's most likely not the browser but rather something in the Node process, and that feels like progress on this issue :-)

OrKoN commented 3 weeks ago

Looks like the issue that PDF files (or something rather large?) is retained in WebDriver BiDi's network storage.

OrKoN commented 3 weeks ago

I think some of the things are also retained by the runner but they seem to be eventually released.

timvandermeij commented 3 weeks ago

Right, those are most likely the snapshots. The browser loads each page of the PDF file, draws it on a canvas, takes a snapshot of the canvas and sends it from the browser to the test runner process (which can then e.g. save it to disk). This is done using network requests at https://github.com/mozilla/pdf.js/blob/master/test/driver.js#L1033-L1034, and the toDataUrl call there produces such PNG image blobs.

I did try setting dataUrl = null in the this_sendResult.then() handler in the hope to free the image data sooner, but that didn't change anything (but if the requests are indeed retained by WebDriver BiDi then that would also make sense because the variable data would be wiped but not the request data).

In the reproducer this method is called 2500 times, and if each image is around 600 KB as indicated in the screenshot then that'd amount to around 1.5 GB (however some of it eventually seems to be freed indeed, which probably makes it a bit less in practice and that's what we have seen).

OrKoN commented 3 weeks ago

Version 22.10.1 contains the fix for the network request storage. Could you please trigger the tests using that version?

timvandermeij commented 3 weeks ago

Wow, that's fast; thank you for the quick action! I'll try it out locally and prepare a patch to update our version of Puppeteer, and if that's successful we can retrigger the tests for the Chrome BiDi PR.

timvandermeij commented 3 weeks ago

I have confirmed locally that the memory leak indeed no longer happens: the node process now stays at around 100 MB and doesn't grow anymore with the reproducer, so that's great :tada:

I have created https://github.com/mozilla/pdf.js/pull/18231 to update Puppeteer and once that lands we can rebase the Chrome BiDi PR and retrigger the tests.

timvandermeij commented 3 weeks ago

Good news: in #17962 the Linux bot is happy now, so no more OOM, and the runtime is pretty much equal as with CDP.

However, Windows is a different story: while the OOM problems are solved, it looks like WebDriver BiDi in combination with Chrome on Windows is quite a bit slower, causing the tests to eventually hit the 1-hour timeout mark. While they do eventually complete, we have also seen some strange logs in the output. We'll look into that further, and it might be a general Windows bot issue because the Windows bot has always been slower than the Linux bot already, but in any case: thanks so far @OrKoN for fixing the OOM issue!

OrKoN commented 3 weeks ago

@timvandermeij do you also experience the slowness locally?

timvandermeij commented 3 weeks ago

Not on Linux, but I don't have a Windows machine to try it out on at the moment. We're currently testing it on the Windows bot to see if it can be reproduced and to exclude this being an issue with the bot itself somehow, and if that doesn't lead to insights then we can probably spin up a Windows VM or so to try to reproduce it.

OrKoN commented 3 weeks ago

can you see which tests are slow? as I understand the bot runs multiple kinds of tests in Chrome and Firefox.

calixteman commented 3 weeks ago

I've a windows machine, I'll test it locally. That said on the windows bot I noticed that a leftover Chrome for Testing process in the task manager taking something like 40% of the cpu, so I killed it and re-ran the tests to see if it's still there after the tests ended (and maybe it was there before our last run and then it slowed down enough the machine to make the tests timed out).

timvandermeij commented 3 weeks ago

Yes, that sounds likely, and looking at https://github.com/mozilla/pdf.js/pull/17962#issuecomment-2161290606 the tests now worked and within the expected runtime on Windows too. It might be that the left-over process was from one of the earlier runs. In any case, let's retry it a couple of times and check locally indeed, and hopefully it will keep working :-)

calixteman commented 3 weeks ago

The last try failed on the windows bot and I've the impress that caused a left-over Chrome spinning process which caused subsequent failures because it uses too much CPU. Here's a screenshot showing how this process has been ran: I took a dump and thanks to windbg, I got this stack:

FAULTING_THREAD:  00000bb8

PROCESS_NAME:  chrome.exe

ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION}  Breakpoint  A breakpoint has been reached.

EXCEPTION_CODE_STR:  80000003

STACK_TEXT:  
000000b4`f17fdb68 00007ff9`77c7e7d3     : 000019bc`04831520 00000233`b937a728 000019bc`0013da30 00007ff9`aa2506d7 : chrome!sqlite3_dbdata_init+0xf6ba00
000000b4`f17fdb70 00007ff9`77c7dc24     : 00000000`00000000 00000863`7bd05af7 00007ff9`79e2c130 000000b4`f17fded0 : chrome!ChromeMain+0xd7cd73
000000b4`f17fddf0 00007ff9`77c7d8ec     : aaaaaaaa`aaaaaaaa 000019bc`000fc000 000000b4`f17fe158 000019bc`00487e30 : chrome!ChromeMain+0xd7c1c4
000000b4`f17fdf00 00007ff9`77c7d433     : aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa : chrome!ChromeMain+0xd7be8c
000000b4`f17fdfb0 00007ff9`785e00c3     : 0000a849`ab9a7fa3 000019bc`0003c418 aaaaaaaa`aaaaaa00 0000a849`ab9a7f23 : chrome!ChromeMain+0xd7b9d3
000000b4`f17fe1d0 00007ff9`791268b0     : aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa : chrome!GetHandleVerifier+0x141823
000000b4`f17fe220 00007ff9`7a00da01     : 000019bc`0001d160 000019bc`0013dbe0 00000000`00002710 00007ff9`7958486f : chrome!IsSandboxedProcess+0x81f010
000000b4`f17fe9d0 00007ff9`79d7028a     : 00000000`00000032 00000000`00000008 000019bc`00053788 7fffffff`fffffff8 : chrome!IsSandboxedProcess+0x1706161
000000b4`f17fea70 00007ff9`76aa5802     : 00000080`74726174 00000000`00000000 00000000`00000000 00000000`00000120 : chrome!IsSandboxedProcess+0x14689ea
000000b4`f17feaf0 00007ff9`76f2ad9b     : 000000b4`f17fefc0 000019bc`00050008 00000000`00000048 00007ff9`7a3156d1 : chrome+0xad5802
000000b4`f17fec20 00007ff9`76f062b7     : 000000b4`f17fefc0 00004510`003890a0 00000000`00000000 00007ff9`78d6134a : chrome!ChromeMain+0x2933b
000000b4`f17fee20 00007ff9`76f04f04     : 00000000`00000000 00007ff9`ad1c9174 00007ff9`aab50000 00000000`00000000 : chrome!ChromeMain+0x4857
000000b4`f17fefa0 00007ff9`76f046ae     : 000000b4`f17ff1c0 00000000`00000001 000000b4`f17ff380 7fffffff`fffffff7 : chrome!ChromeMain+0x34a4
000000b4`f17ff130 00007ff9`76f01cfb     : 00007ff6`44db0000 00004510`002643c0 000000b4`f17ff650 00007ff6`44faf2c8 : chrome!ChromeMain+0x2c4e
000000b4`f17ff360 00007ff6`44db9424     : 00007ff6`44db0000 00007ff9`76f01a60 00000000`00000008 00006944`002a8300 : chrome!ChromeMain+0x29b
000000b4`f17ff600 00007ff6`44db7e08     : 00000000`00000001 000000b4`f17ffbd0 000000b4`f17ff920 000000b4`f17ff930 : chrome_exe+0x9424
000000b4`f17ff880 00007ff6`44ed9da2     : 00007ff6`44f9c090 00007ff6`44ed9e19 00000000`00000000 00000000`00000000 : chrome_exe+0x7e08
000000b4`f17ffc90 00007ff9`aa9d84d4     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : chrome_exe!IsSandboxedProcess+0x95242
000000b4`f17ffcd0 00007ff9`ad171791     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0x14
000000b4`f17ffd00 00000000`00000000     : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x21

When I opened the ProcExplorer I was expecting to see the chrome process as a child of node.js or something like that. I can share the dmp if someone knows how to extract something useful.

OrKoN commented 3 weeks ago

I recall there is an existing issue that crashpad handler is being launched on Windows unconditionally. I wonder if there is a crash during the test run? Also, I think the latest available version is 126, but it looks like the bot runs with 125.

calixteman commented 3 weeks ago

Yep probably something went wrong (see the end of http://54.193.163.58:8877/a53e51e62057105/output.txt) but I think in general puppeteer shouldn't let those processes (especially if they're using some cpu) living. That said, is there a way to send the crash data and to have access to the crash report (which maybe could help us to figure out what's wrong) ? When we've a crash with Firefox, we send the data and we've a report, for example: https://crash-stats.mozilla.org/report/index/a61a50b5-ba99-4740-b515-3fefc0240614. As far as I know we never had this kind of issue with CDP but maybe I'm wrong, @timvandermeij do you remember something similar ?

OrKoN commented 3 weeks ago

Crash reporting is disabled for Chrome for Testing builds. You could enable crash reporting in regular Chrome (requires manually ticking the checkbox when the profile is created or via the browser settings). It looks like in this case the Node process is killed on uncaught exception and so it does not close the browser properly. It might be another bug in chromium-bidi. Apart from the crash reporting, setting dumpio: true should make the crashes visible in the stdout if there are any.

timvandermeij commented 3 weeks ago

As far as I know we never had this kind of issue with CDP but maybe I'm wrong, @timvandermeij do you remember something similar ?

I don't recall having seen this particular issue with a left-over (active) browser process for CDP.

mozilla / pdf.js

Use WebDriver BiDi for Chrome #17961