Open timvandermeij opened 7 months ago
I have a WIP locally to use BiDi with Chrome. The last time I tried I hit this bug: https://github.com/puppeteer/puppeteer/issues/12111, which is now fixed. I'll try to push it in the next few days.
Nice; I hope it fixes this issue too then. If not, then the approach in https://github.com/puppeteer/puppeteer/issues/12111 is pretty clever, because I hadn't thought of using the demo viewer to create a more isolated test case yet, so thanks for that idea! If needed we can then probably report the issue upstream as a reduced standalone script instead, which makes it easier to reproduce and debug.
@whimboo Is this perhaps something you could help out with? It could have surfaced a bug in the WebDriver BiDi implementation, but because I'm stuck on debugging this I also can't create a reduced test case to report to upstream...
I cannot really help with the Chrome implementation of WebDriver BiDi and what might be the culprit here. @orkon, could you maybe have a look at it? It looks like it's something around mouse clicks, but it's interesting that the referenced code works a few lines above and then the same click doesn't work when run again.
@whimboo thanks for pinging, I can take a look
Failures:
1) FreeText Editor Freetext must stay focused after having been moved must keep the focus
Message:
Not run because a beforeAll function failed. The beforeAll failure will be reported on the suite that caused it.
Stack:
at <Jasmine>
Suite error: FreeText Editor Freetext must stay focused after having been moved
Message:
TypeError: Cannot read properties of undefined (reading 'newPage')
Stack:
at file:///Users/alexrudenko/src/pdf.js/test/integration/test_utils.mjs:22:42
at Array.map (<anonymous>)
at loadAndWait (file:///Users/alexrudenko/src/pdf.js/test/integration/test_utils.mjs:21:32)
at UserContext.<anonymous> (file:///Users/alexrudenko/src/pdf.js/test/integration/freetext_editor_spec.mjs:2308:21)
at <Jasmine>
Message:
TypeError: Cannot read properties of undefined (reading 'map')
Stack:
at closePages (file:///Users/alexrudenko/src/pdf.js/test/integration/test_utils.mjs:80:11)
at UserContext.<anonymous> (file:///Users/alexrudenko/src/pdf.js/test/integration/freetext_editor_spec.mjs:2312:13)
at <Jasmine>
Ran 1 of 178 specs
1 spec, 2 failures
Finished in 0.011 seconds
file:///Users/alexrudenko/src/pdf.js/test/webserver.mjs:63
this.server.close(callback);
^
TypeError: Cannot read properties of null (reading 'close')
at WebServer.stop (file:///Users/alexrudenko/src/pdf.js/test/webserver.mjs:63:17)
at stopServer (file:///Users/alexrudenko/src/pdf.js/test/test.mjs:1017:10)
at file:///Users/alexrudenko/src/pdf.js/test/test.mjs:766:5
at closeSession (file:///Users/alexrudenko/src/pdf.js/test/test.mjs:1040:28)
at async Promise.all (index 1)
at async startIntegrationTest (file:///Users/alexrudenko/src/pdf.js/test/test.mjs:821:3)
Node.js v20.10.0
[12:11:09] The following tasks did not complete: integrationtest, runIntegrationTest
[12:11:09] Did you forget to signal async completion?
file:///Users/alexrudenko/src/pdf.js/gulpfile.mjs:731
throw new Error(`Running ${testsName} tests failed.`);
^
Error: Running integration tests failed.
at ChildProcess.<anonymous> (file:///Users/alexrudenko/src/pdf.js/gulpfile.mjs:731:15)
at ChildProcess.emit (node:events:514:28)
at ChildProcess.emit (node:domain:551:15)
at maybeClose (node:internal/child_process:1105:16)
at ChildProcess._handle.onexit (node:internal/child_process:305:5)
at Process.callbackTrampoline (node:internal/async_hooks:130:17)
I get the above error. Could you perhaps share the Puppeteer logs for that test run?
This test should normally already be fixed by https://github.com/mozilla/pdf.js/pull/17962
Right, so the test passes for me on that branch. Is the OOM the only remaining issue?
I was able to run all integration tests locally on a Mac and I didn't see any extreme memory usage. I got two test failures, but they looked like wrong selectors? I would recommend setting protocolTimeout to a lower value like 10 seconds to catch any timeouts earlier if you do not expect to have long-running calls.
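For reference, protocolTimeout is a regular Puppeteer launch option; a minimal sketch of lowering it, assuming the tests launch the browsers roughly like this (not the actual test.mjs code):

import puppeteer from "puppeteer";

// Hypothetical launch call: protocolTimeout is in milliseconds, so 10 seconds here
// instead of Puppeteer's much larger default, which makes hanging protocol calls
// fail fast while debugging.
const browser = await puppeteer.launch({
  protocolTimeout: 10_000,
});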
Right, so the test passes for me on that branch. Is the OOM the only remaining issue?
Yes, the last tries with the PR above gave us this OOM crash on the Linux bot only. Yesterday we got:
file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:69
this._reject(callback, new TargetCloseError('Target closed'));
^
TargetCloseError: Protocol error (Runtime.callFunctionOn): Target closed
at CallbackRegistry.clear (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:69:36)
at CdpCDPSession._onClosed (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/CDPSession.js:98:25)
at Connection.onMessage (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Connection.js:118:25)
at WebSocket.<anonymous> (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/node/NodeWebSocketTransport.js:36:32)
at callListener (C:\pdfjs\botio-files-pdfjs\private\92a94d4274292a5\node_modules\ws\lib\event-target.js:290:14)
at WebSocket.onMessage (C:\pdfjs\botio-files-pdfjs\private\92a94d4274292a5\node_modules\ws\lib\event-target.js:209:9)
at WebSocket.emit (node:events:513:28)
at Receiver.receiverOnMessage (C:\pdfjs\botio-files-pdfjs\private\92a94d4274292a5\node_modules\ws\lib\websocket.js:1209:20)
at Receiver.emit (node:events:513:28)
at C:\pdfjs\botio-files-pdfjs\private\92a94d4274292a5\node_modules\ws\lib\receiver.js:608:16 {
cause: ProtocolError
at <instance_members_initializer> (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:89:14)
at new Callback (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:93:16)
at CallbackRegistry.create (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:19:26)
at Connection._rawSend (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Connection.js:77:26)
at CdpCDPSession.send (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/CDPSession.js:63:33)
at CDPClientAdapter.sendCommand (file:///C:/pdfjs/botio-files-pdfjs/private/92a94d4274292a5/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/BidiOverCdp.js:106:39)
at WorkerRealm.serializeCdpObject (C:\pdfjs\botio-files-pdfjs\private\92a94d4274292a5\node_modules\chromium-bidi\lib\cjs\bidiMapper\modules\script\Realm.js:152:47)
at C:\pdfjs\botio-files-pdfjs\private\92a94d4274292a5\node_modules\chromium-bidi\lib\cjs\bidiMapper\modules\log\LogManager.js:64:38
at Array.map (<anonymous>)
at C:\pdfjs\botio-files-pdfjs\private\92a94d4274292a5\node_modules\chromium-bidi\lib\cjs\bidiMapper\modules\log\LogManager.js:63:45
}
in the middle of the unit tests on the Windows 10 bot. As for the timeouts, they are probably our fault and can be fixed later.
Could you try it with the canary build? (Install it with npx puppeteer browsers install chrome@canary and provide the executable path.)
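If it helps, a minimal sketch of pointing Puppeteer at that canary binary (the path is whatever the install command prints; this is not the actual test.mjs code):

import puppeteer from "puppeteer";

// Hypothetical: point Puppeteer at the Canary binary installed by
// "npx puppeteer browsers install chrome@canary" instead of the pinned build.
const browser = await puppeteer.launch({
  executablePath: "/path/printed/by/the/install/command", // placeholder path
});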
Same OOM crash with canary.
Which node version do you use?
18.16.0
Weird, I am trying on Linux with the same version and the Node memory usage is low and stable (running both the Chrome and Firefox tests using the chrome_bidi branch). I only get a couple of failed tests, same as on Mac:
Failures:
1) FreeText Editor FreeText with popup must not remove an empty annotation
Message:
ProtocolError: script.callFunction timed out. Increase the 'protocolTimeout' setting in launch/connect calls for a higher timeout if needed.
Stack:
at <instance_members_initializer> (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:89:14)
at new Callback (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:93:16)
at CallbackRegistry.create (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:19:26)
at BidiConnection.send (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/Connection.js:51:32)
at Session.send (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/core/Session.js:172:42)
at Session.<anonymous> (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/util/decorators.js:94:27)
at WindowRealm.callFunction (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/core/Realm.js:92:51)
at WindowRealm.<anonymous> (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/util/decorators.js:94:27)
at #evaluate (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/Realm.js:126:42)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
2) FreeText Editor Freetext must stay focused after having been moved must keep the focus
Message:
ProtocolError: Waiting for selector `#pdfjs_internal_editor_0` failed: Protocol error (script.callFunction): unknown error Runtime.callFunctionOn timed out. Increase the 'protocolTimeout' setting in launch/connect calls for a higher timeout if needed. ProtocolError: Runtime.callFunctionOn timed out. Increase the 'protocolTimeout' setting in launch/connect calls for a higher timeout if needed.
at <instance_members_initializer> (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:89:14)
at new Callback (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:93:16)
at CallbackRegistry.create (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:19:26)
at Connection._rawSend (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/Connection.js:77:26)
at CdpCDPSession.send (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/cdp/CDPSession.js:63:33)
at CDPClientAdapter.sendCommand (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/BidiOverCdp.js:106:39)
at WindowRealm.callFunction (/usr/local/google/home/alexrudenko/pdf.js/node_modules/chromium-bidi/lib/cjs/bidiMapper/modules/script/Realm.js:243:58)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
at async WindowRealm.callFunction (/usr/local/google/home/alexrudenko/pdf.js/node_modules/chromium-bidi/lib/cjs/bidiMapper/modules/script/WindowRealm.js:136:16)
at async ScriptProcessor.callFunction (/usr/local/google/home/alexrudenko/pdf.js/node_modules/chromium-bidi/lib/cjs/bidiMapper/modules/script/ScriptProcessor.js:59:16)
Stack:
at <instance_members_initializer> (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:89:14)
at new Callback (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:93:16)
at CallbackRegistry.create (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/common/CallbackRegistry.js:19:26)
at BidiConnection.send (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/Connection.js:51:32)
at Session.send (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/core/Session.js:172:42)
at Session.<anonymous> (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/util/decorators.js:94:27)
at WindowRealm.callFunction (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/core/Realm.js:92:51)
at WindowRealm.<anonymous> (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/util/decorators.js:94:27)
at #evaluate (file:///usr/local/google/home/alexrudenko/pdf.js/node_modules/puppeteer-core/lib/esm/puppeteer/bidi/Realm.js:126:42)
at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
178 specs, 2 failures
Finished in 494.777 seconds
Run 178 tests
OHNOES! Some integration tests failed!
2 of 178 failed
integration tests runtime was 503.6 seconds
Actually, after I ran it one more time and left it unattended without moving the mouse, all tests passed.
I wanted to try on a Windows 10 machine but for me web.archive.org times out. Is there another way to fetch the test files?
Never mind, it was a network issue that I managed to solve.
So I ran it on Windows 10 and I do not see excessive memory use. Node peaks at 400 MB and drops to 200 MB during the test run.
188 specs, 0 failures
Finished in 270.978 seconds
Run 188 tests
All integration tests passed.
integration tests runtime was 277.9 seconds
I have managed to reproduce the issue locally with Arch Linux, Node 20.14.0 LTS (same version as the bots) and running npx gulp botmakeref. The integration tests work fine, but just as on the bots it's the browser tests where it fails. Interestingly, the OOM killer already kicks in for Chrome without WebDriver BiDi, i.e. simply with the current CDP configuration on master. The first PDFs work fine, but after around 100 PDFs memory usage suddenly spikes from around 500 MB to 2 GB and it doesn't get freed anymore.
Even more interesting is that Firefox also peaks to around 2 GB, but that actually gets released while in Chrome it doesn't get released anymore, so it does look like there is some sort of memory leak in either Chrome itself or in our test harness.
This is the spike that happens in both browsers ("Geheugen" = "Memory" here):
This is afterwards, where it's visible that Firefox ("Web Content" process) is restored but Chrome stays high in memory until the OOM killer shuts it down eventually:
I'm not really sure what to make of this yet, or how to debug it further. It's at least an indication that this problem is not directly related to WebDriver BiDi, but it does somehow make things worse on the bots. (I do recall having had OOM issues with Chrome on Linux before, but I never really looked into it until now, so this might not be a new issue but rather one that now surfaces because of WebDriver BiDi somehow.)
@timvandermeij interesting, I am trying to run it now and it looks like it runs in headless mode: is there a way to make it headful, or perhaps to double-check that tabs are getting closed properly? Perhaps some tabs remain open for longer than needed, maybe blocked by some alerts or beforeunload handlers or something like that?
The comment https://github.com/mozilla/pdf.js/pull/17962#issuecomment-2063780941 looked like the OOM happened in the Node process, which was not able to get more than 4 GB of memory.
Good catch; I'm not sure what to make of that yet because Node.js itself shouldn't be doing much other than starting browsers through Puppeteer and kicking off the reference testing framework in said browsers. Locally I think it was in fact Chrome being killed by the OOM killer and not the Node.js process, given the following dmesg output:
[ 3463.243540] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service/app.slice/app-org.kde.konsole-a16b62a0fe444e609eaa56b9e3c418ce.scope,task=chrome,pid=29291,uid=1000
[ 3463.243612] Out of memory: Killed process 29291 (chrome) total-vm:1188349660kB, anon-rss:412752kB, file-rss:0kB, shmem-rss:57804kB, UID:1000 pgtables:3064kB oom_score_adj:300
[ 3712.746608] nouveau 0000:01:00.0: fifo: fault 00 [READ] at 00000000002de000 engine 00 [PGRAPH] client 10 [HUB/] reason 02 [PAGE_NOT_PRESENT] on channel 17 [003f600000 chrome[30321]]
[ 3712.746619] nouveau 0000:01:00.0: fifo:000000:0011:[chrome[30321]] rc scheduled
[ 3712.746622] nouveau 0000:01:00.0: fifo:000000: rc scheduled
[ 3712.746641] nouveau 0000:01:00.0: fifo:000000:0011:0011:[chrome[30321]] errored - disabling channel
However, I'm not sure if either one might be a red herring, because I don't think it's guaranteed that the OOM killer kills the process that is using the most memory: that can be configured, so it could be that on the bots Chrome also used the most memory but the OOM killer decided to kill another process first.
it looks like it runs in the headless mode: is there a way to make it headful
The reference tests run differently compared to the integration tests. The integration tests work like one would expect from a Puppeteer point of view: we start two browsers and for each test open/close tabs. However for the reference tests we use Puppeteer only to spawn the browsers for us, but they always have only one tab that points to our reference test framework. That framework in turn loads each PDF file, compares/takes snapshots and closes said PDF files again. It doesn't do that by opening new tabs to load the PDF, but instead by creating a canvas, rendering the PDF onto it, writing the canvas to a file and destroying the canvas. That explains why, unlike the integration tests, we don't have to open/close tabs and there is therefore pretty much no usage of e.g. Puppeteer APIs.
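A rough sketch of that render-and-snapshot cycle, using the public pdf.js API for illustration (the real logic lives in test/driver.js and differs in the details):

import * as pdfjsLib from "pdfjs-dist";

// Illustrative only: render one page onto a throwaway canvas and return a PNG
// data URL, without ever opening a new tab.
async function snapshotPage(url, pageNumber) {
  const pdf = await pdfjsLib.getDocument(url).promise;
  const page = await pdf.getPage(pageNumber);
  const viewport = page.getViewport({ scale: 1 });

  const canvas = document.createElement("canvas");
  canvas.width = viewport.width;
  canvas.height = viewport.height;
  await page.render({ canvasContext: canvas.getContext("2d"), viewport }).promise;

  const dataUrl = canvas.toDataURL("image/png");
  // Destroy the canvas right away so the rendered bitmap can be reclaimed.
  canvas.width = canvas.height = 0;
  return dataUrl;
}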
Now that I'm writing this it makes it even more interesting that the reference tests are where it fails on the bots and locally, so that could be an indication it's somehow caused by the reference testing framework. We should be cleaning up immediately after snapshotting, and a quick look at the code here and there seems to confirm that, but we should probably look into that more to make sure there is not a defect in there.
Note that for the reference tests we currently only run Chrome on Linux, because quite some time ago we had to disable Chrome on Windows for the reference tests (in #14509) due to excessive run times. I have seen that, before the OOM killer kicks in, Chrome locally also slows down quite significantly, so maybe that slowdown was also a sign of the issue we experience here.
I can't link any of these events for sure and therefore can't rule anything out yet, but at least it looks like there may be a relation here. I have some ideas already from looking at the code, and I'll try to think of a good way to debug this further so we can hopefully at least figure out where the memory leak is coming from (our own code, Node.js, Chrome, Puppeteer, etc.). You did give me some new ideas/pointers so far, so thank you for that!
I have created a reduced version of the test framework/manifest to reproduce the issue much more easily. Steps to reproduce:

1. Run npx gulp botmakeref.
2. Change webDriverBiDi to cdp at https://github.com/mozilla/pdf.js/blob/master/test/test.mjs#L886 (see the sketch below).
3. Run npx gulp botmakeref again.

I'll check if I can reduce it even further, but in the meantime could you check if you can reproduce the issue at least with these steps? The only variable here is the change from CDP to WebDriver BiDi, so that at least shows an unexpected behavior change (even though the root cause might be elsewhere, such as a leak in PDF.js or the test framework, but even then CDP is able to somehow cope with it...).
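For context, that cdp/webDriverBiDi switch corresponds to Puppeteer's protocol launch option; a minimal sketch, assuming the launch happens roughly like this (not the actual test.mjs code):

import puppeteer from "puppeteer";

// Hypothetical: the same launch, with the protocol as the only variable.
const browser = await puppeteer.launch({
  protocol: "webDriverBiDi", // or "cdp"
});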
@timvandermeij is the high memory usage coming from the browser or from the Puppeteer process? For Chrome, the WebDriver BiDi implementation runs in the Puppeteer process.
I think it's the Puppeteer process: the chrome process remains relatively stable in terms of memory usage, but the node process has constantly increased its memory usage by hundreds of MBs since the start of the run. This would confirm what we spotted in https://github.com/mozilla/pdf.js/issues/17961#issuecomment-2152662126.
I seem to be able to reproduce the increases in the node process. It peaked at around 2.5 GB for me (Intel Mac). I am going to capture some memory snapshots.
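One straightforward way to capture such snapshots from the Node side (just a sketch; possibly not how they were actually captured) is Node's built-in v8 module:

import v8 from "node:v8";

// Writes a .heapsnapshot file into the current working directory and returns its
// name; open it in Chrome DevTools' Memory tab to see what keeps the data alive.
const file = v8.writeHeapSnapshot();
console.log(`Heap snapshot written to ${file}`);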
Awesome; thank you! In the meantime I'll try to further reduce the reproducer, but fortunately we now at least have a somewhat reduced reproducer and we know it's most likely not the browser but rather something in the Node process, and that feels like progress on this issue :-)
Looks like the issue is that PDF files (or something rather large?) are retained in WebDriver BiDi's network storage.
I think some of the things are also retained by the runner but they seem to be eventually released.
Right, those are most likely the snapshots. The browser loads each page of the PDF file, draws it on a canvas, takes a snapshot of the canvas and sends it from the browser to the test runner process (which can then e.g. save it to disk). This is done using network requests at https://github.com/mozilla/pdf.js/blob/master/test/driver.js#L1033-L1034, and the toDataURL call there produces such PNG image blobs.
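So each snapshot roughly takes a round trip like this, where dataUrl is the result of that toDataURL call (the endpoint name and payload shape are made up for illustration; the real code is at the driver.js lines linked above):

// Illustrative only: send the PNG data URL produced by canvas.toDataURL() to the
// test runner process over HTTP so it can be written to disk there.
await fetch("/submit_task_results", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ snapshot: dataUrl }),
});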
I did try setting dataUrl = null in the this._sendResult.then() handler in the hope of freeing the image data sooner, but that didn't change anything (though if the requests are indeed retained by WebDriver BiDi then that would also make sense, because the variable data would be wiped but not the request data).
In the reproducer this method is called 2500 times, and if each image is around 600 KB as indicated in the screenshot then that'd amount to around 1.5 GB (however some of it eventually seems to be freed indeed, which probably makes it a bit less in practice and that's what we have seen).
Version 22.10.1 contains the fix for the network request storage. Could you please trigger the tests using that version?
Wow, that's fast; thank you for the quick action! I'll try it out locally and prepare a patch to update our version of Puppeteer, and if that's successful we can retrigger the tests for the Chrome BiDi PR.
I have confirmed locally that the memory leak indeed no longer happens: the node process now stays at around 100 MB and doesn't grow anymore with the reproducer, so that's great :tada:
I have created https://github.com/mozilla/pdf.js/pull/18231 to update Puppeteer and once that lands we can rebase the Chrome BiDi PR and retrigger the tests.
Good news: in #17962 the Linux bot is happy now, so no more OOM, and the runtime is pretty much the same as with CDP.
However, Windows is a different story: while the OOM problems are solved, it looks like WebDriver BiDi in combination with Chrome on Windows is quite a bit slower, causing the tests to eventually hit the 1-hour timeout mark. While they do eventually complete, we have also seen some strange logs in the output. We'll look into that further, and it might be a general Windows bot issue because the Windows bot has always been slower than the Linux bot already, but in any case: thanks so far @OrKoN for fixing the OOM issue!
@timvandermeij do you also experience the slowness locally?
Not on Linux, but I don't have a Windows machine to try it out on at the moment. We're currently testing it on the Windows bot to see if it can be reproduced and to exclude this being an issue with the bot itself somehow, and if that doesn't lead to insights then we can probably spin up a Windows VM or so to try to reproduce it.
Can you see which tests are slow? As I understand it, the bot runs multiple kinds of tests in Chrome and Firefox.
I have a Windows machine; I'll test it locally. That said, on the Windows bot I noticed a leftover Chrome for Testing process in the task manager taking something like 40% of the CPU, so I killed it and re-ran the tests to see if it's still there after the tests end (maybe it was there before our last run and it slowed down the machine enough to make the tests time out).
Yes, that sounds likely, and looking at https://github.com/mozilla/pdf.js/pull/17962#issuecomment-2161290606 the tests now passed, within the expected runtime, on Windows too. It might be that the left-over process was from one of the earlier runs. In any case, let's retry it a couple of times and check locally indeed, and hopefully it will keep working :-)
The last try failed on the Windows bot and I have the impression it was caused by a left-over spinning Chrome process, which caused subsequent failures because it uses too much CPU. Here's a screenshot showing how this process was started. I took a dump and, thanks to windbg, I got this stack:
FAULTING_THREAD: 00000bb8
PROCESS_NAME: chrome.exe
ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION} Breakpoint A breakpoint has been reached.
EXCEPTION_CODE_STR: 80000003
STACK_TEXT:
000000b4`f17fdb68 00007ff9`77c7e7d3 : 000019bc`04831520 00000233`b937a728 000019bc`0013da30 00007ff9`aa2506d7 : chrome!sqlite3_dbdata_init+0xf6ba00
000000b4`f17fdb70 00007ff9`77c7dc24 : 00000000`00000000 00000863`7bd05af7 00007ff9`79e2c130 000000b4`f17fded0 : chrome!ChromeMain+0xd7cd73
000000b4`f17fddf0 00007ff9`77c7d8ec : aaaaaaaa`aaaaaaaa 000019bc`000fc000 000000b4`f17fe158 000019bc`00487e30 : chrome!ChromeMain+0xd7c1c4
000000b4`f17fdf00 00007ff9`77c7d433 : aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa : chrome!ChromeMain+0xd7be8c
000000b4`f17fdfb0 00007ff9`785e00c3 : 0000a849`ab9a7fa3 000019bc`0003c418 aaaaaaaa`aaaaaa00 0000a849`ab9a7f23 : chrome!ChromeMain+0xd7b9d3
000000b4`f17fe1d0 00007ff9`791268b0 : aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa : chrome!GetHandleVerifier+0x141823
000000b4`f17fe220 00007ff9`7a00da01 : 000019bc`0001d160 000019bc`0013dbe0 00000000`00002710 00007ff9`7958486f : chrome!IsSandboxedProcess+0x81f010
000000b4`f17fe9d0 00007ff9`79d7028a : 00000000`00000032 00000000`00000008 000019bc`00053788 7fffffff`fffffff8 : chrome!IsSandboxedProcess+0x1706161
000000b4`f17fea70 00007ff9`76aa5802 : 00000080`74726174 00000000`00000000 00000000`00000000 00000000`00000120 : chrome!IsSandboxedProcess+0x14689ea
000000b4`f17feaf0 00007ff9`76f2ad9b : 000000b4`f17fefc0 000019bc`00050008 00000000`00000048 00007ff9`7a3156d1 : chrome+0xad5802
000000b4`f17fec20 00007ff9`76f062b7 : 000000b4`f17fefc0 00004510`003890a0 00000000`00000000 00007ff9`78d6134a : chrome!ChromeMain+0x2933b
000000b4`f17fee20 00007ff9`76f04f04 : 00000000`00000000 00007ff9`ad1c9174 00007ff9`aab50000 00000000`00000000 : chrome!ChromeMain+0x4857
000000b4`f17fefa0 00007ff9`76f046ae : 000000b4`f17ff1c0 00000000`00000001 000000b4`f17ff380 7fffffff`fffffff7 : chrome!ChromeMain+0x34a4
000000b4`f17ff130 00007ff9`76f01cfb : 00007ff6`44db0000 00004510`002643c0 000000b4`f17ff650 00007ff6`44faf2c8 : chrome!ChromeMain+0x2c4e
000000b4`f17ff360 00007ff6`44db9424 : 00007ff6`44db0000 00007ff9`76f01a60 00000000`00000008 00006944`002a8300 : chrome!ChromeMain+0x29b
000000b4`f17ff600 00007ff6`44db7e08 : 00000000`00000001 000000b4`f17ffbd0 000000b4`f17ff920 000000b4`f17ff930 : chrome_exe+0x9424
000000b4`f17ff880 00007ff6`44ed9da2 : 00007ff6`44f9c090 00007ff6`44ed9e19 00000000`00000000 00000000`00000000 : chrome_exe+0x7e08
000000b4`f17ffc90 00007ff9`aa9d84d4 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : chrome_exe!IsSandboxedProcess+0x95242
000000b4`f17ffcd0 00007ff9`ad171791 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : kernel32!BaseThreadInitThunk+0x14
000000b4`f17ffd00 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : ntdll!RtlUserThreadStart+0x21
When I opened Process Explorer I was expecting to see the chrome process as a child of node.js or something like that. I can share the dump if someone knows how to extract something useful from it.
I recall there is an existing issue where the crashpad handler is launched on Windows unconditionally. I wonder if there was a crash during the test run? Also, I think the latest available version is 126, but it looks like the bot runs with 125.
Yep, probably something went wrong (see the end of http://54.193.163.58:8877/a53e51e62057105/output.txt), but I think in general Puppeteer shouldn't leave those processes alive (especially if they're using some CPU). That said, is there a way to send the crash data and get access to the crash report (which could maybe help us figure out what's wrong)? When we have a crash with Firefox, we send the data and we get a report, for example: https://crash-stats.mozilla.org/report/index/a61a50b5-ba99-4740-b515-3fefc0240614. As far as I know we never had this kind of issue with CDP, but maybe I'm wrong; @timvandermeij do you remember something similar?
Crash reporting is disabled for Chrome for Testing builds. You could enable crash reporting in regular Chrome (it requires manually ticking the checkbox when the profile is created, or via the browser settings). It looks like in this case the Node process is killed on an uncaught exception and so it does not close the browser properly. It might be another bug in chromium-bidi. Apart from crash reporting, setting dumpio: true should make the crashes visible in the stdout, if there are any.
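For completeness, dumpio is a regular Puppeteer launch option; a minimal sketch of turning it on (not the actual test harness code):

import puppeteer from "puppeteer";

// Hypothetical: pipe the browser's stdout/stderr into the Node process's output
// so browser crashes become visible in the test logs.
const browser = await puppeteer.launch({
  dumpio: true,
});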
As far as I know we never had this kind of issue with CDP but maybe I'm wrong, @timvandermeij do you remember something similar ?
I don't recall having seen this particular issue with a left-over (active) browser process for CDP.
I'm reopening this because we had to revert to CDP for Chrome in #18506 because of sudden timeout/OOM issues during the browser tests on the Linux bot that we didn't see anymore after disabling WebDriver BiDi.
I have reconstructed the following series of events that led to this: a browser test run failed with TEST-UNEXPECTED-FAIL | test failed chrome has not responded in 120s. The second try worked.

Note that using WebDriver BiDi for Firefox and CDP for Chrome also matches the upcoming default behavior in Puppeteer 23; see https://github.com/puppeteer/puppeteer/issues/12728 and https://github.com/puppeteer/puppeteer/pull/12732/files#diff-8b40d1cbf56e2b8f3e21919bcd5f815060634e9b604a5f026f452633ac2e2f3bR96. Given that upstream doesn't enable WebDriver BiDi by default for Chrome yet, but does for Firefox, it could perhaps be an indication that it's not fully ready yet.
Now that the Linux bot is stable again and the dependency updates are done we should find out why WebDriver BiDi suddenly caused issues for Chrome after two weeks without problems. It's important to note that we did try running the browser tests with the Puppeteer version we were already using before the issues occurred, which should use the same Chrome version because it's pinned by Puppeteer (in contrast to Firefox for which the latest version is fetched), but that didn't change anything. Given that reverting the Puppeteer update didn't change anything we're reasonably sure it's not a regression in Puppeteer, but it could be a regression in a browser binary because we had to purge the cache.
Note that we do not turn on WebDriver BiDi by default for Chrome because CDP currently offers more Chrome-specific features that many users rely on. WebDriver BiDi for Chrome is considered ready within the scope of the functionality available via WebDriver BiDi.
It looks like the main problem is not the memory leak in chromium-bidi or Puppeteer but leaking browser processes. I think we do a very similar thing for WebDriver BiDi as we do for CDP when we close the browser, so we would need to narrow the issue down. Is there a way for us to reproduce the issue on a smaller scale? As a workaround, we could try to identify browser processes after the test run and report/kill them to make sure the next runs do not fail with OOM.
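A rough sketch of that kind of post-run cleanup (Linux-only, and the process-name pattern is a guess; this is not existing pdf.js code):

import { execSync } from "node:child_process";

// Hypothetical helper: report and kill any browser processes that survived the
// test run so they cannot starve the next run.
function killLeftoverChrome() {
  let pids = [];
  try {
    pids = execSync("pgrep -f 'Chrome for Testing|chrome-for-testing'")
      .toString()
      .trim()
      .split("\n")
      .filter(Boolean);
  } catch {
    return; // pgrep exits with a non-zero status when nothing matches
  }
  for (const pid of pids) {
    console.warn(`Killing leftover browser process ${pid}`);
    process.kill(Number(pid), "SIGKILL");
  }
}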
I looked at the code and noticed that if an error happens here, https://github.com/mozilla/pdf.js/blob/master/test/test.mjs#L962, it does not look like the browser would be closed (although this is probably not the cause).
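Something along these lines would make sure the browser is closed even if that call throws; just a sketch (the runTests name is a stand-in, not the actual structure of test.mjs):

import puppeteer from "puppeteer";

// Hypothetical guard: whatever happens while the tests run, always close the
// browser so no leftover processes survive an error.
async function withBrowser(runTests) {
  const browser = await puppeteer.launch();
  try {
    await runTests(browser);
  } finally {
    await browser.close();
  }
}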
I have been looking at the integration test code and noticed the following code: https://github.com/mozilla/pdf.js/blob/master/test/test.mjs#L908-L910
Some time ago I tried using WebDriver BiDi for Chrome too and I recall it failed pretty soon back then. However, I tried again today with the current Puppeteer version and the results are much better now: only one test failure remains! Hence I figured now might be a good time to re-investigate WebDriver BiDi compatibility for Chrome, since it would be great if we could move away from CDP for all browsers we support.
I tried debugging this test, FreeText Editor Freetext must stay focused after having been moved must keep the focus, but I can't figure out why this works with CDP and fails with WebDriver BiDi. The traceback of the failing test is:

Interestingly, I found that the test passes with WebDriver BiDi if I comment out this: https://github.com/mozilla/pdf.js/blob/7290faf840ab5a3828d12bbbb460bd8ad49f6ffa/test/integration/freetext_editor_spec.mjs#L2362-L2366
I have enabled verbose logging using the Puppeteer debugging instructions from https://github.com/puppeteer/puppeteer/blob/ef8c4c808aa9dddd4d2501c8fc160cfcd0c4b9d1/docs/guides/debugging.md?plain=1#L131 but I can't really interpret the output because I don't know much about the WebDriver BiDi protocol and Puppeteer internals.
To reproduce the issue:

1. Run npm install.
2. Change cdp to webDriverBiDi in https://github.com/mozilla/pdf.js/blob/7290faf840ab5a3828d12bbbb460bd8ad49f6ffa/test/test.mjs#L881.
3. Change it to fit in https://github.com/mozilla/pdf.js/blob/7290faf840ab5a3828d12bbbb460bd8ad49f6ffa/test/integration/freetext_editor_spec.mjs#L2315 to only run this test.
4. Run npx gulp integrationtest and notice that the test hangs. If the protocol is changed back to CDP the test passes. The verbose debugging logging can be obtained with env DEBUG="puppeteer:*" npx gulp integrationtest, for which it helps to change https://github.com/mozilla/pdf.js/blob/master/test/test.mjs#L961 to only run Chrome (otherwise Firefox also spams the logs).