phetsims / aqua

Automatic QUality Assurance
MIT License
2 stars 4 forks source link

CT: Error: Failed to launch the browser process! #150

Closed zepumph closed 1 year ago

zepumph commented 2 years ago

I think this has something to do with bayes and the number of puppeteer instances on it, but I'm not really sure. I'll have to take a look.

This looks to be exactly the same as https://github.com/puppeteer/puppeteer/issues/6757 (except docker). I'll see if there is something I need to do better about closing it down.

``` arithmetic : build Build failed with status code 1: Running "report-media" task Running "clean" task Running "build" task >> TypeScript compilation complete: 24665ms Building runnable repository (arithmetic, brands: phet, phet-io) Building brand: phet >> Webpack build complete: 6747ms >> Production minification complete: 35456ms (2241851 bytes) >> Debug minification complete: 0ms (20575992 bytes) Building brand: phet-io >> Webpack build complete: 5473ms >> Production minification complete: 30259ms (2275428 bytes) >> Debug minification complete: 29481ms (2586396 bytes) No migration rules found at ../chipper/dist/js/phet-io-sim-specific/repos/arithmetic/js/arithmetic-migration-rules.js, no rules to be bundled with phet-io.js. No client guides found at ../phet-io-sim-specific/repos/arithmetic/client-guide/, no guides being built. >> TypeScript compilation complete: 6355ms Fatal error: Perennial task failed: Error: Failed to launch the browser process! [0624/084852.594901:FATAL:zygote_host_impl_linux.cc(174)] Check failed: process.IsValid(). Failed to launch zygote process #0 0x558cadb82469 base::debug::CollectStackTrace() #1 0x558cadae8f33 base::debug::StackTrace::StackTrace() #2 0x558cadafbd50 logging::LogMessage::~LogMessage() #3 0x558cadafc90e logging::LogMessage::~LogMessage() #4 0x558cabbbcb08 content::ZygoteHostImpl::LaunchZygote() #5 0x558cad696a30 content::(anonymous namespace)::LaunchZygoteHelper() #6 0x558cab21b6cb content::ZygoteCommunication::Init() #7 0x558cab21bcd4 content::CreateGenericZygote() #8 0x558cad695e95 content::ContentMainRunnerImpl::Initialize() #9 0x558cad693eb9 content::RunContentProcess() #10 0x558cad69400e content::ContentMain() #11 0x558cad6eefca headless::(anonymous namespace)::RunContentMain() #12 0x558cad6eecd5 headless::HeadlessShellMain() #13 0x558caa2e0668 ChromeMain #14 0x7f55b3e79555 __libc_start_main #15 0x558caa2e04aa _start Received signal 6 #0 0x558cadb82469 base::debug::CollectStackTrace() #1 0x558cadae8f33 base::debug::StackTrace::StackTrace() #2 0x558cadb81f71 base::debug::(anonymous namespace)::StackDumpSignalHandler() #3 0x7f55b85f6630 (/usr/lib64/libpthread-2.17.so+0xf62f) #4 0x7f55b3e8d387 __GI_raise #5 0x7f55b3e8ea78 __GI_abort #6 0x558cadb811f5 base::debug::BreakDebuggerAsyncSafe() #7 0x558cadafc221 logging::LogMessage::~LogMessage() #8 0x558cadafc90e logging::LogMessage::~LogMessage() #9 0x558cabbbcb08 content::ZygoteHostImpl::LaunchZygote() #10 0x558cad696a30 content::(anonymous namespace)::LaunchZygoteHelper() #11 0x558cab21b6cb content::ZygoteCommunication::Init() #12 0x558cab21bcd4 content::CreateGenericZygote() #13 0x558cad695e95 content::ContentMainRunnerImpl::Initialize() #14 0x558cad693eb9 content::RunContentProcess() #15 0x558cad69400e content::ContentMain() #16 0x558cad6eefca headless::(anonymous namespace)::RunContentMain() #17 0x558cad6eecd5 headless::HeadlessShellMain() #18 0x558caa2e0668 ChromeMain #19 0x7f55b3e79555 __libc_start_main #20 0x558caa2e04aa _start r8: 0000000000000000 r9: 0000000000000300 r10: 0000000000000008 r11: 0000000000000206 r12: 000039d800300140 r13: 000039d800300158 r14: 000039d800300150 r15: 000039d800221400 di: 0000000000004fa6 si: 0000000000004fa6 bp: 00007ffc4a797e90 bx: 00007ffc4a7986d0 dx: 0000000000000006 ax: 0000000000000000 cx: ffffffffffffffff sp: 00007ffc4a797d58 ip: 00007f55b3e8d387 efl: 0000000000000206 cgf: aaaa000000000033 erf: 0000000000000000 trp: 0000000000000000 msk: 0000000000000000 cr2: 0000000000000000 [end of stack trace] TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md at onClose (/data/share/phet/continuous-testing/ct-snapshots/1656077734615/chipper/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:241:20) at ChildProcess. (/data/share/phet/continuous-testing/ct-snapshots/1656077734615/chipper/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:232:79) at ChildProcess.emit (events.js:327:22) at Process.ChildProcess._handle.onexit (internal/child_process.js:277:12) Full Error details: Error: Failed to launch the browser process! [0624/084852.594901:FATAL:zygote_host_impl_linux.cc(174)] Check failed: process.IsValid(). Failed to launch zygote process #0 0x558cadb82469 base::debug::CollectStackTrace() #1 0x558cadae8f33 base::debug::StackTrace::StackTrace() #2 0x558cadafbd50 logging::LogMessage::~LogMessage() #3 0x558cadafc90e logging::LogMessage::~LogMessage() #4 0x558cabbbcb08 content::ZygoteHostImpl::LaunchZygote() #5 0x558cad696a30 content::(anonymous namespace)::LaunchZygoteHelper() #6 0x558cab21b6cb content::ZygoteCommunication::Init() #7 0x558cab21bcd4 content::CreateGenericZygote() #8 0x558cad695e95 content::ContentMainRunnerImpl::Initialize() #9 0x558cad693eb9 content::RunContentProcess() #10 0x558cad69400e content::ContentMain() #11 0x558cad6eefca headless::(anonymous namespace)::RunContentMain() #12 0x558cad6eecd5 headless::HeadlessShellMain() #13 0x558caa2e0668 ChromeMain #14 0x7f55b3e79555 __libc_start_main #15 0x558caa2e04aa _start Received signal 6 #0 0x558cadb82469 base::debug::CollectStackTrace() #1 0x558cadae8f33 base::debug::StackTrace::StackTrace() #2 0x558cadb81f71 base::debug::(anonymous namespace)::StackDumpSignalHandler() #3 0x7f55b85f6630 (/usr/lib64/libpthread-2.17.so+0xf62f) #4 0x7f55b3e8d387 __GI_raise #5 0x7f55b3e8ea78 __GI_abort #6 0x558cadb811f5 base::debug::BreakDebuggerAsyncSafe() #7 0x558cadafc221 logging::LogMessage::~LogMessage() #8 0x558cadafc90e logging::LogMessage::~LogMessage() #9 0x558cabbbcb08 content::ZygoteHostImpl::LaunchZygote() #10 0x558cad696a30 content::(anonymous namespace)::LaunchZygoteHelper() #11 0x558cab21b6cb content::ZygoteCommunication::Init() #12 0x558cab21bcd4 content::CreateGenericZygote() #13 0x558cad695e95 content::ContentMainRunnerImpl::Initialize() #14 0x558cad693eb9 content::RunContentProcess() #15 0x558cad69400e content::ContentMain() #16 0x558cad6eefca headless::(anonymous namespace)::RunContentMain() #17 0x558cad6eecd5 headless::HeadlessShellMain() #18 0x558caa2e0668 ChromeMain #19 0x7f55b3e79555 __libc_start_main #20 0x558caa2e04aa _start r8: 0000000000000000 r9: 0000000000000300 r10: 0000000000000008 r11: 0000000000000206 r12: 000039d800300140 r13: 000039d800300158 r14: 000039d800300150 r15: 000039d800221400 di: 0000000000004fa6 si: 0000000000004fa6 bp: 00007ffc4a797e90 bx: 00007ffc4a7986d0 dx: 0000000000000006 ax: 0000000000000000 cx: ffffffffffffffff sp: 00007ffc4a797d58 ip: 00007f55b3e8d387 efl: 0000000000000206 cgf: aaaa000000000033 erf: 0000000000000000 trp: 0000000000000000 msk: 0000000000000000 cr2: 0000000000000000 [end of stack trace] TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md  Snapshot from 6/24/2022, 7:35:34 AM ```
zepumph commented 2 years ago

Likely https://github.com/puppeteer/puppeteer/issues/1981 is a byproduct of this error.

zepumph commented 2 years ago

I was also able to reproduce this just by going onto bayes and building arithmetic --brands=phet-io in the continuous-client repos. It worked when I stopped all pm2 processes. Something with continuous-client is not working well over a long period of time.

zepumph commented 2 years ago

I think @mattpen and I got to the bottom of this. The new puppeteer client seems to be orphaning chrome instances when it stops or crashes, and so this morning, when all pm2 processes were stopped, we found 575 chrome instances still running. I will work on better cleanup/error handling for the puppeteer client.

Also noting that once orphaned chrome instances were killed off, I was able to successfully build arithmetic in the same while while all pm2 processes were running.

KatieWoe commented 2 years ago

A number of repos are showing similar errors this morning (6/29/22) Example:

``` arithmetic : build Build failed with status code 1: Running "report-media" task Running "clean" task Running "build" task >> TypeScript compilation complete: 21831ms Building runnable repository (arithmetic, brands: phet, phet-io) Building brand: phet >> Webpack build complete: 6363ms >> Production minification complete: 33441ms (2241784 bytes) >> Debug minification complete: 1ms (20575041 bytes) Building brand: phet-io >> Webpack build complete: 4835ms >> Production minification complete: 33991ms (2275351 bytes) >> Debug minification complete: 39640ms (2586496 bytes) No migration rules found at ../chipper/dist/js/phet-io-sim-specific/repos/arithmetic/js/arithmetic-migration-rules.js, no rules to be bundled with phet-io.js. No client guides found at ../phet-io-sim-specific/repos/arithmetic/client-guide/, no guides being built. >> TypeScript compilation complete: 5083ms Fatal error: Perennial task failed: Error: Failed to launch the browser process! [0629/065748.409687:FATAL:zygote_host_impl_linux.cc(174)] Check failed: process.IsValid(). Failed to launch zygote process #0 0x5589efb53469 base::debug::CollectStackTrace() #1 0x5589efab9f33 base::debug::StackTrace::StackTrace() #2 0x5589efaccd50 logging::LogMessage::~LogMessage() #3 0x5589efacd90e logging::LogMessage::~LogMessage() #4 0x5589edb8db08 content::ZygoteHostImpl::LaunchZygote() #5 0x5589ef667a30 content::(anonymous namespace)::LaunchZygoteHelper() #6 0x5589ed1ec6cb content::ZygoteCommunication::Init() #7 0x5589ed1eccd4 content::CreateGenericZygote() #8 0x5589ef666e95 content::ContentMainRunnerImpl::Initialize() #9 0x5589ef664eb9 content::RunContentProcess() #10 0x5589ef66500e content::ContentMain() #11 0x5589ef6bffca headless::(anonymous namespace)::RunContentMain() #12 0x5589ef6bfcd5 headless::HeadlessShellMain() #13 0x5589ec2b1668 ChromeMain #14 0x7fbd63230555 __libc_start_main #15 0x5589ec2b14aa _start Received signal 6 #0 0x5589efb53469 base::debug::CollectStackTrace() #1 0x5589efab9f33 base::debug::StackTrace::StackTrace() #2 0x5589efb52f71 base::debug::(anonymous namespace)::StackDumpSignalHandler() #3 0x7fbd679ad630 (/usr/lib64/libpthread-2.17.so+0xf62f) #4 0x7fbd63244387 __GI_raise #5 0x7fbd63245a78 __GI_abort #6 0x5589efb521f5 base::debug::BreakDebuggerAsyncSafe() #7 0x5589efacd221 logging::LogMessage::~LogMessage() #8 0x5589efacd90e logging::LogMessage::~LogMessage() #9 0x5589edb8db08 content::ZygoteHostImpl::LaunchZygote() #10 0x5589ef667a30 content::(anonymous namespace)::LaunchZygoteHelper() #11 0x5589ed1ec6cb content::ZygoteCommunication::Init() #12 0x5589ed1eccd4 content::CreateGenericZygote() #13 0x5589ef666e95 content::ContentMainRunnerImpl::Initialize() #14 0x5589ef664eb9 content::RunContentProcess() #15 0x5589ef66500e content::ContentMain() #16 0x5589ef6bffca headless::(anonymous namespace)::RunContentMain() #17 0x5589ef6bfcd5 headless::HeadlessShellMain() #18 0x5589ec2b1668 ChromeMain #19 0x7fbd63230555 __libc_start_main #20 0x5589ec2b14aa _start r8: 0000000000000000 r9: 0000000000000300 r10: 0000000000000008 r11: 0000000000000206 r12: 0000187600300140 r13: 0000187600300158 r14: 0000187600300150 r15: 0000187600221400 di: 000000000000975f si: 000000000000975f bp: 00007ffc57156180 bx: 00007ffc571569c0 dx: 0000000000000006 ax: 0000000000000000 cx: ffffffffffffffff sp: 00007ffc57156048 ip: 00007fbd63244387 efl: 0000000000000206 cgf: aaaa000000000033 erf: 0000000000000000 trp: 0000000000000000 msk: 0000000000000000 cr2: 0000000000000000 [end of stack trace] TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md at onClose (/data/share/phet/continuous-testing/ct-snapshots/1656506053578/chipper/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:241:20) at ChildProcess. (/data/share/phet/continuous-testing/ct-snapshots/1656506053578/chipper/node_modules/puppeteer/lib/cjs/puppeteer/node/BrowserRunner.js:232:79) at ChildProcess.emit (events.js:327:22) at Process.ChildProcess._handle.onexit (internal/child_process.js:277:12) Full Error details: Error: Failed to launch the browser process! [0629/065748.409687:FATAL:zygote_host_impl_linux.cc(174)] Check failed: process.IsValid(). Failed to launch zygote process #0 0x5589efb53469 base::debug::CollectStackTrace() #1 0x5589efab9f33 base::debug::StackTrace::StackTrace() #2 0x5589efaccd50 logging::LogMessage::~LogMessage() #3 0x5589efacd90e logging::LogMessage::~LogMessage() #4 0x5589edb8db08 content::ZygoteHostImpl::LaunchZygote() #5 0x5589ef667a30 content::(anonymous namespace)::LaunchZygoteHelper() #6 0x5589ed1ec6cb content::ZygoteCommunication::Init() #7 0x5589ed1eccd4 content::CreateGenericZygote() #8 0x5589ef666e95 content::ContentMainRunnerImpl::Initialize() #9 0x5589ef664eb9 content::RunContentProcess() #10 0x5589ef66500e content::ContentMain() #11 0x5589ef6bffca headless::(anonymous namespace)::RunContentMain() #12 0x5589ef6bfcd5 headless::HeadlessShellMain() #13 0x5589ec2b1668 ChromeMain #14 0x7fbd63230555 __libc_start_main #15 0x5589ec2b14aa _start Received signal 6 #0 0x5589efb53469 base::debug::CollectStackTrace() #1 0x5589efab9f33 base::debug::StackTrace::StackTrace() #2 0x5589efb52f71 base::debug::(anonymous namespace)::StackDumpSignalHandler() #3 0x7fbd679ad630 (/usr/lib64/libpthread-2.17.so+0xf62f) #4 0x7fbd63244387 __GI_raise #5 0x7fbd63245a78 __GI_abort #6 0x5589efb521f5 base::debug::BreakDebuggerAsyncSafe() #7 0x5589efacd221 logging::LogMessage::~LogMessage() #8 0x5589efacd90e logging::LogMessage::~LogMessage() #9 0x5589edb8db08 content::ZygoteHostImpl::LaunchZygote() #10 0x5589ef667a30 content::(anonymous namespace)::LaunchZygoteHelper() #11 0x5589ed1ec6cb content::ZygoteCommunication::Init() #12 0x5589ed1eccd4 content::CreateGenericZygote() #13 0x5589ef666e95 content::ContentMainRunnerImpl::Initialize() #14 0x5589ef664eb9 content::RunContentProcess() #15 0x5589ef66500e content::ContentMain() #16 0x5589ef6bffca headless::(anonymous namespace)::RunContentMain() #17 0x5589ef6bfcd5 headless::HeadlessShellMain() #18 0x5589ec2b1668 ChromeMain #19 0x7fbd63230555 __libc_start_main #20 0x5589ec2b14aa _start r8: 0000000000000000 r9: 0000000000000300 r10: 0000000000000008 r11: 0000000000000206 r12: 0000187600300140 r13: 0000187600300158 r14: 0000187600300150 r15: 0000187600221400 di: 000000000000975f si: 000000000000975f bp: 00007ffc57156180 bx: 00007ffc571569c0 dx: 0000000000000006 ax: 0000000000000000 cx: ffffffffffffffff sp: 00007ffc57156048 ip: 00007fbd63244387 efl: 0000000000000206 cgf: aaaa000000000033 erf: 0000000000000000 trp: 0000000000000000 msk: 0000000000000000 cr2: 0000000000000000 [end of stack trace] TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md  Snapshot from 6/29/2022, 6:34:13 AM ```
zepumph commented 2 years ago

This morning there were 5200 instances of chrome running. I should look into this! I was hoping to have time this afternoon

zepumph commented 2 years ago

https://stackoverflow.com/questions/62220867/puppeteer-chromium-instances-remain-active-in-the-background-after-browser-disc has something I'd like to try, instead of using zygote (an internal process forker), lets have things forked via node. Worth a shot!

zepumph commented 2 years ago

This didn't seem to change anything, After 2 hours, there are double the number of chrome instances running. I'll have to keep looking.

samreid commented 2 years ago

I noticed that we are calling withServer again and again in a while (true) loop in QuickServer. Each time, this allocates a new node server:

    const server = http.createServer( ( req, res ) => {

We call server.close on it, but after reading https://stackoverflow.com/questions/14626636/how-do-i-shutdown-a-node-js-https-server-immediately it's unclear whether that's enough to clear that memory. @zepumph could there be a memory leak in these server instances?

zepumph commented 2 years ago

I just logged onto bayes and saw that there were 2000 instances of chrome running.

Then I ran pm2 stop continuous-client and checked and there were 500 chrome instances still running.

I then ran pm2 stop continuous-quick-server and checked and there were 0 chrome instances still running. Quite likely however we are running Puppeteer in continuous client is similarly problematic to the quick server. Thanks for the intel @samreid! It may have to do with withServer, but I think the most likely culprit is puppeteerLoad itself.

samreid commented 2 years ago

Can you reproduce this locally, or only on bayes? Do we think it is related to pm2 or linux? Do you think I could reproduce the problem on mac with or without pm2? I'm investigating playwright in a separate issue but would like a way to test how it impacts this problem (even if I have to test on bayes directly).

samreid commented 2 years ago

I logged in to bayes as phet-admin and saw pm2 list like so:

[phet-admin@bayes ~]$ pm2 list
┌─────┬────────────────────────────┬─────────────┬─────────┬─────────┬──────────┬────────┬──────┬───────────┬──────────┬──────────┬──────────┬──────────┐
│ id  │ name                       │ namespace   │ version │ mode    │ pid      │ uptime │ ↺    │ status    │ cpu      │ mem      │ user     │ watching │
├─────┼────────────────────────────┼─────────────┼─────────┼─────────┼──────────┼────────┼──────┼───────────┼──────────┼──────────┼──────────┼──────────┤
│ 4   │ continuous-client          │ default     │ N/A     │ fork    │ 20052    │ 11h    │ 1    │ online    │ 100%     │ 387.6mb  │ phe… │ disabled │
│ 3   │ continuous-quick-server    │ default     │ N/A     │ fork    │ 20049    │ 11h    │ 0    │ online    │ 0%       │ 102.7mb  │ phe… │ disabled │
│ 1   │ continuous-server          │ default     │ N/A     │ fork    │ 20043    │ 11h    │ 10   │ online    │ 0%       │ 346.3mb  │ phe… │ disabled │
│ 0   │ phettest-server            │ default     │ 0.0.0   │ fork    │ 20035    │ 11h    │ 2    │ online    │ 0%       │ 43.8mb   │ phe… │ disabled │
│ 2   │ yotta-server               │ default     │ 0.0.0   │ fork    │ 20068    │ 11h    │ 2    │ online    │ 0%       │ 37.1mb   │ phe… │ disabled │
└─────┴────────────────────────────┴─────────────┴─────────┴─────────┴──────────┴────────┴──────┴───────────┴──────────┴──────────┴──────────┴──────────┘

that pgrep -c chrome reported 1579. Also, ps aux --no-heading showed many as defunct. Also there were many listings from perennial:

...
phet-ad+ 40211  0.5  0.0      0     0 ?        Zs   18:21   0:48 [chrome] <defunct>
phet-ad+ 40408  0.3  0.0      0     0 ?        Zs   15:19   1:00 [chrome] <defunct>
phet-ad+ 40414  0.0  0.0 17469332 57344 ?      Ssl  16:48   0:03 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40421  0.0  0.0 17299480 35760 ?      S    16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40423  0.0  0.0 17299464 35764 ?      S    16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40425  0.0  0.0 17299496 10972 ?      S    16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40442  0.0  0.0 17492024 40044 ?      Sl   16:48   0:04 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40445  0.0  0.0 25919264 35332 ?      Sl   16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40555  0.0  0.0 17332372 37372 ?      Sl   16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40727  0.0  0.0 17465132 57832 ?      Ssl  16:48   0:04 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40730  0.0  0.0 17299480 35760 ?      S    16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40732  0.0  0.0 17299464 35764 ?      S    16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40734  0.0  0.0 17299500 10976 ?      S    16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40748  0.0  0.0 17491828 31116 ?      Sl   16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40755  0.0  0.0 25927528 35576 ?      Sl   16:48   0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40909  0.6  0.0      0     0 ?        Zs   18:21   0:58 [chrome] <defunct>
phet-ad+ 40927  0.3  0.0      0     0 ?        Zs   18:21   0:30 [chrome] <defunct>
phet-ad+ 40942  0.6  0.0      0     0 ?        Zs   18:21   0:53 [chrome] <defunct>

I also saw:

[phet-admin@bayes ~]$ pgrep -c node
44

After stopping continuous-server, I saw:

[phet-admin@bayes ~]$ pm2 stop continuous-server
[PM2] Applying action stopProcessId on app [continuous-server](ids: [ 1 ])
[phet-admin@bayes ~]$ pgrep -c chrome
1582
[phet-admin@bayes ~]$ pgrep -c node
35

These numbers don't make sense, since the number of chromes went from 1579 to 1582.

Then I stopped quick-server:

[phet-admin@bayes ~]$ pm2 stop continuous-quick-server
[PM2] Applying action stopProcessId on app [continuous-quick-server](ids: [ 3 ])
...
[phet-admin@bayes ~]$ pgrep -c chrome
784
[phet-admin@bayes ~]$ pgrep -c node
33

So that reduced chromes from 1582 to 784 and nodes from 35 to 33. Still disconcerting that we still have 784 chromes running.

After restarting those pm2 services, and running pgrep -c chrome intermittently, I saw the number of chrome instances climb from 796 on startup to 811 after a few minutes.

samreid commented 2 years ago

I also saw https://github.com/puppeteer/puppeteer/issues/1825 which described defunct zombie processes like we see here.

I also saw https://askubuntu.com/questions/201303/what-is-a-defunct-process-and-why-doesnt-it-get-killed which describes defunct processes.

I used that process to find the PPID of the bad parent process, and I ran kill -9 20052 and saw that pgrep -c chrome dropped to 94. One other process accounted for the other defunct chromes. I killed that parent process too.

I checked that pm2 is still operational, but after 5 minutes, there are already 50 defunct chrome processes.

samreid commented 2 years ago

During my review of quick server, I noticed two potential problems:

By the way, bayes has climbed to 204 chrome processes (as reported by pgrep -c chrome) while I worked on this. UPDATE: 220

samreid commented 2 years ago

I stopped continuous-server and continuous-quick-server, pulled aqua and perennial in each, and restarted. The number of pgrep -c chrome chrome instances held steady at 208 for a while, but just jumped to 241. But only a few (if any) look like defunct zombies.

UPDATE: Holding steady at 241 for quite a while.

samreid commented 2 years ago

The perennial commit seems like a bugfix, but warrants a code review since it is used in many places, and one or more usage sites could have been relying on the bug somehow? I marked blocks-publication and ready for review for that part.

Also, we are still around 300 chrome instances, and not many look like zombies. So there is probably still a problem of some sort.

samreid commented 2 years ago

This morning (10 hours later than preceding comment), I observed:

[phet-admin@bayes ~]$ pgrep -c chrome
927

With around 600 marked as defunct according to ps -ef | grep defunct.

zepumph commented 2 years ago

Great fix! I was using this command to see how many chrome instances there were ps aux | grep chrome | wc -l.

We have to apply this to the continuous-client process too. I just did this. I bet that will solve the rest of the defunct guys.

zepumph commented 2 years ago

After pulling/restarting continuous-client, I see no defunct chrome instances. There were 307 upon restart, and are now 313. I believe this is solved, but we can check in tomorrow. Thanks so much @samreid for finding that bug!

zepumph commented 2 years ago

We are still hovering around 330 chrome instances. Closing

zepumph commented 1 year ago

In general things were working well, but when there were errors outside the expected places for the webpage, puppeteerLoad still wasn't cleaning up. Thus, above I added even more cleanup.

When looking at the error logs, I see the page.goto has caused a fair number of errors over the past months. I wonder if that contributed to the errors. I'll check into to see if that helps things soon.

zepumph commented 1 year ago

This happened again tonight, and my best guess is that it was because https://github.com/phetsims/aqua/commit/1de7fdf7b10ef32d41b96404201de183b37c6739 was launching too many chrome instances. I restarted things and pushed a fix over in https://github.com/phetsims/aqua/issues/152 and ps aux | grep chrome | wc -l is showing steadily with 88 chrome instances.

zepumph commented 1 year ago

After the weekend we are at 112 chrome instances. Perhaps there is still a slow problem. After stopping all CT processes, there are still 25 zombies. I will keep tabs on these.

zepumph commented 1 year ago

Now there are 184 chrome instances. That is still growing. I need to continue to work here.

zepumph commented 1 year ago

@mattpen and I spoke about this yesterday. We feel like it is important to improve the logging in our puppeteer processes, and then from there figure out where the error case is coming from. I'll take the lead!

zepumph commented 1 year ago

I think that https://github.com/phetsims/perennial/issues/291 could be helpful here, or at least something to note.

zepumph commented 1 year ago

We are running into this over in https://github.com/phetsims/special-ops/issues/234

zepumph commented 1 year ago

Most likely our sparky case will be solved by https://github.com/phetsims/aqua/issues/172.

zepumph commented 1 year ago

Ok. After some discussion with @mattpen, The above commit should handle the orphaned chrome instances that are much more rare from the build process. I'm ready to close this issue, but I'll reopen if we need to track down any more orphaned chrome instances.