phetsims / aqua

Automatic QUality Assurance
MIT License

Run CT on Node20 #219

Closed zepumph closed 3 weeks ago

zepumph commented 1 month ago

Creating this issue since we kinda broke all of CT. Perhaps it is an SELinux problem, and perhaps we should upgrade our Puppeteer version. I'll take a look.

CTQ is running correctly.

zepumph commented 1 month ago

Today's investigation was largely blocked by changes made in

https://github.com/phetsims/perennial/issues/386

and

https://github.com/phetsims/chipper/issues/1498

I am beginning to think that all the trouble we have had on main comes from the registerTasks arg splitting. It is hard to test on servers, but I seem to get consistent args that look like (pseudocode) "node pm2/ProcessForwarder.js quick-server", where the ProcessForwarder knows how to splice in grunt for itself. I'll need to come back to this tomorrow.
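
To make the argument shape concrete, here is a minimal sketch of how a forwarder script could splice grunt in front of the task name it receives. This is an assumption for illustration, not the actual pm2/ProcessForwarder.js; `buildForwardedCommand` and `forward` are hypothetical names.

```javascript
// Hypothetical sketch: NOT the real pm2/ProcessForwarder.js implementation.
const { spawn } = require('node:child_process');

// Turn an argv like [ 'node', 'pm2/ProcessForwarder.js', 'quick-server' ]
// into the command that should actually run, with grunt spliced in.
function buildForwardedCommand(argv) {
  // argv[0] is the node binary and argv[1] is the script; the rest are task args.
  const taskArgs = argv.slice(2);
  if (taskArgs.length === 0) {
    throw new Error('Expected a task name, e.g. "quick-server"');
  }
  return { command: 'grunt', args: taskArgs };
}

// Spawn the forwarded command and mirror its exit code (assumes grunt is on the PATH).
function forward(argv) {
  const { command, args } = buildForwardedCommand(argv);
  const child = spawn(command, args, { stdio: 'inherit' });
  child.on('error', err => {
    console.error(err);
    process.exit(1);
  });
  child.on('exit', code => process.exit(code === null ? 1 : code));
  return child;
}
```

Keeping the argument handling in a pure function like `buildForwardedCommand` would let the splitting be checked locally, without needing a server to reproduce it.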

zepumph commented 4 weeks ago

I made some progress today. I think there is a serious concern that something in our processes is taking up way too much processing power. My theories:

  1. We are spawning many more child processes for grunt now, and this is either more overhead or leaking memory in an unknown way.
  2. There is a hidden-esque error occurring internally (maybe with puppeteer??), and the error handling is sub-par, causing a memory leak or infinite loop kind of repetition.

I'll need to come back to this.

zepumph commented 3 weeks ago

Some more discussion and summary with @samreid this morning:

problems:

  1. pm2 start ctq triggers an endless restart loop + Slack notification. This may be related to trying to start it while the server is overloaded by (2) below.
  2. Launching 100 puppeteer/firefox clients === soooooooooooooooooooooooooo slow.
    • "->" indicates a hypothesis, with potential investigations and solutions listed beneath it.
    • -> gruntSpawn subprocess is slow
      • Try using sage run; untested whether this fixes the slowness with 100 instances
        • 4|ct-puppe | 2024-10-30T20:18:49: Aborted due to warnings.
        • 4|ct-puppe | 2024-10-30T20:18:49: 2024-10-30T20:18:49: Warning: Task "../perennial/bin/sage" not found. Use --force to continue.
    • -> old puppeteer 19 + Node 20 = timeouts and other problems
      • 90|ct-node-puppeteer-client | 2024-10-30T16:22:45: error: FAILED TO RUN TEST, Tried to run 3 times, never completed, failure: Error: TimeoutError: Timed out after 30000 ms while trying to connect to the browser! Only Chrome at revision r1056772 is guaranteed to work.
      • 90|ct-node-puppeteer-client | browserPageLoad caught unexpected error: TimeoutError: Timed out after 30000 ms while trying to connect to the browser! Only Chrome at revision r1056772 is guaranteed to work.
    • -> ct-puppeteer-client has 502 on aquaserver/next-test -- We traced back to this failure on CT-main:
    • Even if the above solutions work, should we investigate why the problem is occurring to begin with, and whether we should improve on the underlying change?
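
The "Tried to run 3 times, never completed" log line suggests a retry loop with a per-attempt timeout around each browser test. A minimal sketch of that pattern follows; `withTimeout` and `runWithRetries` are hypothetical names, and this is not the actual client code.

```javascript
// Generic retry-with-timeout sketch; names here are hypothetical.

// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  // Clear the timer either way so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Call `attempt` up to `maxAttempts` times and return the first success.
async function runWithRetries(attempt, maxAttempts, timeoutMs) {
  let lastError;
  for (let i = 1; i <= maxAttempts; i++) {
    try {
      return await withTimeout(attempt(i), timeoutMs);
    } catch (e) {
      lastError = e; // remember the failure and try again
    }
  }
  throw new Error(`Tried to run ${maxAttempts} times, never completed, failure: ${lastError}`);
}
```

Note that a timed-out attempt is abandoned, not cancelled. Anything it launched (such as a browser) still has to be shut down separately, otherwise each retry leaves resources behind.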

zepumph commented 3 weeks ago

Ok. I'm testing without cluster mode, and I think this error may help show the memory leak from puppeteer's side. Likely we still need to do https://github.com/phetsims/perennial/issues/393.

2|ct-chrom | 2024-10-31T17:59:03: (node:10112) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 101 SIGINT listeners added to [process]. MaxListeners is 100. Use emitter.setMaxListeners() to increase limit
2|ct-chrom | 2024-10-31T17:59:03: (node:10112) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 101 SIGTERM listeners added to [process]. MaxListeners is 100. Use emitter.setMaxListeners() to increase limit
2|ct-chrom | 2024-10-31T17:59:08: (node:10112) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 101 exit listeners added to [process]. MaxListeners is 100. Use emitter.setMaxListeners() to increase limit
2|ct-chrom | 2024-10-31T17:59:08: (node:10112) MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 101 SIGHUP listeners added to [process]. MaxListeners is 100. Use emitter.setMaxListeners() to increase limit
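
One plausible reading of these warnings is that every browser launch adds one handler per event to `process`, and the handlers are never removed because the browser is never closed. The sketch below reproduces that accumulation with a stand-in emitter instead of the real `process`; `fakeLaunch` is a hypothetical helper, not a Puppeteer API.

```javascript
// Sketch with a stand-in emitter; a real launcher would attach these to `process`.
const { EventEmitter } = require('node:events');

const EVENTS = ['SIGINT', 'SIGTERM', 'SIGHUP', 'exit'];

// Simulates a launch that registers one handler per event and returns a
// dispose function that removes those handlers again (hypothetical helper).
function fakeLaunch(emitter) {
  const handler = () => {};
  EVENTS.forEach(name => emitter.on(name, handler));
  return () => EVENTS.forEach(name => emitter.off(name, handler));
}

const fakeProcess = new EventEmitter();
fakeProcess.setMaxListeners(100); // mirrors "MaxListeners is 100" in the log

// Leaky: five launches, nothing disposed, so five listeners remain per event.
for (let i = 0; i < 5; i++) {
  fakeLaunch(fakeProcess);
}
const leakedCount = fakeProcess.listenerCount('SIGINT');

// Cleaned up: each launch is disposed, so the count does not grow any further.
for (let i = 0; i < 5; i++) {
  const dispose = fakeLaunch(fakeProcess);
  dispose();
}
const afterCleanCount = fakeProcess.listenerCount('SIGINT');

console.log({ leakedCount, afterCleanCount });
```

Both counts are 5 here: the leaky loop leaves five listeners per event behind, while the disposed loop adds none. With 101 undisposed launches, the count would cross the limit of 100 and trigger the warning shown in the log.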

zepumph commented 3 weeks ago

I believe that all the trouble we have been encountering this week was because of a memory leak in browserPageLoad(). Fixed by https://github.com/phetsims/perennial/commit/f6a0b7a4a62383470b797d0391f0450b78aa6b94 above. CT is working well now.
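
For readers without the commit at hand, a common way to avoid this class of leak is to close the browser in a `finally` block so it is released on failure as well as on success. The sketch below is a generic pattern under that assumption, not the code from the linked commit; `loadPageSafely` and `launchBrowser` are hypothetical names.

```javascript
// Generic cleanup pattern only; this is NOT the code from the linked commit.
// `launchBrowser` is a hypothetical factory returning an object with
// async load(url) and async close() methods.
async function loadPageSafely(launchBrowser, url) {
  const browser = await launchBrowser();
  try {
    return await browser.load(url);
  } finally {
    // Runs on success and on failure, so the browser never outlives the call.
    await browser.close();
  }
}

// Usage with a fake browser that counts how many instances are still open.
let openBrowsers = 0;
const fakeLaunchBrowser = async () => {
  openBrowsers++;
  return {
    load: async url => {
      if (url === 'bad') { throw new Error('page failed to load'); }
      return `loaded ${url}`;
    },
    close: async () => { openBrowsers--; }
  };
};

loadPageSafely(fakeLaunchBrowser, 'bad')
  .catch(() => console.log('open browsers after failure:', openBrowsers));
```

The load error still propagates to the caller, so retry logic keeps working, but each failed attempt releases its browser and the listeners attached to it.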

zepumph commented 3 weeks ago

We will continue optimizing sparky tasks over in https://github.com/phetsims/aqua/issues/220

zepumph commented 3 weeks ago

Added a TODO here to test https://github.com/phetsims/perennial/issues/362

phet-dev commented 3 weeks ago

Reopening because there is a TODO marked for this issue.

zepumph commented 3 weeks ago

Re-closing for more testing.

phet-dev commented 3 weeks ago

Reopening because there is a TODO marked for this issue.

zepumph commented 3 weeks ago

Excellent!