Likely https://github.com/puppeteer/puppeteer/issues/1981 is a byproduct of this error.
I was also able to reproduce this just by going onto bayes and building arithmetic with --brands=phet-io in the continuous-client repos. It worked when I stopped all pm2 processes. Something with continuous-client is not working well over a long period of time.
I think @mattpen and I got to the bottom of this. The new puppeteer client seems to be orphaning chrome instances when it stops or crashes, and so this morning, when all pm2 processes were stopped, we found 575 chrome instances still running. I will work on better cleanup/error handling for the puppeteer client.
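As a first pass, here is a minimal sketch of the kind of cleanup I have in mind (a hypothetical helper, not the actual continuous-client code): make sure the Chromium child is shut down whenever this node process stops or crashes, so a pm2 stop or an uncaught error can't orphan it.

const puppeteer = require( 'puppeteer' );

// Hypothetical sketch: ensure the Chromium child dies along with this node process.
async function launchWithCleanup() {
  const browser = await puppeteer.launch();

  const shutdown = async () => {
    try {
      await browser.close();
    }
    catch( e ) {

      // Last resort: kill the Chromium child process directly.
      const chromeProcess = browser.process();
      chromeProcess && chromeProcess.kill( 'SIGKILL' );
    }
    process.exit();
  };

  process.on( 'SIGINT', shutdown );          // e.g. manual interrupt
  process.on( 'SIGTERM', shutdown );         // e.g. pm2 stop
  process.on( 'uncaughtException', shutdown );

  return browser;
}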
Also noting that once the orphaned chrome instances were killed off, I was able to successfully build arithmetic in the same way while all pm2 processes were running.
A number of repos are showing similar errors this morning (6/29/22). Example:
This morning there were 5200 instances of chrome running. I should look into this! I was hoping to have time this afternoon.
https://stackoverflow.com/questions/62220867/puppeteer-chromium-instances-remain-active-in-the-background-after-browser-disc has something I'd like to try: instead of using the zygote (Chromium's internal process forker), let's have things forked via node. Worth a shot!
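For the record, the change would look roughly like this (a sketch only; the flag would be passed wherever we call puppeteer.launch):

const puppeteer = require( 'puppeteer' );

// Sketch: disable Chromium's zygote process forker, per the Stack Overflow answer above.
const launchBrowser = () => puppeteer.launch( {
  args: [ '--no-zygote' ]
} );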
This didn't seem to change anything. After 2 hours, there are double the number of chrome instances running. I'll have to keep looking.
I noticed that we are calling withServer again and again in a while (true) loop in QuickServer. Each time, this allocates a new node server:
const server = http.createServer( ( req, res ) => {
We call server.close on it, but after reading https://stackoverflow.com/questions/14626636/how-do-i-shutdown-a-node-js-https-server-immediately it's unclear whether that's enough to clear that memory. @zepumph could there be a memory leak in these server instances?
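For reference, here is a sketch of a more thorough shutdown (hypothetical; withServer's actual implementation may differ): track open sockets and destroy them before closing, and wait for the close callback, so each loop iteration really releases its server.

const http = require( 'http' );

// Hypothetical withServer-style helper that fully tears down its server.
async function withServer( callback ) {
  const server = http.createServer( ( req, res ) => { /* ... handle requests ... */ } );

  // server.close() only stops new connections; track sockets so we can drop
  // any that are still open (e.g. keep-alive) when we shut down.
  const sockets = new Set();
  server.on( 'connection', socket => {
    sockets.add( socket );
    socket.on( 'close', () => sockets.delete( socket ) );
  } );

  await new Promise( resolve => server.listen( 0, resolve ) );
  try {
    return await callback( server.address().port );
  }
  finally {
    sockets.forEach( socket => socket.destroy() );
    await new Promise( resolve => server.close( resolve ) ); // wait for the actual close
  }
}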
I just logged onto bayes and saw that there were 2000 instances of chrome running. Then I ran pm2 stop continuous-client and checked, and there were 500 chrome instances still running. I then ran pm2 stop continuous-quick-server and checked, and there were 0 chrome instances still running.
Quite likely, however we are running Puppeteer in continuous-client is similarly problematic to the quick server. Thanks for the intel @samreid! It may have to do with withServer, but I think the most likely culprit is puppeteerLoad itself.
Can you reproduce this locally, or only on bayes? Do we think it is related to pm2 or linux? Do you think I could reproduce the problem on mac with or without pm2? I'm investigating playwright in a separate issue but would like a way to test how it impacts this problem (even if I have to test on bayes directly).
I logged in to bayes as phet-admin and saw pm2 list like so:
[phet-admin@bayes ~]$ pm2 list
┌────┬─────────────────────────┬───────────┬─────────┬──────┬───────┬────────┬────┬────────┬──────┬─────────┬──────┬──────────┐
│ id │ name                    │ namespace │ version │ mode │ pid   │ uptime │ ↺  │ status │ cpu  │ mem     │ user │ watching │
├────┼─────────────────────────┼───────────┼─────────┼──────┼───────┼────────┼────┼────────┼──────┼─────────┼──────┼──────────┤
│ 4  │ continuous-client       │ default   │ N/A     │ fork │ 20052 │ 11h    │ 1  │ online │ 100% │ 387.6mb │ phe… │ disabled │
│ 3  │ continuous-quick-server │ default   │ N/A     │ fork │ 20049 │ 11h    │ 0  │ online │ 0%   │ 102.7mb │ phe… │ disabled │
│ 1  │ continuous-server       │ default   │ N/A     │ fork │ 20043 │ 11h    │ 10 │ online │ 0%   │ 346.3mb │ phe… │ disabled │
│ 0  │ phettest-server         │ default   │ 0.0.0   │ fork │ 20035 │ 11h    │ 2  │ online │ 0%   │ 43.8mb  │ phe… │ disabled │
│ 2  │ yotta-server            │ default   │ 0.0.0   │ fork │ 20068 │ 11h    │ 2  │ online │ 0%   │ 37.1mb  │ phe… │ disabled │
└────┴─────────────────────────┴───────────┴─────────┴──────┴───────┴────────┴────┴────────┴──────┴─────────┴──────┴──────────┘
I also saw that pgrep -c chrome reported 1579, and that ps aux --no-heading showed many of them as defunct. There were also many listings from perennial:
...
phet-ad+ 40211 0.5 0.0 0 0 ? Zs 18:21 0:48 [chrome] <defunct>
phet-ad+ 40408 0.3 0.0 0 0 ? Zs 15:19 1:00 [chrome] <defunct>
phet-ad+ 40414 0.0 0.0 17469332 57344 ? Ssl 16:48 0:03 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40421 0.0 0.0 17299480 35760 ? S 16:48 0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40423 0.0 0.0 17299464 35764 ? S 16:48 0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40425 0.0 0.0 17299496 10972 ? S 16:48 0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40442 0.0 0.0 17492024 40044 ? Sl 16:48 0:04 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40445 0.0 0.0 25919264 35332 ? Sl 16:48 0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40555 0.0 0.0 17332372 37372 ? Sl 16:48 0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40727 0.0 0.0 17465132 57832 ? Ssl 16:48 0:04 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40730 0.0 0.0 17299480 35760 ? S 16:48 0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40732 0.0 0.0 17299464 35764 ? S 16:48 0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40734 0.0 0.0 17299500 10976 ? S 16:48 0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40748 0.0 0.0 17491828 31116 ? Sl 16:48 0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40755 0.0 0.0 25927528 35576 ? Sl 16:48 0:00 /data/share/phet/continuous-quick-server/perennial/node_mo
phet-ad+ 40909 0.6 0.0 0 0 ? Zs 18:21 0:58 [chrome] <defunct>
phet-ad+ 40927 0.3 0.0 0 0 ? Zs 18:21 0:30 [chrome] <defunct>
phet-ad+ 40942 0.6 0.0 0 0 ? Zs 18:21 0:53 [chrome] <defunct>
I also saw:
[phet-admin@bayes ~]$ pgrep -c node
44
After stopping continuous-server, I saw:
[phet-admin@bayes ~]$ pm2 stop continuous-server
[PM2] Applying action stopProcessId on app [continuous-server](ids: [ 1 ])
[phet-admin@bayes ~]$ pgrep -c chrome
1582
[phet-admin@bayes ~]$ pgrep -c node
35
These numbers don't make sense, since the number of chromes went from 1579 to 1582.
Then I stopped quick-server:
[phet-admin@bayes ~]$ pm2 stop continuous-quick-server
[PM2] Applying action stopProcessId on app [continuous-quick-server](ids: [ 3 ])
...
[phet-admin@bayes ~]$ pgrep -c chrome
784
[phet-admin@bayes ~]$ pgrep -c node
33
So that reduced chromes from 1582 to 784 and nodes from 35 to 33. It is still disconcerting that 784 chrome processes remain.
After restarting those pm2 services and running pgrep -c chrome intermittently, I saw the number of chrome instances climb from 796 on startup to 811 after a few minutes.
I also saw https://github.com/puppeteer/puppeteer/issues/1825, which describes defunct zombie processes like we see here, and https://askubuntu.com/questions/201303/what-is-a-defunct-process-and-why-doesnt-it-get-killed, which explains what a defunct process is.
I used that process to find the PPID of the bad parent process, and I ran kill -9 20052 and saw that pgrep -c chrome dropped to 94. One other process accounted for the other defunct chromes. I killed that parent process too.
I checked that pm2 is still operational, but after 5 minutes, there are already 50 defunct chrome processes.
During my review of quick server, I noticed two potential problems:
By the way, bayes has climbed to 204 chrome processes (as reported by pgrep -c chrome) while I worked on this. UPDATE: 220
I stopped continuous-server and continuous-quick-server, pulled aqua and perennial in each, and restarted. The number of chrome instances (per pgrep -c chrome) held steady at 208 for a while, but just jumped to 241. But only a few (if any) look like defunct zombies.
UPDATE: Holding steady at 241 for quite a while.
The perennial commit seems like a bugfix, but it warrants a code review since the function is used in many places, and one or more usage sites could have been relying on the bug somehow. I marked this blocks-publication and ready-for-review for that part.
Also, we are still around 300 chrome instances, and not many look like zombies. So there is probably still a problem of some sort.
This morning (10 hours after the preceding comment), I observed:
[phet-admin@bayes ~]$ pgrep -c chrome
927
Around 600 of those were marked as defunct according to ps -ef | grep defunct.
Great fix! I was using this command to see how many chrome instances there were: ps aux | grep chrome | wc -l
We have to apply this to the continuous-client process too. I just did this. I bet that will solve the rest of the defunct guys.
After pulling/restarting continuous-client, I see no defunct chrome instances. There were 307 upon restart, and there are now 313. I believe this is solved, but we can check in tomorrow. Thanks so much @samreid for finding that bug!
We are still hovering around 330 chrome instances. Closing.
In general things were working well, but when errors occurred outside the expected places during page load, puppeteerLoad still wasn't cleaning up. Thus, above I added even more cleanup.
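The shape of that cleanup is roughly the following (a sketch under the assumption that puppeteerLoad launches its own browser; the real perennial code may differ): close the page and browser in a finally block so an error anywhere, not just inside page.goto, cannot leave a Chrome instance behind.

const puppeteer = require( 'puppeteer' );

// Sketch of defensive cleanup; not the exact perennial implementation.
async function puppeteerLoad( url, { timeout = 30000 } = {} ) {
  const browser = await puppeteer.launch();
  let page = null;
  try {
    page = await browser.newPage();
    await page.goto( url, { timeout } );
    return await page.evaluate( () => document.title ); // placeholder result
  }
  finally {

    // Swallow cleanup errors so a failed page.close() cannot skip browser.close().
    if ( page ) {
      await page.close().catch( () => {} );
    }
    await browser.close().catch( () => {} );
  }
}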
When looking at the error logs, I see that page.goto has caused a fair number of errors over the past months. I wonder if that contributed to the errors here. I'll check into it soon to see if handling those errors helps things.
This happened again tonight, and my best guess is that it was because https://github.com/phetsims/aqua/commit/1de7fdf7b10ef32d41b96404201de183b37c6739 was launching too many chrome instances. I restarted things and pushed a fix over in https://github.com/phetsims/aqua/issues/152, and ps aux | grep chrome | wc -l is holding steady at 88 chrome instances.
After the weekend we are at 112 chrome instances. Perhaps there is still a slow leak. After stopping all CT processes, there are still 25 zombies. I will keep tabs on these.
Now there are 184 chrome instances, and the count is still growing. I need to continue working here.
@mattpen and I spoke about this yesterday. We feel like it is important to improve the logging in our puppeteer processes, and then from there figure out where the error case is coming from. I'll take the lead!
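As a starting point, the logging could be as simple as attaching listeners to the puppeteer events that currently fail silently (a sketch; the exact hooks and log destination are still to be decided):

// Sketch: surface browser/page failure modes in the pm2 logs.
function addPuppeteerLogging( browser, page, log = console.error ) {
  browser.on( 'disconnected', () => log( 'browser disconnected' ) );
  page.on( 'error', error => log( 'page crashed:', error ) );
  page.on( 'pageerror', error => log( 'uncaught error in page:', error ) );
  page.on( 'requestfailed', request => log( 'request failed:', request.url() ) );
}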
I think that https://github.com/phetsims/perennial/issues/291 could be helpful here, or at least something to note.
We are running into this over in https://github.com/phetsims/special-ops/issues/234
Most likely our sparky case will be solved by https://github.com/phetsims/aqua/issues/172.
OK. After some discussion with @mattpen, the above commit should handle the orphaned chrome instances from the build process, which are much more rare. I'm ready to close this issue, but I'll reopen if we need to track down any more orphaned chrome instances.
I think this has something to do with bayes and the number of puppeteer instances on it, but I'm not really sure. I'll have to take a look.
This looks to be exactly the same as https://github.com/puppeteer/puppeteer/issues/6757 (except for the Docker part). I'll see if there is something I need to do better about closing it down.