Hey All,
I've finally found some time to play around some more this evening. I've been able to narrow it down to a specific query which causes the CPU to spike. @testn I've actually found that this old section of code (probably through a lot of schema extensions over the years) is, in fact, querying/returning 313 columns. My apologies for saying that I didn't think we would have over 300. I have an API whittled down to just this single query which causes the issue.
If I stop interacting with my test site/API and let the instance just sit for approximately 20 minutes, the CPU will settle down. I had strace running and happened to capture the following right as the node process settled:
It would probably take a large effort to rewrite this query if the joins are the problem but, certainly, I think we could whittle down the attributes being returned so we're not returning entire tables' worth of columns.
I will be consulting with some of our DB experts further.
Of course, as bad as this query appears to be, I still have to figure out what package change(s) may have caused this to start rearing its ugly head, so we can drive to a full explanation, or even a fix?
@ddolcimascolo Can you speak to your patterns at all to see if you're seeing anything similar?
Ain't it fun!!
Another strange observation....
When I let the CPU settle, it becomes harder and harder to get the CPU to peg. I'm now hitting this API and the problem will not occur. It seems that if I start my node process from scratch, I can hit this API/query maybe two times and the CPU pegs. Wait 20 minutes for settle, try again and it takes longer. I waited then for the settle to occur after another 20 minutes and was then unable to recreate the problem.
@testn @ddolcimascolo 99% of the time in epoll_wait seems expected to me - this is how the event loop in node works; most of the time the process is sleeping, waiting for timeouts or network events on file handles.
@testn could the above comments be a sign of GC pressure? I'll try to recreate a load test with sequelize and 300+ columns.
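Something along these lines, perhaps - a rough sketch using mysql2 directly rather than sequelize (connection settings are placeholders):
const mysql = require('mysql2/promise');

async function main() {
  const conn = await mysql.createConnection({
    host: 'localhost',
    user: 'root',
    database: 'test',
  });
  // a single-row result set with 320 aliased columns, no table needed
  const cols = Array.from({ length: 320 }, (_, i) => `1 AS col_${i}`).join(', ');
  const sql = `SELECT ${cols}`;
  const start = Date.now();
  for (let i = 0; i < 1000; i++) {
    await conn.query(sql);
  }
  console.log(`1000 queries in ${Date.now() - start} ms`);
  await conn.end();
}

main().catch(console.error);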
the profiler trace from @stavert seems to indicate that GC time is only around 1%. It would be nice if @stavert could add --trace-gc and compare both versions. The fact that the new code iterates through each row much faster could cause higher churn on the GC.
@stavert if you can email me the memory heap profile when CPU is pegged, it would be great.
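If attaching Chrome DevTools is awkward, a snapshot can also be triggered from inside the process - a sketch, assuming Node 12+; note that v8.writeHeapSnapshot is synchronous and needs enough memory headroom to serialize the heap:
const v8 = require('v8');
// write a .heapsnapshot file on demand, e.g. `kill -USR2 <pid>` while the CPU is pegged;
// the file can then be loaded in Chrome DevTools > Memory
process.on('SIGUSR2', () => {
  const file = v8.writeHeapSnapshot(`/tmp/heap-${Date.now()}.heapsnapshot`);
  console.log('heap snapshot written to', file);
});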
I can confirm the "20 minutes to settle" pattern, though the actual duration might vary a bit. I'll check our queries for the number of returned columns, but we do lots of joins so that'd be possible I suppose
@stavert I think you have to run strace with the '-f' parameter to make it follow the other threads in the same process. Your strace does not seem to have any detail at all.
@sidorares garbage collection time seems to be <2%. I doubt that it is the GC pressure.
@sidorares @testn @stavert Just to let you know that at 2pm (Paris time) today my production instances all went to 100%, leading to a degraded service, up to a point where we needed to restart the service. strace-ing very quickly before restarting showed the epoll_wait loop but nothing else. I'm starting to think this epoll_wait stuff is a dead end as @sidorares suggested.
I've scheduled a revert to v2.3.0 for next Friday, hoping you find something before that.
Cheers, David
Guys, I focused on the memory and found this using "Allocation sampling" in the Chrome debugger, attached to a process stuck at 100% CPU.
Lots of allocations are done by mysql2 in parser, readField, etc. Looks very similar to the changes committed between 2.3.0 and 2.3.2 as far as I can tell from @testn's comments above...
Here's the complete profile
Heap-20211102T161324.heapprofile.txt
EDIT: And memory usage keeps growing until I manually restart the process
Can you do a heap dump?
@ddolcimascolo how many fields do you have in your query?
Can you do a heap dump?
I tried, it makes the process crash every time
Hey Guys,
I've now run through git bisect twice. I've not actually used bisect before but I think I've done it correctly. Turns out the issue was present in v2.3.1 also so I only had to bisect between 2.3.0 and 2.3.1. Twice now this commit has turned out to be the offender.
$ git bisect good
0ba4f87eb9a08b43d993c8f3db1a524c96933d8f is the first bad commit
commit 0ba4f87eb9a08b43d993c8f3db1a524c96933d8f
Author: testn <test1@doramail.com>
Date:   Sat Oct 9 20:38:55 2021 +0700

    Convert function into class to cache the field definition

:040000 040000 fec4611f80b781b038fa14abf08f9038aed62865 9a00e31b07562e20b75cdd55177e6469d2aa2388 M lib
:040000 040000 065ac174f3fbe43ea7e26afec7ca7ba8902e01b6 bb639caab24f152ca527932b689856c4e828693b M test
The steps I took (so you can check my work):
git bisect start
git bisect bad v2.3.1
git bisect good v2.3.0
zip files and push them to server, restart server, and hit my failing API
The first 2 iterations were good so I marked them "git bisect good". The 3rd and 4th iterations were bad so I marked them "git bisect bad". The 5th iteration was the last one which was good, so I marked it and it gave me the above output.
I've been having trouble getting Chrome to attach remotely for heap so I wanted to try today to get to the offending commit if possible.
I really hope this is helpful!
Steve
@ddolcimascolo how many fields do you have in your query?
I actually haven't isolated the query yet. I just find processes at 100% CPU every time I check
@stavert That's great, thx for your analysis. I hope @sidorares and @testn can look into it soon. Maybe my info about memory helps, too.
What was the original problem that you tried to solve with this commit @testn?
@sidorares did wonder about #1402 originally.
sorry, mouse slipped...didn't mean to close...
@stavert , can you pull this change and see if it fixes the problem for you?
https://github.com/testn/node-mysql2/commit/232988382a3afc362975dffc25a64a98e80acff5
If it does fix it, I think the problem that you and @ddolcimascolo saw was probably something to do with the V8 compiler not being able to handle long functions very well. This change just removes the unrolling of the typecast wrapper and makes the constructor function much shorter.
@testn that seems like a very likely scenario for what is happening: the generated source code becomes too long to be handled efficiently by the optimising compiler and performance degrades significantly. The length of the generated parser function is proportional to the number of columns; there is probably a max threshold after which it's never optimised
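Illustratively - this is just a sketch of the shape, not the actual mysql2 generator - a per-result-set parser compiled with one statement per column grows linearly with the column count:
// hypothetical sketch: the generated source contains one assignment per column
function compileRowParser(fields) {
  const lines = fields.map(
    (f, i) => `  row[${JSON.stringify(f.name)}] = readValue(${i});`
  );
  return new Function('readValue', ['const row = {};', ...lines, 'return row;'].join('\n'));
}

// with 300+ fields the function body is 300+ statements long
const fields = Array.from({ length: 320 }, (_, i) => ({ name: `col_${i}` }));
const parse = compileRowParser(fields);
console.log(parse.toString().length, 'characters of generated source');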
need to check https://github.com/vhf/v8-bailout-reasons and maybe use example code from https://github.com/petkaantonov/bluebird/wiki/Optimization-killers
function printStatus(fn) {
switch(%GetOptimizationStatus(fn)) {
case 1: console.log("Function is optimized"); break;
case 2: console.log("Function is not optimized"); break;
case 3: console.log("Function is always optimized"); break;
case 4: console.log("Function is never optimized"); break;
case 6: console.log("Function is maybe deoptimized"); break;
case 7: console.log("Function is optimized by TurboFan"); break;
default: console.log("Unknown optimization status"); break;
}
}
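Note that the helper above only runs with natives syntax enabled (node --allow-natives-syntax), and those numeric codes date from the Crankshaft era; on current V8, %GetOptimizationStatus returns a bitfield, so expect the default branch. A quick way to exercise it (readRow is a hypothetical stand-in for the generated row parser):
function readRow() { return { a: 1 }; }  // hypothetical stand-in for the generated parser
readRow();                               // warm up so V8 has type feedback
%OptimizeFunctionOnNextCall(readRow);
readRow();
printStatus(readRow);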
@testn I feel like an idiot but this is really the first time I've worked in GitHub. I'm having trouble pulling your commit. I've tried a fetch, then a cherry-pick of the commit, but I don't see your branch locally. I keep getting bad-object.
@stavert you can npm install a branch/commit - https://stackoverflow.com/questions/14187956/npm-install-from-git-in-a-specific-version/18049685
{
  "dependencies": {
    "mysql2": "git@github.com:testn/node-mysql2.git#232988382a3afc362975dffc25a64a98e80acff5"
  }
}
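The github: shorthand with a commit-ish should also work if the full git URL is awkward, something like (same commit, shorthand form):
{
  "dependencies": {
    "mysql2": "github:testn/node-mysql2#232988382a3afc362975dffc25a64a98e80acff5"
  }
}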
@sidorares I'm not having much luck as I'm just not set up to pull from GitHub on either the test server or my Windows laptop. I've been trying to get things to work following:
https://docs.github.com/en/authentication/troubleshooting-ssh/error-permission-denied-publickey
but I'm just not having any luck getting it set up. Is there no other way that I can just pull this commit into my local repo so I can zip up the directory and move it to my server?
you can simply copy query.js and text_parser.js into your folder.
got it...testing now....need to run some positive and negative tests....
@sidorares @testn @ddolcimascolo It looks good!!!!!
I've swapped the two files back and forth about 3 times now and it's either working or failing depending on which files I have in place. You guys rock!
@testn can you make a PR with your change? also maybe add "lots of columns" case to the perf benchmark. Once merged I'll cut 2.3.3 release
@sidorares I already tried with a couple hundred columns on my machine but I cannot replicate the problem. I'm not sure whether it is platform-dependent though
@stavert do you see any perf improvement between 2.3.0 and the patched version?
I'll check tomorrow and let you know...
@sidorares let me try to find a way to re-create the problem.
@testn I'm very sorry but I have not had any chance today to do some perf comparisons. Hopefully tomorrow.
@testn Ran some raw perf tests with curl on the instance directly (localhost). 100 iterations of curl to my full API with all queries now. I see, on average, about 10% improvement. Here's some logs:
2.3.0.perf.log 2.3.2-patched.perf.log
It was nice to blast through the 100 calls, watch the CPU go up but never really peg and then immediately drop once the script was done.
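For anyone wanting to repeat the test, it was just 100 sequential curl calls; a node equivalent would be roughly this (a sketch, the URL is a placeholder):
const http = require('http');

function hit(url) {
  return new Promise((resolve, reject) => {
    http.get(url, (res) => {
      res.resume();            // drain the body, we only care about timing
      res.on('end', resolve);
    }).on('error', reject);
  });
}

(async () => {
  const url = 'http://localhost:3000/my-api';
  const start = Date.now();
  for (let i = 0; i < 100; i++) {
    await hit(url);
  }
  console.log(`100 requests in ${Date.now() - start} ms`);
})();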
Here's some additional logs from an API which takes a little longer:
@sidorares Any chance you can release a version including the patch soon, like, today?
Thx!
@ddolcimascolo I'll push -rc to npm today
@ddolcimascolo I'll push -rc to npm today
do you want me to send the PR?
btw, I'm still trying to recreate the problem
yes please if you have time @testn. We also need to fix the failing coverage job (or at least disable it for now to unblock other PRs; for some reason https://github.com/ewjoachim/coverage-comment-action is failing)
I think it's worth publishing an rc even if we're not 100% sure it fixes the root cause
@testn that seems like a very likely scenario for what is happening: the generated source code becomes too long to be handled efficiently by the optimising compiler and performance degrades significantly. The length of the generated parser function is proportional to the number of columns; there is probably a max threshold after which it's never optimised
need to check https://github.com/vhf/v8-bailout-reasons and maybe use example code from https://github.com/petkaantonov/bluebird/wiki/Optimization-killers
(the printStatus snippet quoted from the earlier comment)
I think CrankShaft is not used anymore. It must be a bug in TurboFan that I cannot find yet.
@ddolcimascolo published v2.3.3-rc.0
@stavert @ddolcimascolo can you confirm that it fixes the problem?
I'm sorry guys, quite a busy day here... I was actually expecting renovate (the tool we're using to manage dependency upgrades) to pick up the new version automatically, so I didn't bother that much while waiting for the release. For a reason that I don't get, it didn't :(
I will manually upgrade, but not before next week. I'll try to keep you posted.
Cheers
David
Hi all,
I've rolled out v2.3.3-rc.0 at 0:00 (Paris time), so far so good as per the attached charts
Cheers, David
@sidorares Sorry for all the trouble, but let's cut 2.3.3?
yea seems like it solved the main problem. Any luck reliably reproducing it so we can add it to a perf or unit test?
yea seems like it solved the main problem. Any luck reliably reproducing it so we can add it to a perf or unit test?
Sadly, I'm still trying to reproduce the problem. It looks like it has to hit a certain threshold before it can happen. It might also be a version- or platform-specific problem, since I cannot reproduce it on Node 16 on Windows yet.
@sidorares Sorry for all the trouble, but let's cut 2.3.3?
@testn done
Hey guys, it's been a week without issues. Renovate did pick up v2.3.3 that you released in the meantime.
All good
Thx, David
We deployed some new code a couple of days ago into our AWS infrastructure and within a few hours we began receiving alarms that our CPUs were running at 100% across our instances.
Our node/express server runs as a cluster across 8 CPUs on multiple instances. As the problem continues, more and more of the CPUs get stuck at 100%. We've been able to prove that by locking mysql2@2.3.0 the problem doesn't exist. However, with the new 2.3.2, the problem occurs.
The following image shows the instance on the left (in top command output) is running 2.3.0 mysql2 whereas the instance on the right is running 2.3.2.
Eventually, all CPUs will become 100% in the failing case.
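For context on the topology, it's roughly the standard one-worker-per-CPU cluster pattern (a sketch, not our actual code; the port and routes are placeholders):
const cluster = require('cluster');
const os = require('os');

if (cluster.isMaster) {
  // fork one express worker per CPU; each worker opens its own mysql2 connections
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
} else {
  const express = require('express');
  const app = express();
  app.get('/health', (req, res) => res.send('ok'));
  app.listen(3000);
}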
Thank you.