sidorares / node-mysql2

:zap: fast mysqljs/mysql compatible mysql driver for node.js
https://sidorares.github.io/node-mysql2/

mysql2@2.3.2 seems to run CPU to 100% #1432

Closed · stavert closed this issue 2 years ago

stavert commented 3 years ago

We deployed some new code a couple of days ago into our AWS infrastructure and within a few hours we began receiving alarms of our CPUs running at 100% across our instances.

Our node/express server runs as a cluster across 8 CPUs on multiple instances. As the problem continues, more and more of the CPUs get pinned at 100%. We've been able to confirm that with mysql2 locked to 2.3.0 the problem doesn't exist; with the new 2.3.2, however, it does.

The following image shows the instance on the left (in top command output) is running 2.3.0 mysql2 whereas the instance on the right is running 2.3.2.

[image: top output from the 2.3.0 instance (left) and the 2.3.2 instance (right)]

Eventually, all CPUs will become 100% in the failing case.

Thank you.

stavert commented 3 years ago

Hey All,

I've finally found some time to play around some more this evening. I've been able to narrow it down to a specific query that causes the CPU to spike. @testn I've actually found that this old section of code (probably through a lot of schema extensions over the years) is, in fact, querying/returning 313 columns. My apologies for saying that I didn't think we would have over 300. I have an API whittled down to just this single query, which causes the issue.

If I stop interacting with my test site/API and let the instance just sit for approximately 20 minutes, the CPU will settle down. I had strace running and happened to capture the following right as the node process settled:

strace.log

It would probably take a large effort to rewrite this query if the joins are the problem but, certainly, I think we could whittle down the attributes being returned so we're not returning entire tables' worth of columns.

I will be consulting with some of our DB experts further.

Of course, as bad as this query appears to be, I still have to figure out which package change(s) caused this to rear its ugly head so we can drive to a full explanation, or even a fix.

@ddolcimascolo Can you speak to your patterns at all to see if you're seeing anything similar?

Ain't it fun!!

stavert commented 3 years ago

Another strange observation....

When I let the CPU settle, it becomes harder and harder to get the CPU to peg. I'm now hitting this API and the problem will not occur. It seems that if I start my node process from scratch, I can hit this API/query maybe two times and the CPU pegs. Wait 20 minutes for it to settle, try again, and it takes longer to reproduce. I then waited another 20 minutes for it to settle and was unable to recreate the problem at all.

sidorares commented 3 years ago

@testn @ddolcimascolo 99% of time in epoll_wait seems expected to me - this is how the event loop in node works; most of the time the process is sleeping, waiting for timeouts/network events on file handles.

@testn could the above comments be a sign of GC pressure? I'll try to recreate a load test with sequelize and 300+ columns.
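
For what it's worth, here is a minimal sketch of such a load test, using mysql2 directly rather than sequelize; the connection settings, the 320-column count, and the iteration count are placeholders rather than values taken from this thread:

const mysql = require('mysql2/promise');

async function main() {
  const conn = await mysql.createConnection({
    host: 'localhost',   // assumption: a local test database
    user: 'root',
    password: 'secret',
    database: 'test',
  });

  // Build a SELECT returning 320 literal columns (no table needed).
  const cols = Array.from({ length: 320 }, (_, i) => `1 AS col_${i}`).join(', ');
  const sql = `SELECT ${cols}`;

  const start = process.hrtime.bigint();
  for (let i = 0; i < 10000; i++) {
    await conn.query(sql);
  }
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`10000 x 320-column query: ${elapsedMs.toFixed(0)} ms`);

  await conn.end();
}

main().catch(console.error);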

testn commented 3 years ago

The profiler trace from @stavert seems to indicate that GC time is only around 1%. It would be nice if @stavert could add --trace-gc and compare both versions. The fact that the new code iterates through each row much faster could cause higher churn on the GC.

testn commented 3 years ago

@stavert if you can email me a memory heap profile taken while the CPU is pegged, that would be great.
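
If attaching Chrome is awkward, one alternative (a sketch, assuming Node 11.13+ where v8.writeHeapSnapshot is available; the choice of SIGUSR2 is arbitrary) is to have the worker write a snapshot on a signal and send that file instead:

const v8 = require('v8');

// On `kill -USR2 <pid>`, synchronously write a .heapsnapshot file into the
// process working directory; the file can later be loaded into Chrome DevTools.
process.on('SIGUSR2', () => {
  const file = v8.writeHeapSnapshot();
  console.log(`heap snapshot written to ${file}`);
});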

ddolcimascolo commented 3 years ago

I can confirm the "20 minutes to settle" pattern, though the actual duration might vary a bit. I'll check our queries for the number of returned columns, but we do lots of joins, so that'd be possible, I suppose.

testn commented 3 years ago

@stavert I think you have to run strace with the '-F' parameter to make it follow the other threads in the same process. Your strace does not seem to have any detail at all.

testn commented 3 years ago

@sidorares garbage collection time seems to be <2%. I doubt that it is the GC pressure.

ddolcimascolo commented 3 years ago

@sidorares @testn @stavert Just to let you know that at 2pm (Paris time) today my production instances all went to 100%, leading to degraded service, up to the point where we needed to restart the service. Running strace very quickly before restarting showed the epoll_wait loop but nothing else. I'm starting to think this epoll_wait stuff is a dead end, as @sidorares suggested.

I've scheduled a revert to v2.3.0 for next Friday, hoping you find something before that.

Cheers, David

ddolcimascolo commented 3 years ago

Guys, I focused on the memory and found this using "Allocation sampling" in the Chrome debugger, attached to a process stuck at 100% CPU:

[image: allocation sampling profile]

Lots of allocations are done by mysql2 in parser, readField, etc. This looks very similar to the changes committed between 2.3.0 and 2.3.2, as far as I can tell from @testn's comments above...

Here's the complete profile

Heap-20211102T161324.heapprofile.txt

EDIT: And memory usage keeps growing until I manually restart the process.

testn commented 3 years ago

Can you do a heap dump?

testn commented 3 years ago

@ddolcimascolo how many fields do you have in your query?

ddolcimascolo commented 3 years ago

Can you do a heap dump?

I tried; it makes the process crash every time.

stavert commented 3 years ago

Hey Guys,

I've now run through git bisect twice. I've not actually used bisect before, but I think I've done it correctly. It turns out the issue was also present in v2.3.1, so I only had to bisect between v2.3.0 and v2.3.1. Twice now this commit has turned out to be the offender:

$ git bisect good
0ba4f87eb9a08b43d993c8f3db1a524c96933d8f is the first bad commit
commit 0ba4f87eb9a08b43d993c8f3db1a524c96933d8f
Author: testn <test1@doramail.com>
Date:   Sat Oct 9 20:38:55 2021 +0700

    Convert function into class to cache the field definition

:040000 040000 fec4611f80b781b038fa14abf08f9038aed62865 9a00e31b07562e20b75cdd55177e6469d2aa2388 M lib
:040000 040000 065ac174f3fbe43ea7e26afec7ca7ba8902e01b6 bb639caab24f152ca527932b689856c4e828693b M test

The steps I took (so you can check my work):

git bisect start
git bisect bad v2.3.1
git bisect good v2.3.0
then, for each step: zip the files, push them to the server, restart it, and hit my failing API

The first 2 iterations were good, so I marked them "git bisect good". The 3rd and 4th iterations were bad, so I marked them "git bisect bad". The 5th iteration was good; when I marked it, git gave me the output above.

I've been having trouble getting Chrome to attach remotely for heap so I wanted to try today to get to the offending commit if possible.

I really hope this is helpful!

Steve

ddolcimascolo commented 3 years ago

@ddolcimascolo how many fields do you have in your query?

I actually haven't isolated the query yet. I just find processes at 100% CPU every time I check.

ddolcimascolo commented 3 years ago

@stavert That's great, thx for your analysis. I hope @sidorares and @testn can look into it soon. Maybe my info about memory helps, too.

What was the original problem that you tried to solve with this commit @testn?

stavert commented 3 years ago

@sidorares did wonder about #1402 originally.

stavert commented 3 years ago

sorry, mouse slipped...didn't mean to close...

testn commented 3 years ago

@stavert , can you pull this change and see if it fixes the problem for you?

https://github.com/testn/node-mysql2/commit/232988382a3afc362975dffc25a64a98e80acff5

If it does fix it, I think the problem that you and @ddolcimascolo saw was probably something to do with the v8 compiler not handling long functions very well. This change simply removes the unrolling of the typecast wrapper and makes the constructor function much shorter.
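
To illustrate the general idea (this is only a sketch, not the actual mysql2 parser code; fields, packet, and typecast are stand-in names): an "unrolled" generated parser emits one statement per column, so its body grows with the column count, while a loop-based parser keeps a fixed-size body that the optimizer can always handle.

// Sketch only - illustrates unrolled code generation vs. a fixed-size loop.
// "Unrolled": one generated statement per column, so the function body grows
// with the number of columns and may become too long for the optimizer.
function makeUnrolledParser(fields) {
  const lines = fields.map(
    (f, i) => `row[${JSON.stringify(f.name)}] = typecast(fields[${i}], packet);`
  );
  return new Function(
    'fields', 'packet', 'typecast',
    `const row = {};\n${lines.join('\n')}\nreturn row;`
  );
}

// Loop variant: the function body has a fixed size regardless of column count.
function makeLoopParser(fields) {
  return function parseRow(packet, typecast) {
    const row = {};
    for (let i = 0; i < fields.length; i++) {
      row[fields[i].name] = typecast(fields[i], packet);
    }
    return row;
  };
}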

sidorares commented 3 years ago

@testn that seems like a very likely scenario for what is happening: the generated source code becomes too long to be handled efficiently by the optimising compiler and performance degrades significantly. The length of a generated parser function is proportional to the number of columns; there is probably a max threshold after which it's never optimised.

need to check https://github.com/vhf/v8-bailout-reasons and maybe use example code from https://github.com/petkaantonov/bluebird/wiki/Optimization-killers

// note: %GetOptimizationStatus is a V8 native, so node must be run with --allow-natives-syntax
function printStatus(fn) {
    switch(%GetOptimizationStatus(fn)) {
        case 1: console.log("Function is optimized"); break;
        case 2: console.log("Function is not optimized"); break;
        case 3: console.log("Function is always optimized"); break;
        case 4: console.log("Function is never optimized"); break;
        case 6: console.log("Function is maybe deoptimized"); break;
        case 7: console.log("Function is optimized by TurboFan"); break;
        default: console.log("Unknown optimization status"); break;
    }
}
stavert commented 3 years ago

@testn I feel like an idiot, but this is really the first time I've worked with GitHub. I'm having trouble pulling your commit. I've tried a fetch, then a cherry-pick of the commit, but I don't see your branch locally; I keep getting "bad object".

sidorares commented 3 years ago

@stavert you can npm install a branch/commit - https://stackoverflow.com/questions/14187956/npm-install-from-git-in-a-specific-version/18049685

sidorares commented 3 years ago
{
  "dependencies": {
      "mysql2": "git@github.com:testn/node-mysql2.git#232988382a3afc362975dffc25a64a98e80acff5"
  }
}
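
For a public repo this should also work over plain https via npm's GitHub shorthand, e.g. "mysql2": "github:testn/node-mysql2#232988382a3afc362975dffc25a64a98e80acff5", which avoids needing SSH keys.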
stavert commented 3 years ago

@sidorares I'm not having much luck, as I'm just not set up to pull from GitHub from either the test server or my Windows laptop. I've been trying to get things working by following:

https://docs.github.com/en/authentication/troubleshooting-ssh/error-permission-denied-publickey

but I'm just not having any luck getting it set up. Is there no other way that I can pull this commit into my local repo so I can zip up the directory and move it to my server?

testn commented 3 years ago

you can simply copy query.js and text_parser.js into your folder.

stavert commented 3 years ago

got it...testing now....need to run some positive and negative tests....

stavert commented 3 years ago

@sidorares @testn @ddolcimascolo It looks good!!!!!

I've swapped the two files back and forth about 3 times now and it's either working or failing depending on which files I have in place. You guys rock!

sidorares commented 3 years ago

@testn can you make a PR with your change? Also maybe add a "lots of columns" case to the perf benchmark. Once merged, I'll cut a 2.3.3 release.

testn commented 3 years ago

@sidorares I already tried with a couple hundred columns on my machine, but I cannot replicate the problem. I'm not sure whether it is platform-dependent, though.

testn commented 3 years ago

@stavert do you see any perf improvement between 2.3.0 and the patched version?

stavert commented 3 years ago

I'll check tomorrow and let you know...


testn commented 3 years ago

@sidorares let me try to find a way to re-create the problem.

stavert commented 3 years ago

@testn I'm very sorry, but I have not had any chance today to do perf comparisons. Hopefully tomorrow.

stavert commented 3 years ago

@testn Ran some raw perf tests with curl on the instance directly (localhost): 100 iterations of curl against my full API with all queries. I see, on average, about a 10% improvement. Here are some logs:

2.3.0.perf.log 2.3.2-patched.perf.log

It was nice to blast through the 100 calls, watch the CPU go up but never really peg and then immediately drop once the script was done.

stavert commented 3 years ago

Here are some additional logs from an API that takes a little longer:

2.3.0-new.perf.log 2.3.2-patched--new.perf.log

ddolcimascolo commented 3 years ago

@sidorares Any chance you can release a version including the patch soon, like, today?

Thx!

sidorares commented 3 years ago

@ddolcimascolo I'll push -rc to npm today

testn commented 3 years ago

@ddolcimascolo I'll push -rc to npm today

do you want me to send the PR?

testn commented 3 years ago

btw, I'm still trying to recreate the problem

sidorares commented 3 years ago

yes please if you have time @testn. We also need to fix the failing coverage job (or at least disable it for now to unblock other PRs; for some reason https://github.com/ewjoachim/coverage-comment-action is failing).

I think it's worth publishing an rc even if we're not 100% sure that it fixes the root cause.

testn commented 3 years ago

@testn that seems like a very likely scenario for what is happening: the generated source code becomes too long to be handled efficiently by optimising compiler and performance degrades significantly. Length of a generated parser function is proportional to number of columns, there is probably a max threshold after which it's never optimised

I think Crankshaft is not used anymore. It must be a bug in TurboFan that I cannot find yet.

sidorares commented 3 years ago

@ddolcimascolo published v2.3.3-rc.0

testn commented 3 years ago

@stavert @ddolcimascolo can you confirm that it fixes the problem?

ddolcimascolo commented 3 years ago

I'm sorry guys, quite a busy day here... I was actually expecting renovate (the tool we're using to manage dependency upgrades) to pick up the new version automatically, so I didn't bother that much while waiting for the release. For a reason I don't get, it didn't :(

I will manually upgrade, but not before next week. I'll try to keep you posted.

Cheers

David

ddolcimascolo commented 3 years ago

Hi all,

I've rolled out v2.3.3-rc.0 at 0:00 (Paris time); so far so good, as per the attached charts. [image: CPU charts]

Cheers, David

testn commented 3 years ago

@sidorares Sorry for all the trouble, but let's cut 2.3.3?

sidorares commented 3 years ago

yea, seems like it solved the main problem. Any luck reliably reproducing it so we can add it to a perf or unit test?

testn commented 3 years ago

yea, seems like it solved the main problem. Any luck reliably reproducing it so we can add it to a perf or unit test?

Sadly, I'm still trying to reproduce the problem. It looks like it has to hit a certain threshold before it can happen. It might also be a version- or platform-specific problem, since I cannot reproduce it on Node 16 on Windows yet.

sidorares commented 3 years ago

@sidorares Sorry for all the trouble, but let's cut 2.3.3?

@testn done

ddolcimascolo commented 2 years ago

Hey guys, it's been a week without issues. Renovate did pick up v2.3.3, which you released in the meantime.

All good

Thx, David