sidorares / node-mysql2

:zap: fast mysqljs/mysql compatible mysql driver for node.js
https://sidorares.github.io/node-mysql2/
MIT License

mysql2@2.3.2 seems to run CPU to 100% #1432

Closed. stavert closed this issue 2 years ago.

stavert commented 3 years ago

We deployed some new code a couple of days ago into our AWS infrastructure, and within a few hours we began receiving alarms that our CPUs were running at 100% across our instances.

Our node/express server runs as a cluster across 8 CPUs on multiple instances. As the problem continues, more and more of the CPUs get pinned at 100%. We've been able to prove that by locking mysql2@2.3.0 the problem doesn't exist. However, with the new 2.3.2, the problem occurs.

The following image shows the instance on the left (in top command output) is running 2.3.0 mysql2 whereas the instance on the right is running 2.3.2.

[screenshot: top output, 2.3.0 instance on the left vs 2.3.2 on the right]

Eventually, all CPUs will become 100% in the failing case.

Thank you.

sidorares commented 3 years ago

That's concerning; thanks for the report, @stavert.

How easy is it to replicate the issue? Do you have an automated integration/perf test that can show it? If it doesn't take much time, it would ideally be great to bisect it to a single commit.

Can you also share your setup (server version, node version, OS)? Do you use this module directly, or through an ORM/query builder?
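
For reference, one way to run that bisection (a sketch, assuming the repo's vX.Y.Z release tags; repro.js is a hypothetical script that exits non-zero once the CPU pins):

  git clone https://github.com/sidorares/node-mysql2
  cd node-mysql2
  git bisect start v2.3.2 v2.3.0   # bad revision first, then good
  git bisect run node ./repro.js   # repro.js is hypothetical, not part of the repo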

stavert commented 3 years ago

Hey @sidorares!

Thanks for getting back to me!

For us, it turns out it's very easy: just swap the version back and forth. As this was our prod environment, we really didn't have the luxury of debugging further once we realized this package was the culprit, so unfortunately I can't yet tell you how to recreate it in a small test case. We never really observed it in our QA environment, but our QA machine is an "all-in-one" environment with only a couple of processors.

Here's some additional info...

We run Node.js v12.14.1.

Installed sequelize:

jubi.api@3.11.4 /var/www/jubi.api
└── sequelize@5.22.4

We have a ticket to move sequelize to v6 but haven't had a chance to do it yet.

OS:

NAME="Ubuntu" VERSION="18.04.6 LTS (Bionic Beaver)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 18.04.6 LTS" VERSION_ID="18.04" HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" VERSION_CODENAME=bionic UBUNTU_CODENAME=bionic

Linux ip-172-31-46-194 4.15.0-161-generic #169-Ubuntu SMP Fri Oct 15 13:41:54 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

package.json dependencies (with exception of mysql2 locked at 2.3.0):

"@okta/jwt-verifier": "^2.3.0",
"@rali/lx-data-models": "1.0.55",
"async": "~0.9.0",
"atob": "^2.1.2",
"aws-sdk": "^2.814.0",
"bcrypt-nodejs": "~0.0.3",
"body-parser": "^1.18.2",
"btoa": "^1.2.1",
"compression": "^1.6.2",
"cookie-parser": "~1.3.3",
"csv": "^5.0.0",
"debug": "^2.6.9",
"ejs": "^3.1.6",
"email-validator": "~2.0.3",
"express": "^4.16.0",
"express-subdomain": "~1.0.2",
"fluent-ffmpeg": "~2.1.2",
"formidable": "~1.0.17",
"glob": "^5.0.15",
"gm": "^1.23.1",
"helmet": "^3.23.3",
"https": "^1.0.0",
"juice": "^8.0.0",
"less-middleware": "^3.1.0",
"log4js": "^6.3.0",
"mime": "^1.4.1",
"moment": "^2.29.1",
"mysql": "^2.18.1",
"mysql2": "2.3.0",
"node-cache": "^5.1.2",
"paralleljs": "^0.2.1",
"passport": "~0.2.0",
"passport-http": "~0.2.2",
"passport-http-bearer": "~1.0.1",
"passport-jwt": "^4.0.0",
"passport-saml": "^2.1.0",
"q": "~1.1.2",
"request": "^2.88.2",
"sequelize": "^5.22.3",
"sequelize-fixtures": "^0.8.0",
"serve-favicon": "^2.4.5",
"umzug": "^1.12.0",
"underscore": "^1.12.1",
"uuid": "^3.4.0",
"valid-url": "^1.0.9",
"xml2js": "^0.4.19"

Our sequelize connection pool config:

  pool: {
    max: 30,
    min: 5,
    idle: 1000000,
    acquire: 100000
  },
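
For context, here is a minimal sketch of where a pool block like that sits, assuming a standard Sequelize v5 constructor (all connection details below are placeholders, not the real ones):

  const Sequelize = require('sequelize');

  const sequelize = new Sequelize('dbname', 'user', 'password', {
    host: 'example-host.rds.amazonaws.com', // placeholder
    dialect: 'mysql', // Sequelize's mysql dialect drives mysql2 underneath
    pool: {
      max: 30,        // at most 30 concurrent connections
      min: 5,         // keep 5 connections warm
      idle: 1000000,  // ms a connection may sit idle before release
      acquire: 100000 // ms to wait for a free connection before erroring
    }
  });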

Our DB is AWS RDS MySQL, v5.7.33.

We have a lot of complicated SQL, some stored procedures, and even some raw inline SQL, but it's all called through sequelize.

Please let me know what else I can provide you.

Thanks very much!!

stavert commented 3 years ago

Oh, one other thing to mention... while I'm not a node deadlock/debugging expert, I did try to attach the inspector before working on the potential package changes during deployment. When I typed "pause", nothing showed up; no stack backtrace was observed.

sidorares commented 3 years ago

One more question: once it appears, does the 100% CPU load continue when you turn off the traffic? I.e., is this a severe performance regression, or a bug where it completely hangs in a loop?

stavert commented 3 years ago

For the most part, yes. If I remember correctly, as I de-registered one of the instances from the load balancer, the CPUs stayed at 100%. And come to think of it, we use forever to keep the node processes running; forever stopall orphaned all the processes that were running at 100% CPU, and I had to kill them manually.

We did originally think that somehow forever was causing the problem; then we also realized we had included a 10-year-old cluster npm package instead of just relying on the Node.js built-in. We removed that from package.json, but the problem persisted, and that's when I focused on DB connection stuff and got to mysql2.

stavert commented 3 years ago

Oh, to comment on performance: weirdly enough, our site was still relatively stable from an end-user perspective. Fortunately, our load was a little low; I was able to interact with our site with no observable difference. However, there are times when we have hundreds of users on the site, and I can imagine that if this had happened then, we would have had a significant problem on our hands.

It's still a little tough to say overall, though, because the logs throughout all this made it seem as if everything was running fine.

sidorares commented 3 years ago

I did a simple load test (an HTTP server returning 4 user rows, model from https://sequelize.org/master/manual/model-basics.html), and the bottleneck is mysql on my laptop.
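
Something like this reproduces that setup (a rough sketch; the model follows the linked docs, and all names/credentials are placeholders):

  const express = require('express');
  const { Sequelize, DataTypes } = require('sequelize');

  const sequelize = new Sequelize('test', 'root', 'password', {
    host: 'localhost',
    dialect: 'mysql' // backed by mysql2
  });
  const User = sequelize.define('User', {
    firstName: DataTypes.STRING,
    lastName: DataTypes.STRING
  });

  const app = express();
  app.get('/', async (req, res) => {
    res.json(await User.findAll()); // the 4 seeded user rows
  });
  app.listen(3000);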

6 cores; load-test command: wrk -t 10 -c 20 -d 1m http://localhost:3000/

[screenshot]

Running 1m test @ http://localhost:3000/
  10 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     7.26ms    2.46ms 104.72ms   97.39%
    Req/Sec   279.11     31.18   360.00     76.75%
  166942 requests in 1.00m, 109.22MB read
Requests/sec:   2778.66
Transfer/sec:      1.82MB

Using the driver directly results in roughly 40% better performance: 3906.30 req/sec.

ddolcimascolo commented 3 years ago

Hi everyone,

I'm in the same boat after having reported #1418 and upgrading to 2.3.2. My APM charts are self-explanatory, showing both increased CPU usage and increased memory usage. I can also confirm node processes jumping to 100% CPU for 5 or 10 minutes, then getting back to normal. We also deploy a lot, so node processes restart via pm2-managed rolling restarts.

[screenshots: APM CPU and memory charts]

The service is still up; we're just seeing higher load and memory usage.

ddolcimascolo commented 3 years ago

@stavert Do you use JSON columns in MySQL? @sidorares Do you need any help? This problem is concerning for production services relying on your driver :)

stavert commented 3 years ago

Hi,

Not generally as far as I can tell.

I'm traveling on business through Thursday this week and might not get much chance to perform further debugging/isolation.

I will try to collect more info the first chance I get.


sidorares commented 3 years ago

> @sidorares Do you need any help?

@ddolcimascolo yes, I need help replicating the issue. My simple initial tests were unsuccessful in finding a regression. It might be that the issue is specific to a type or typeCast configuration, or to the driver + sequelize combo; not sure yet. Once I can replicate it, it will be much easier for me to help find and fix the root cause.

stavert commented 3 years ago

Hi All,

I made a little progress, with limited time today, in recreating this in a smaller, unused environment (non-production).

This was a t2 instance with just 4 CPUs, and I was able to get one CPU to spin.

I'm hoping I can find more time tomorrow in this environment to begin to isolate further.

sidorares commented 3 years ago

many thanks @stavert , appreciate the effort

stavert commented 3 years ago

Hey All,

So I can recreate this issue at will on this test machine. It's very, very easy: I just start the service, get on my site, jump through a few pages, and at least one CPU hits 100%. I've been playing with trying to connect to the node inspector remotely, tick log processing, core dumps, etc., but I just haven't gotten the "secret sauce" yet for a debugging experience that would give me confidence that I've narrowed down the problem (or a recreation) for you. Each page on our site performs a lot of SQL via sequelize, so it's just difficult to narrow down.

@sidorares what advice could you give me to debug this further? Perhaps we can jump on a call together to try to figure it out?

Let me know how you want to proceed. Thanks.

sidorares commented 3 years ago

@stavert can you start with node --prof, recreate the 100% CPU, wait maybe half a minute, and then run node --prof-process to process the profiler log and post the results here (please make sure there is no sensitive information in the log before posting: function names, etc.)?
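
For reference, that workflow looks roughly like this (a sketch; server.js is a hypothetical entry point, and the isolate log file name varies per process):

  node --prof server.js
  # drive traffic until a CPU pins, wait ~30 seconds, then stop the process
  node --prof-process isolate-0x*-v8.log > processed.txt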

stavert commented 3 years ago

Will do. I actually already have it but need to scrub it.

I do see a lot of stuff related to the connection manager. I'll try to post at least the relevant sections tomorrow.

stavert commented 3 years ago

@sidorares I think I've got everything redacted where it needed to be :-)

I do see a lot of this type of thing below the connection manager:

 35    1.0%    LazyCompile: *_typecast APP_PATH_REDACTED/node_modules/sequelize/lib/dialects/mysql/connection-manager.js:40:19
 11   31.4%      LazyCompile: *next :2608:9
  6   54.5%        LazyCompile: *row APP_PATH_REDACTED/node_modules/mysql2/lib/commands/query.js:230:6
  6  100.0%          LazyCompile: *handlePacket APP_PATH_REDACTED/node_modules/mysql2/lib/connection.js:400:15
  6  100.0%            LazyCompile: *<anonymous> APP_PATH_REDACTED/node_modules/mysql2/lib/connection.js:84:42

I hope this gives you what you need. Happy hunting!!


mysql2.perf.log

sidorares commented 3 years ago

The majority of the time the profiler sees is spent in application code:

9931 100.0% LazyCompile: ~compareSync APP_PATH_REDACTED/node_modules/bcrypt/bcrypt.js:164:50
9864 99.3% LazyCompile: ~ APP_PATH_REDACTED/controllers/auth.js:71:22

@stavert can you check whether this might be called differently depending on the mysql2 version?

sidorares commented 3 years ago

I can see that sequelize always uses typeCast (https://github.com/sequelize/sequelize/blob/d1a2572f2b5811cdb8ffbd20e285db44dfc539ca/lib/dialects/mysql/connection-manager.js#L37-L42), which is unfortunate.
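
For context, a typeCast callback in mysql2 looks roughly like this (a minimal sketch with placeholder connection details, not the sequelize code itself):

  const mysql = require('mysql2');

  const connection = mysql.createConnection({
    host: 'localhost', // placeholder details
    user: 'app',
    database: 'test',
    typeCast(field, next) {
      // Runs for every column of every row, so any per-field overhead
      // multiplies quickly on wide result sets.
      if (field.type === 'DATETIME') {
        return field.string(); // e.g. keep datetimes as raw strings
      }
      return next(); // fall back to the driver's default conversion
    }
  });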

I wonder if #1402 is a contributor to the perf regression. @testn WDYT?

testn commented 3 years ago

The optimization I did mainly reduces the CPU consumption that _typeCast uses. My internal application sees at least a 100% performance improvement for queries that return a lot of rows.

testn commented 3 years ago

@sidorares @stavert I concur with @sidorares's evaluation. The code seems to spin on bcrypt almost 100% of the time. You should probably not use compareSync, as it blocks the event loop.
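
For illustration, the async form moves that work off the event loop (a sketch assuming the bcrypt npm package seen in the profile; the variable names are hypothetical):

  const bcrypt = require('bcrypt');

  // Blocking: the whole key derivation runs on the event-loop thread.
  // const ok = bcrypt.compareSync(password, storedHash);

  // Non-blocking: the work runs on the libuv thread pool instead.
  bcrypt.compare(password, storedHash, (err, ok) => {
    if (err) throw err;
    // continue with the authentication result `ok`
  });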

stavert commented 3 years ago

I installed mysql2@2.3.0, restarted, and took another profile snapshot. Everything runs just fine at this version.

The patterns look very much different:

mysql2-2.3.0-perf.log

testn commented 3 years ago

@stavert do you mind jumping on gitter.im so we can chat? I need some information from you.

stavert commented 3 years ago

@testn I can certainly explore that. However, you should know that this code (compareSync) is legacy and has been running for the better part of 5 years. The only thing that has changed here is the version of mysql2.

Another observation is that memory usage is going up on this node process, similar to what @ddolcimascolo observed.

I wonder what @ddolcimascolo's profiling would show?

ddolcimascolo commented 3 years ago

I can probably help with a profile, but not before next week as I'm currently off for a few days.

From what I understand of the discussion above, you think that 2.3.2 might be revealing bad application code on our side, triggering the 100% CPU? A frozen event loop or similar? Can you pinpoint a specific change in the driver that introduces it?

Thx

testn commented 3 years ago

> I can probably help with a profile, but not before next week as I'm currently off for a few days.
>
> From what I understand of the discussion above, you think that 2.3.2 might be revealing bad application code on our side, triggering the 100% CPU? A frozen event loop or similar? Can you pinpoint a specific change in the driver that introduces it?
>
> Thx

That'd be nice. The key changes in 2.3.2 are:

  1. Caching the column_definition attribute to make the lookup much faster. This code path is exercised heavily by Sequelize, since it uses typeCast all the time.
  2. TextRow/BinaryRow no longer returns the TextRow/BinaryRow object directly, to avoid unintentionally exposing mysql2's internal representation; a new object is returned for each row instead, so nothing hangs on to the raw mysql buffer data.
  3. The string-decode code path now avoids calling slice() and then toString(); it calls toString() directly, yielding a small improvement (see the sketch below).
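
A small sketch of the decode change in (3), using a hypothetical buffer:

  const buf = Buffer.from('hello world');

  // Before: allocate an intermediate Buffer slice, then decode it.
  const before = buf.slice(0, 5).toString('utf8');

  // After: decode the byte range in place, skipping the intermediate object.
  const after = buf.toString('utf8', 0, 5); // both yield 'hello'
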
testn commented 3 years ago

@ddolcimascolo Thanks! It would be nice if you can profile your application and send it over.

ddolcimascolo commented 3 years ago

Hi guys, I'm still on leave but my team is reporting that the CPU and memory usage is finally having an impact on the service... We might decide to revert to 2.3.0 until you find the problem.

Just letting you know

testn commented 3 years ago

> Hi guys, I'm still on leave but my team is reporting that the CPU and memory usage is finally having an impact on the service... We might decide to revert to 2.3.0 until you find the problem.
>
> Just letting you know

@ddolcimascolo can you ask your team to profile the system for you?

stavert commented 3 years ago

@testn I've sent you a new isolate log which doesn't contain any bcrypt.

testn commented 3 years ago

@stavert, your log looks normal to me. Do you still see the CPU pinned at 100%?

I also tried using the group test to benchmark the differences between v2.3.2 and v2.3.0, and it seems to show a positive impact.

$ mocha test\integration\model\findAll\group.test.js <------------- 2.3.2

[MYSQL] Model findAll group
  √ should correctly group with attributes, #3009 (8963ms)
  √ should not add primary key when grouping using a belongsTo association (78ms)

2 passing (9s)

$ mocha test\integration\model\findAll\group.test.js <--------------- 2.3.0

[MYSQL] Model findAll group
  √ should correctly group with attributes, #3009 (10476ms)
  √ should not add primary key when grouping using a belongsTo association (89ms)

2 passing (11s)

stavert commented 3 years ago

Yup...still pinned:

[screenshot: top output, CPU still pinned at 100%]

stavert commented 3 years ago

And another snapshot a few mins later:

[screenshot: top output a few minutes later]

stavert commented 3 years ago

And another... just to capture it (and you can see memory usage going up as well):

[screenshot: top output, memory usage rising]

testn commented 3 years ago

@stavert you can add --trace-gc when you start up to see if the GC behavior is really different.
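
For reference, that looks like this (a sketch; server.js is a hypothetical entry point, and the GC traces print to stdout):

  node --trace-gc server.js > gc.log   # app output is mixed into the same stream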

stavert commented 3 years ago

I will try that. In the meantime, this machine was not running in cluster mode. I wanted to mimic more of my prod configuration, so I switched over to cluster mode and, as you can see, all 4 CPUs go to 100%:

[screenshot: top output, all 4 CPUs at 100%]

Can you guide me on a specific way to get a backtrace directly from the process? Please be as specific as you can, as I don't do this much.

testn commented 3 years ago

I sent you an email with instructions on how to do so.

testn commented 3 years ago

One thing I notice: your query might actually have more than 300 columns, right?

stavert commented 3 years ago

Our queries are large but I don’t think more than 300 columns.


stavert commented 3 years ago

@testn I didn't receive any email from you...

testn commented 3 years ago

Try recording your application profile with perf. It should yield more detail on what node is spending time on:

https://www.yld.io/blog/cpu-and-i-o-performance-diagnostics-in-node-js/
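
The basic workflow from that post looks roughly like this (a sketch; the flags are standard perf/Node options, and server.js is a hypothetical entry point):

  # --perf-basic-prof makes node emit a map file so perf can resolve JIT-compiled JS frames
  perf record -F 99 -g -- node --perf-basic-prof server.js
  # reproduce the 100% CPU, stop with Ctrl-C, then inspect
  perf report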

ddolcimascolo commented 3 years ago

> Hi guys, I'm still on leave but my team is reporting that the CPU and memory usage is finally having an impact on the service... We might decide to revert to 2.3.0 until you find the problem. Just letting you know
>
> @ddolcimascolo can you ask your team to profile the system for you?

Sorry no.

Are you in a position to find the problem with all the details provided by @stavert? It looks like it's fairly easy to reproduce both the CPU and memory leaks... Isn't it time to start bisecting the changes between 2.3.0 and 2.3.2?

Thx

yuxizhe commented 3 years ago

@stavert @ddolcimascolo do you use ua-parser-js in your project?

https://github.com/faisalman/ua-parser-js/issues/536

https://www.rapid7.com/blog/post/2021/10/25/npm-library-ua-parser-js-hijacked-what-you-need-to-know/

For approximately 4 hours on Friday, October 22, 2021, a widely utilized NPM package, ua-parser-js, was embedded with a malicious script intended to install a coinminer and harvest user/credential information. This package is used “to detect Browser, Engine, OS, CPU, and Device type/model from User-Agent data,” with nearly 8 million weekly downloads and 1,200 dependencies.

ddolcimascolo commented 3 years ago

Hi @yuxizhe

Through karma, yes. But we have version 0.7.30 in the tree, so a safe version. Also, it's a dev dependency, so it's not even installed in production.

Thx for reporting anyway

ddolcimascolo commented 3 years ago

Hi @testn @sidorares

I've made a little progress: I managed to attach to a running instance stuck at 100% CPU and collected a few details.

[1] [screenshot]

[2]

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.71    0.091903          30      3079           epoll_wait
  0.11    0.000100           1       102           futex
  0.06    0.000059           1        43           read
  0.05    0.000043           1        37           write
  0.05    0.000042           1        38           mprotect
  0.01    0.000012           6         2           openat
  0.01    0.000006           0        30           epoll_ctl
  0.01    0.000005           3         2           close
  0.00    0.000003           0        11           getpid
  0.00    0.000000           0         1           writev
------ ----------- ----------- --------- --------- ----------------
100.00    0.092173                  3345           total
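
(For reference, a summary like [2] typically comes from attaching strace in counting mode to the spinning process; the comment doesn't state the exact invocation, so this is an assumption:)

  strace -c -p <pid>   # attach to the pinned process, wait ~10s, Ctrl-C prints the summary
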
testn commented 3 years ago

Which version of nodejs and the Linux kernel do you use? Older versions have a bug that can cause epoll_wait to spin.

ddolcimascolo commented 3 years ago

Node 14.15 on Ubuntu 18.04 (I can confirm the Kernel version later, but I'm pretty sure it's the 5.3 line)

testn commented 3 years ago

That should be new enough to have fixed most of the epoll problems. Can you do a perf record as per the instructions I posted for stavert earlier? I would like to see what leads to epoll_wait. It could be that MySQL2 2.3.2 is now fast enough to deplete the socket buffer and cause nodejs to wait for the data.

stavert commented 3 years ago

No ua-parser-js in use in our case. Thanks.


ddolcimascolo commented 3 years ago

Naaah. Node is all about async I/O, not spinning on the event loop, etc. I don't believe spiking at 100% CPU is something to expect...

Running the perf tool is more complicated for me, as I'll have to run it in a non-production environment, and I can't do that right now. I'll try to keep you posted later this week.