tediousjs / node-mssql

Microsoft SQL Server client for Node.js
https://tediousjs.github.io/node-mssql
MIT License

Connections exhausted & not returning to pool all of a sudden #1473

Closed isaackogan closed 1 year ago

isaackogan commented 1 year ago

Hello!

This is not a bug report, but a general question asking for help. I don't know how best to diagnose my issue or where to take it, and would appreciate any help that can be offered.

Primary Issue

I am running a Node.js application built around a Socket.IO instance. Data is pumped into the app through Socket.IO, and node-mssql writes the received data to a database.

My issue is that scaling the program up to 300 concurrent Socket.IO connections, all feeding data, creates a bottleneck in database queries. This results in a nondescript timeout error:

Error: operation timed out for an unknown reason
    at C:\Users\jeanm\WebstormProjects\TikTokResearchNode\node_modules\tarn\dist\PendingOperation.js:17:27
    at runNextTicks (node:internal/process/task_queues:61:5)
    at listOnTimeout (node:internal/timers:528:9)
    at processTimers (node:internal/timers:502:7)

My Queries

Here are the queries my project makes use of:

```js
export const SQL_USER_ID_QUERY = (`
    SELECT id FROM users WHERE unique_id=?
`);

export const SQL_ROOM_EXISTS_QUERY = (`
    SELECT room_id FROM streams WHERE room_id=?
`);

export const SQL_IMAGE_BADGE_ID_QUERY = (`
    SELECT id FROM image_badges WHERE url=?
`);

export const SQL_TEXT_BADGE_ID_QUERY = (`
    SELECT id FROM text_badges WHERE label=?
`);

export const SQL_INSERT_SNAPSHOT_IMAGE_BADGE_QUERY = (`
    INSERT INTO snapshot_image_badges (user_snapshot_id, image_badge_id)
    VALUES (?, ?)
`);

export const SQL_INSERT_SNAPSHOT_TEXT_BADGE_QUERY = (`
    INSERT INTO snapshot_text_badges (user_snapshot_id, text_badge_id)
    VALUES (?, ?)
`);

export const SQL_INSERT_TEXT_BADGE_QUERY = (`
    INSERT INTO text_badges (label, name)
    OUTPUT INSERTED.id
    VALUES (?, ?)
`);

export const SQL_INSERT_IMAGE_BADGE_QUERY = (`
    INSERT INTO image_badges (url, label, name)
    OUTPUT INSERTED.id
    VALUES (?, ?, ?)
`);

export const SQL_INSERT_USER_QUERY = (`
    INSERT INTO users (unique_id)
    OUTPUT INSERTED.id
    VALUES (?)
`);

export const SQL_INSERT_MIC_BATTLE_QUERY = (`
    INSERT INTO mic_battle (room_id, status)
    OUTPUT INSERTED.id
    VALUES (?, ?)
`);

export const SQL_INSERT_ARMY_SNAPSHOT_QUERY = (`
    INSERT INTO army_snapshots (mic_battle_id, host_user_id, points)
    OUTPUT INSERTED.id
    VALUES (?, ?, ?)
`);
```

There is one additional query in the spoiler below.

Additional Query:

```js
const result = await this.pool.query`
    INSERT INTO snapshots (
        user_id, room_id, tiktok_id, nickname,
        is_following, is_friend, is_new_gifter, is_moderator, is_subscriber,
        top_gifter_rank, gifter_level, following_count, follower_count
    )
    OUTPUT INSERTED.id
    VALUES (
        ${user_id},
        ${room_id},
        ${user["user_id"]},
        ${user["nickname"]},
        ${parseBoolAsInt(user["is_following"])},
        ${parseBoolAsInt(user["is_friend"])},
        ${parseBoolAsInt(user["is_new_gifter"])},
        ${parseBoolAsInt(user["is_moderator"])},
        ${parseBoolAsInt(user["is_subscriber"])},
        ${user["top_gifter_rank"] == null ? null : parseInt(user["top_gifter_rank"])},
        ${user["gifter_level"]},
        ${user?.["info"]?.["following"]},
        ${user?.["info"]?.["followers"]}
    )
`;
```

Note that there are also several dynamically generated queries. These all follow the generation schema outlined below:

Query Generator:

```js
static #build_query(table_name, keys = [], return_id = false) {
    return `
        INSERT INTO "${table_name}" (${keys.join(', ')})
        ${return_id ? 'OUTPUT INSERTED.id' : ''}
        VALUES (${"?".repeat(keys.length).split("").join(", ")})
    `;
}
```
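For illustration, a standalone mirror of that generator (buildQuery here is a hypothetical public version of the private method above) produces output like this:

```js
function buildQuery(tableName, keys = [], returnId = false) {
    return `
        INSERT INTO "${tableName}" (${keys.join(', ')})
        ${returnId ? 'OUTPUT INSERTED.id' : ''}
        VALUES (${"?".repeat(keys.length).split("").join(", ")})
    `;
}

console.log(buildQuery('users', ['unique_id', 'nickname'], true));
// INSERT INTO "users" (unique_id, nickname)
// OUTPUT INSERTED.id
// VALUES (?, ?)
```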

My Database

The following is my database design:

(image: database schema diagram)

What I've Tried

I read that increasing the connection pool size is a valid strategy, and with roughly 1,500 queries per second expected if things worked the way they should, a pool max of 10 was obviously not enough.

This is my config:

    {
        connectionString: 'Driver=ODBC Driver 17 for SQL Server;Server=localhost;Database=ResearchTest;Trusted_Connection=yes;',
        driver: 'msnodesqlv8',
        pool: {
            max: 16384,
            min: 256,
            idleTimeoutMillis: 30000
        }
    }

I just went for it, and gave it everything, to see if it was a matter of not enough connections. Well, surprise surprise, no, it's not.

What I've Debugged

Debugging what was happening showed me that the pool was steadily filling up to 5-10 "free" connections:

(screenshot: pool debug output, a handful of free connections)

This would stay stable for about 30 seconds. Then, within 5 seconds of the last screenshot, it spiked:

(screenshot: pool debug output after the spike, many connections in use or pending creation)

What I believe this is telling me is that connections stopped being returned to the pool, so node-mssql tried to compensate by creating new connections, except those were also not being returned. This continued for a long time, since the 'max' was 16384. Eventually, after 30 seconds, the timeout errors started being thrown.

What I Think The Cause Is

I do not have a lot of experience with MSSQL (in fact, these last few weeks are my first exposure to it). In general, I am not very skilled with SQL, but my project requires it and I am not afraid of a learning curve.

My observation is that connections are not being returned to the pool, and this causes timeouts. My hypothesis is that this would happen if a table is locked and hangs, though I am not sure.

An alternative hypothesis would be that really expensive queries cause a build-up that exhausts all connections.

My Questions

Essentially, I'm curious what people with more experience think. What are the potential reasons for what I've outlined, and do you have any ideas how I can fix it? I know it is not an issue with your library but with my code, and I am willing to provide any information if you feel you can help.

Is this a case of a table lock gone wrong? How would I diagnose and go about fixing that? If it's possible it's something else, what, and how can I find out?

I am also unsure how to debug something like this when there are thousands of queries interacting with the DB at a given time. How would I pin down the one where everything goes wrong, to even KNOW what has gone wrong?

Any help/a direction to go in would be greatly appreciated!

isaackogan commented 1 year ago

Asked on StackOverflow as well, as I am not sure which is preferable (I never really ask questions, there's always an SO answer!): https://stackoverflow.com/questions/75658222/sql-server-connections-exhausted

Related SO Questions: https://stackoverflow.com/questions/8749426/how-to-find-out-what-is-locking-my-tables

dhensby commented 1 year ago

Thanks for your clear and detailed description of the problem. Usually, I try to keep the issues in GitHub for bugs, but I’m happy to try and help where I can.

The error you're seeing comes from Tarn.js, the library we use under the hood to manage resource (connection) pooling. That means this is almost certainly a timeout in creating a new connection, not a timeout in fetching an existing one.

I think that the problem you have is connection/socket limits.

In Node.js there is a limit to how many open sockets Node can handle at a time; I believe this limit is determined by OS settings and other limitations (but it can also be configured directly; you'll have to do some searching). Basically, Node has an internal limit on how many open sockets it has, and that covers any type of socket: incoming/outgoing HTTP, WebSocket, sockets to SQL Server, redis, and so on. If that limit is hit, the next socket that needs to be opened will wait until another is closed.

I suspect this is what is happening for you because you’re hitting Tarn’s create timeout instead of having the connection forcibly reset (ECONNRESET error).

What can you do to mitigate this? The best thing to do is to pick a sensible max pool size (and 16,384 is not it ;) ). Firstly, you need to know what the theoretical maximum should be; this is determined by limits on your server. Every SQL Server will have a configured maximum number of connections it allows at any given time (e.g. Azure SQL lists them here); your DB admin should be able to tell you this limit. That number forms the basis of the maximum you can use, but it is not the number you want, because you don't want to exhaust the connection limit with one pool. If your application runs as a single instance on a single server, you will likely have only one pool, so you can use, say, 80% of the max connections. But if you have many load-balanced servers, or you run more than one node process at a time and load-balance those, then each process will have its own pool, and the sum of all the pools shouldn't exceed your theoretical maximum. For example, with a server limit of 100 connections and two load-balanced node processes, each pool's max should be around 40.

From the look of your debug screenshot, you have 89 in-use connections and then a lot waiting to be created. Those in-use connections will be freed up (I hope!), but with such a high max size the pool decides to create a new connection instead of waiting for one to be freed, causing a stack of never-to-connect connections to be created, hang, and reject. From what you've shown above, I would suggest starting at a max of 50.

If the problem is the number of sockets Node is set up to handle, you'll likely still run into this whenever your max pool size is more than 50% of the sockets Node can handle. If (for argument's sake) 1 connection in the pool can service 5 web socket connections, you'll run out of total sockets available to Node before the pool reaches its maximum size, and you'll hit the same timed-out connection create calls.

Once you've solved that issue, if you find you are still running into contention, I'd suspect the next problem is table locking. Because SQL Server is an ACID-compliant DBMS, each insert temporarily locks the table whilst it writes the rows. This means your write requests essentially run in sequence, not in parallel, and under high load you'll start to see requests queue up, including requests to the pool. If load becomes very high and the database can't keep up, you may then start to see timeouts.

If/when you hit that problem, you're going to need to look into more scalable solutions, be it read replicas (because inserts also block selects), more performant queries (maybe you can batch some of your queries into stored procedures), or deferred DB inserts (say, putting the data into a redis cache which then gets picked up by a worker and inserted into the database in batches). Ultimately, I suspect that an SQL database is not going to scale too well if you are running a single master server.
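To illustrate the deferred-insert idea, here is a minimal sketch of batching with this library's bulk API. It is only a sketch: the in-memory buffer stands in for redis, the column list is abbreviated, and enqueueSnapshot/flushSnapshots are hypothetical names. Note also that a bulk load cannot return OUTPUT INSERTED.id per row, so it only suits inserts where you don't need the generated ids immediately:

```js
const sql = require('mssql');

const buffer = [];

// Called from the hot path instead of issuing one INSERT per event.
function enqueueSnapshot(row) {
    buffer.push(row);
}

// Run by a worker on an interval: drains the buffer and writes the whole
// batch to SQL Server in a single round trip.
async function flushSnapshots(pool) {
    if (buffer.length === 0) return;
    const rows = buffer.splice(0, buffer.length);

    const table = new sql.Table('snapshots');
    table.create = false; // the table already exists
    table.columns.add('user_id', sql.Int, { nullable: false });
    table.columns.add('room_id', sql.Int, { nullable: false });
    table.columns.add('nickname', sql.NVarChar(255), { nullable: true });
    for (const row of rows) {
        table.rows.add(row.user_id, row.room_id, row.nickname);
    }

    await pool.request().bulk(table);
}
```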

tl;dr I suspect the problem is not (yet) table locking, but the maximum socket count Node can handle. Pick a more sensible max pool size (possibly 50).
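For concreteness, here is a minimal sketch of your config with a bounded pool; the exact numbers, including min, are illustrative and should be tuned against your server's connection limit:

```js
const config = {
    connectionString: 'Driver=ODBC Driver 17 for SQL Server;Server=localhost;Database=ResearchTest;Trusted_Connection=yes;',
    driver: 'msnodesqlv8',
    pool: {
        max: 50,  // starting point suggested above, not a magic number
        min: 5,   // small floor; Tarn won't pre-create these anyway
        idleTimeoutMillis: 30000
    }
};
```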


Finally, a couple of tips:



  1. By default this library automatically performs a "ping" request (SELECT 1) on connections before releasing them. I suspect this won't be useful for a high-load system like yours and will just add latency, so disable it using the validateConnection: false config option.
  2. Tarn's behaviour, despite what you might expect, is that it doesn't preemptively create connections to reach the desired minimum value; the minimum is simply the point below which it stops discarding resources. You may wish to "warm up" your connection pool on application start by forcing concurrent requests through the pool until you reach the minimum number (see the sketch after this list). That way you've "reserved" those sockets for the connection pool instead of having them taken by incoming web sockets. I'm not sure how much it will really help, though; it will just push the problem to the web socket layer, meaning clients won't be able to connect to you rather than you failing to connect to the database. You'll still need to address the total socket count somewhere.
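A minimal warm-up sketch, assuming a standard node-mssql ConnectionPool, the bounded config sketched above, and tip 1 applied (warmUp and the count are illustrative, not part of this library):

```js
const sql = require('mssql');

// Validation ping disabled per tip 1.
const pool = new sql.ConnectionPool({ ...config, validateConnection: false });

// Firing `count` trivial queries concurrently forces Tarn to open that many
// connections up front, reserving their sockets before incoming web socket
// traffic can claim them.
async function warmUp(pool, count) {
    await pool.connect();
    await Promise.all(
        Array.from({ length: count }, () => pool.request().query('SELECT 1'))
    );
}

warmUp(pool, 5).catch(console.error);
```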
isaackogan commented 1 year ago

There is so much useful information here. Thank you! I'm going to try your suggestions today. I can't overstate how happy I am to have received this help.

isaackogan commented 1 year ago

HELLO! Thank you for your suggestions. My project is now running with 300 concurrent users/Socket.IO connections.

I made the following changes:

  1. Setting the validateConnection: false config option
  2. Halving the number of 'requests' to the database by combining queries
  3. Reducing the max pool size to 50 (knowing how connections work now, it's funny I had it so large)

I am glad that I have 'extra' cards to pull if there is a scale issue in the future. These include (for my own self-reference):

  1. Warming up the pool
  2. Using Redis as temporary storage for batching
  3. Sharding the storage script (1 data collection script -> n instances of the Node.js storage script)

The bottleneck was absolutely sockets and not table locking. Thank you @dhensby for your help, again. You saved me so much struggle, lol

Take care!

dhensby commented 1 year ago

Great, I'm glad you got to the bottom of it.