gs1 / GS1_DigitalLink_Resolver_CE

The GS1 DigitalLink Resolver Community Edition
Apache License 2.0
38 stars 22 forks source link

MongoDB connection leak in resolver #73

Closed erik-wramner closed 1 month ago

erik-wramner commented 9 months ago

We are running the resolver behind a load balancer that uses GET requests to check that it is up. When the system starts, this works fine:

Mon, 11 Dec 2023 09:01:15 GMT >> identificationKeyController: Found the dbDocument for requested URL /07322975634978/ Mon, 11 Dec 2023 09:01:15 GMT >> identificationKeyController: Found the dbDocument for requested URL /07322975634978/ Mon, 11 Dec 2023 09:01:15 GMT >> identificationKeyController: Found the dbDocument for requested URL /07322975634978/

However, after some time we get MongoDB errors:

Mon, 11 Dec 2023 08:50:22 GMT >> identificationKeyController: Found the dbDocument for requested URL /07322975634978/ Mon, 11 Dec 2023 08:50:22 GMT >> Exception occur in findDigitalLinkEntryUsingIdentifier method Mon, 11 Dec 2023 08:50:22 GMT >> MongoServerSelectionError: connection timed out Mon, 11 Dec 2023 08:50:22 GMT >> identificationKeyController: Found the dbDocument for requested URL /07322975634978/ Mon, 11 Dec 2023 08:50:24 GMT >> Exception occur in findDigitalLinkEntryUsingIdentifier method Mon, 11 Dec 2023 08:50:24 GMT >> MongoServerSelectionError: connection timed out

Looking at the active connections to MongoDB, each side of the cluster has about 2k connections even though traffic is very low (consisting of the load balancer ping):

image

The graphs went down when I killed the resolver. It will work itself up to the original figures again.

My take on this is that there is a connection leak, so that MongoDB connections are not released and eventually we run into the limit on allowed active connections. This really must be solved or the resolver dies within hours every time.

EDIT: here is the graph some time later, notice how connections keep increasing. image

EDIT: and the connections keep going up. image

EDIT: https://stackoverflow.com/questions/71632524/mongodb-connections-dont-close-after-query seems to be the same error. I can note that we never close connections on errors (the close should ideally be in a finally block to run even if there are exceptions), but we don't get exceptions... so it must be something wrong in the regular happy flow.

erik-wramner commented 9 months ago

I think we have found the problem. It appears that the code is creating new MongoClients everywhere. There should only be one, as it is basically a connection pool. On the other hand, at least findPrefixEntry fails to close the connection (connClient) obtained from the pool - instead, it closes the pool, but that will not close connections that are still in use. Thus we have a leak.

Ideally we should rewrite to use a single global MongoClient and to close all connections (returning them to the pool) in finally blocks.

nicklansley commented 9 months ago

Thanks for highlighting this issue, I have updated the code to fix:

I've tested and updated the code. I used ab (Apache Bench) and it does seem to be pretty fast now!

I made the change to: resolver_web_server/src/db.js

The changes are:

  1. Added a 'global' database connection mongoConnection
  2. Added a 'global' mongoClient
  3. Added a new database access variable 'db'
  4. Wrote a mongoClient.open() event that sets 'db' to point to the correct Mongo database 'gs1resolver'
  5. Added a cleanup() function to close the database cleanly on a SIGTERM/SIGINT event (better than just abandoning the connection on app shutdown!)
  6. Simplified the calling functions in db.js to access the db variable and not closing the connection
nicklansley commented 1 month ago

Apologies for this I am now focussed on Resolver v3.0 and I have stopped maintenance on v2.6. I hope you find the far simpler v3.0 to your liking!