liberation-data / drivine

Best and fastest graph database client (Neo4j, AgensGraph, Apache AGE, FalkorDB) for Node.js & TypeScript.
https://drivine.org
Apache License 2.0
152 stars 33 forks source link

Neo4j connections are not properly released when using @Transactional and concurrency #116

Closed vongruenigen closed 1 year ago

vongruenigen commented 1 year ago

Hey there :)

We're using drivine in combination with neo4j for all our backend software at the company I'm working at. We recently started to run into issues with connections not being properly released when using the @Transactional decorator and concurrency, i.e. running multiple database queries simultaneously. The problem is that it seems that connections used within a @Transactional context are not released properly and this leads to a problem after a number of requests, because the connection acquisition times out in the neo4j driver.

I've created a minimal example by using the drivine-inspiration template and adding some concurrency by running a database query 20 times in the RoutesController (see line 23-25 in there). The repository can be found here: https://github.com/vongruenigen/drivine-inspiration-connection-release-issue

You can trigger the problem the following way:

  1. Start the nestjs server in the repo mentioned above.
  2. Query the /routes/between/:start/:dest endpoint multiple times one after the other (delay between doesn't really matter, tried it in a quick succession as well with 30s in between the request).
  3. The 6th request will fail with the error message that the connection acquisition failed after 60s because no free connections are available.

It seems that the connections of previous requests are not properly released when using the @Transactional decorator and we're not sure what we're doing wrong. Since I used the drivine-inspiration starter template for the repository above I assume that this is a general problem in drivine itself and not the way we're using it.

Thanks in advance!

vongruenigen commented 1 year ago

We did some more investigation, and it seems that the issue is within the Transaction class: When running multiple queries concurrently, there will be new connections created within the connectionFor method because the access to the connectionRegistry map is not synchronized via a mutex or similar.

We have a quickfix for this particular case (using the async-mutex library), but only for the connectionFor method, see below. The problem with unsynchronized concurrent access might be a problem in other parts of the library as well, we haven't checked this yet.

async connectionFor(database) {
    const release = await connectionMutex.acquire();
    try {
        if (!this.connectionRegistry.get(database)) {
            const databaseRegistry = this.contextHolder.databaseRegistry;
            const connectionProvider = databaseRegistry.connectionProvider(database);
            if (!connectionProvider) {
                throw new DrivineError_1.DrivineError(`There is no database registered with key: ${database}`);
            }
            const connection = await connectionProvider.connect();
            this.connectionRegistry.set(database, connection);
            await connection.startTransaction();
        }
        return this.connectionRegistry.get(database);
    } catch {}
    finally {
        release();
    }
}
jasperblues commented 1 year ago

Thanks for the report and analysis Dirk. I'm sorry that you experienced this issue.

I'll apply the mutex solution on the connectionFor method straight away and publish a new release.

Following this, let's also consider if there are other places with unsyncrhonized "get or create" of resources might cause issues.

vongruenigen commented 1 year ago

Thanks for swift reaction! And no worries, happens to the best of us ๐Ÿ˜„

I'll do some digging over the weekend and next week as well to figure out if there are other places there concurrency might lead to issues.

Another possible solution for this particular case we thought about is that the library could use one connection per query, i.e. using an array of connections rather than a single one in the connectionRegistry. This might lead to other issues though (e.g. too many connections used at the same time).

jasperblues commented 1 year ago

@vongruenigen For transactions to behave as expected its necessary for one transaction to have one "connection"

(As you know, a connection is a wrapper around a Neo4j session, in the case of Neo4j. And its the session that provides an interface to begin, commit or rollback transactions and to issue statements in between those events).

I think your mutex locking is ideal and should have negligible performance impact.

--

I believe there might be a similar issue in the non-transactional (without the decorator) version, but not sure where.

However, let's apply fixes separate as they're found.

Sorry that I haven't worked on it yet, but will see if I can work on it tomorrow (GMT+10 : Australia).

vongruenigen commented 1 year ago

Great, thanks a lot for the update, looking forward to the new release! ๐Ÿ‘

vongruenigen commented 1 year ago

@jasperblues wanted to quickly follow up and ask when you plan to release the new version? We'd need the fix latest by the end of this week. If you're too busy I could open a PR from my own fork if you'd like.

jasperblues commented 1 year ago

HI Dirk. I'm sorry sorry that I haven't gotten to this yet.

Yes that would be great if you could open a PR. I will get it merged and a new version published immediately.

Thank you for your very collaborative open-source spirit.

vongruenigen commented 1 year ago

Ok, I'll open a MR later today ๐Ÿ‘

jasperblues commented 1 year ago

Merged the PR, thanks again Dirk.

As discussed we may need to do this for the non-transactional context and check for other places this issue manifests. Will leave the issue open for that.

In the meantime a new version 2.5.0 was published that contains the txn fix.

vongruenigen commented 1 year ago

Great, thanks a lot for publishing the new version! I'll keep it on my list as well to look at the non-transactional and potential other places where this issue might arise.

jasperblues commented 1 year ago

Stale issue - MR was merged.