neo4j / graphql

A GraphQL to Cypher query execution layer for Neo4j and JavaScript GraphQL implementations.
https://neo4j.com/docs/graphql-manual/current/
Apache License 2.0

`Neo4jGraphQLSubscriptionsCDCEngine` with enabled CDC@Aura kills the neo4j-graphql server whenever massive data changes happen within a short period of time #5787

Open andreloeffelmann opened 1 week ago

andreloeffelmann commented 1 week ago

Describe the bug

Since API v6 the Neo4jGraphQLSubscriptionsDefaultEngine is deprecated, so we replaced it with the only currently available one, Neo4jGraphQLSubscriptionsCDCEngine, and enabled CDC with mode=DIFF on our Aura instance. This works totally fine as long as there are no massive changes within the database.

But: we have scheduled tasks that import data into our aura instance every day. The amount of data varies, but it is not uncommon that millions of nodes are (re-)inserted into the database. This causes a LOT of CDC events, which propagate back to our neo4j-graphQL server, which then dies and shuts down. This even happens on my local machine, where plenty of CPU and RAM is available, so resources do not seem to be the bottleneck. I tried different values for pollTime, ranging from 100ms to 5000ms, but this has no effect on the problem: the server dies either way.

The thing is: we do not need the CDC events from the database. We only need subscriptions for change events happening within the neo4j-graphQL server - exactly what Neo4jGraphQLSubscriptionsDefaultEngine was doing. Since that was dropped, the only working solution for us right now is to disable subscriptions entirely.

On the other hand, we have some applications which rely on the subscription functionality - these apps do not work anymore.

We definitely do NOT want to stay on v5, since we aim to be up to date with the API at all times.

So, what are our options? Do you see a way to re-enable something like Neo4jGraphQLSubscriptionsDefaultEngine again in API v7? Can we somehow mimic the behaviour of this engine by ourselves and pass it to Neo4jGraphQL?
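To make "change events happening within the server" concrete, here is a minimal sketch of an in-process change bus that application code could publish to after a successful mutation. Everything in it (`InProcessChangeBus`, `AppChangeEvent`, the operation names) is hypothetical and is not the library's engine interface; whether and how such an object could be wired into Neo4jGraphQL is not settled by this sketch.

```typescript
// Hypothetical event shape: what the server itself knows right after a mutation.
type Operation = "create" | "update" | "delete";

interface AppChangeEvent {
    typename: string;
    operation: Operation;
    id: string;
}

type Listener = (event: AppChangeEvent) => void;

// In-memory publish/subscribe keyed by GraphQL type name.
// Events never leave the process, so bulk imports that bypass the server
// produce no traffic here.
class InProcessChangeBus {
    private listeners = new Map<string, Set<Listener>>();

    // Registers a listener and returns a function that removes it again.
    subscribe(typename: string, listener: Listener): () => void {
        let set = this.listeners.get(typename);
        if (!set) {
            set = new Set<Listener>();
            this.listeners.set(typename, set);
        }
        set.add(listener);
        const registered = set;
        return () => {
            registered.delete(listener);
        };
    }

    // Delivers the event to every listener of its type; returns the delivery count.
    publish(event: AppChangeEvent): number {
        const set = this.listeners.get(event.typename);
        if (!set) return 0;
        let delivered = 0;
        set.forEach((listener) => {
            listener(event);
            delivered += 1;
        });
        return delivered;
    }
}
```

The trade-off is the one that matters in this issue: such a bus only sees writes that go through the server process, and it does not work across multiple server instances without an external broker.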

By the way: to enable CDC in our situation feels really bad since it massively blows up the transaction log with CDC events no one needs.

andreloeffelmann commented 1 week ago

To add an idea: I do not know if this is possible, but wouldn't it be good if each poll queried only for those CDC events for which subscriptions are currently active? This would massively reduce the amount of transferred data.
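A rough sketch of that idea: keep a count of active subscriptions per label and derive the list of selectors to send with each poll. The `NodeSelector` shape is an assumption modelled on the node selectors described in the Neo4j CDC documentation, and `ActiveSubscriptions` is a hypothetical helper, not something the library exposes.

```typescript
// Assumed selector shape (select nodes carrying the given labels).
// Verify the field names against the CDC documentation for your Neo4j version.
type NodeSelector = { select: "n"; labels: string[] };

// Tracks how many subscribers are currently listening per label.
class ActiveSubscriptions {
    private counts = new Map<string, number>();

    subscribe(label: string): void {
        this.counts.set(label, (this.counts.get(label) ?? 0) + 1);
    }

    unsubscribe(label: string): void {
        const next = (this.counts.get(label) ?? 0) - 1;
        if (next > 0) {
            this.counts.set(label, next);
        } else {
            this.counts.delete(label);
        }
    }

    // One selector per label that has at least one active subscriber.
    // Assumption: an empty selector list may mean "no filtering" on the
    // database side, so a caller should skip the poll when this is empty.
    toSelectors(): NodeSelector[] {
        return Array.from(this.counts.keys())
            .sort()
            .map((label): NodeSelector => ({ select: "n", labels: [label] }));
    }
}
```

With this in place, a bulk import of a label nobody is subscribed to would not be fetched at all, while labels with active subscribers would still transfer every change.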

darrellwarde commented 1 week ago

> But: we have scheduled tasks that import data into our aura instance every day. The amount of data varies, but it is not uncommon that millions of nodes are (re-)inserted into the database.

Currently in a meeting with the team discussing this so will reply in full later. But wanted to check on this - are these millions of daily import events related to nodes which are also defined in your GraphQL type definitions, or node types which are unrelated to your GraphQL API? Thanks for the info and quick replies!

andreloeffelmann commented 1 week ago

They are related to the API and exist within the GraphQL type definitions

darrellwarde commented 1 week ago

Okay! Thank you very much for raising this issue and for your continued commitment to improving this library with really well-written issues, we do appreciate it!

We have discussed this and have come up with the following plan:

  1. In Version 6, we will investigate filtering by metadata to see if we can add an option to only consume events which were caused by the GraphQL Library - if this does prove fruitful then I believe this will resolve your issue here.
  2. As a future optimization, we will select either only labels for which subscriptions have been enabled, or better yet, the labels which currently have active subscriptions.
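The two steps of this plan can be pictured as a filter over incoming change events. This is only an illustration under assumptions: the `ChangeEvent` shape is a simplified guess at a CDC event, and the `app` / `"neo4j-graphql"` marker is a hypothetical piece of transaction metadata, not a value the library is confirmed to set.

```typescript
// Assumed, simplified shape of a CDC change event; only the fields used here.
interface ChangeEvent {
    metadata: { txMetadata?: Record<string, unknown> };
    event: {
        eventType: "n" | "r";
        operation: "c" | "u" | "d";
        labels?: string[];
    };
}

// Hypothetical marker that the writing side would attach as transaction metadata.
const SOURCE_KEY = "app";
const SOURCE_VALUE = "neo4j-graphql";

// Step 1: keep only events whose transaction was tagged as coming from the GraphQL layer.
function isFromGraphQL(e: ChangeEvent): boolean {
    return e.metadata.txMetadata?.[SOURCE_KEY] === SOURCE_VALUE;
}

// Step 2: of those, keep only events touching a label with an active subscription.
function selectRelevant(events: ChangeEvent[], activeLabels: Set<string>): ChangeEvent[] {
    return events.filter(
        (e) => isFromGraphQL(e) && (e.event.labels ?? []).some((label) => activeLabels.has(label))
    );
}
```

Filtering on the client like this still transfers every event from the database, so it would not relieve the server during a bulk import on its own. The benefit described in the plan comes from applying the same conditions on the database side, so that unrelated events are never sent.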

> Do you see a way to re-enable something like Neo4jGraphQLSubscriptionsDefaultEngine again in API v7?

On this one, this is a hard no I'm afraid! This engine required a massive amount of crazy Cypher which we also couldn't implement in all cases - sometimes events would be happening in GraphQL but not being raised due to the difficulty of capturing them in Cypher. This CDC approach is so many times more reliable and makes it a lot easier to work on Cypher generation in the library.

> By the way: to enable CDC in our situation feels really bad since it massively blows up the transaction log with CDC events no one needs.

On this one, I would strongly recommend looking into transaction log retention settings if you haven't already. I was just chatting about this with one of the kernel engineers earlier who gave me this info:

> For something like GraphQL, it probs depends more on what they expect the average user session to be. If they're using the subs for live-updates, then the retention is less of an issue, as the client should be pushing out updates pretty much as soon as they occur.
>
> The longer retention is more for long running polling apps (like ETL pipelines) where, if they go down, they can be restarted and expect to pick up from their last update.
>
> For GraphQL, I imagine the reset is more from some initial queries and then getting deltas again.

I hope all of the above is generally good news and helpful for you!