pietermartin / sqlg

TinkerPop graph over sql
MIT License
243 stars 51 forks source link

Exceptions when adding vertices/edges that do on-demand schema creation #338

Closed ktschmidt closed 3 years ago

ktschmidt commented 5 years ago

When using our full application which is processing a stream of messages in a multi-threaded fashion, if vertices/edges/properties are added that require schema creation, we have seen cases where we get deadlocks with MySQL/MariaDB. We've taken steps to try to validate requests and not allow on-demand creation, but we've still seen cases of deadlocks timing out.

Here is an example of the stacktrace we see:

20190313 08:07:57.722 [ERROR] Camel (ngms-mmsender-context) thread #67 - JmsConsumer[ngmsEventRegistryEnrichedFeed] | 413:com.nextgate.graph.ngg-ngms-feed | com.nextgate.graph.ngmsfeed.Feed | Exception occurred processing message java.lang.RuntimeException: Timeout lapsed to acquire write lock for notification. at org.umlg.sqlg.structure.topology.Topology.z_internalSqlWriteLock(Topology.java:619) at org.umlg.sqlg.structure.topology.Topology.lock(Topology.java:584) at org.umlg.sqlg.structure.topology.VertexLabel.ensurePropertiesExist(VertexLabel.java:411) at org.umlg.sqlg.structure.topology.Schema.ensureVertexLabelExist(Schema.java:179) at org.umlg.sqlg.structure.topology.Schema.ensureVertexLabelExist(Schema.java:156) at org.umlg.sqlg.structure.topology.Topology.ensureVertexLabelExist(Topology.java:763) at org.umlg.sqlg.structure.SqlgGraph.addVertex(SqlgGraph.java:382) at org.umlg.sqlg.step.SqlgAddVertexStartStep.processNextStart(SqlgAddVertexStartStep.java:61) at org.umlg.sqlg.step.SqlgAbstractStep.next(SqlgAbstractStep.java:106) at org.umlg.sqlg.step.SqlgAbstractStep.next(SqlgAbstractStep.java:17) at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.next(DefaultTraversal.java:200)

This is not easily reproducible since it is in context of our full application, but in an effort to reproduce it, I wrote a simple client that adds vertices/properties/edges, some adds that will cause on-demand schema creation, and running multiple threads to try to cause problems to occur.

The client is available at https://github.com/ktschmidt/sqlgstuff in the populatetest module. Included there is an output file from using MySQL/MariaDB with the following exception:

Exception in thread "Thread-5" java.lang.IllegalStateException: Edge vertex 'prior' does not exist in schema 'PUBLIC' at com.google.common.base.Preconditions.checkState(Preconditions.java:817) at org.umlg.sqlg.structure.topology.TopologyManager.addLabelToEdge(TopologyManager.java:695) at org.umlg.sqlg.structure.topology.EdgeLabel.ensureEdgeVertexLabelExist(EdgeLabel.java:768) at org.umlg.sqlg.structure.topology.Schema.internalEnsureEdgeTableExists(Schema.java:344) at org.umlg.sqlg.structure.topology.Schema.ensureEdgeLabelExist(Schema.java:251) at org.umlg.sqlg.structure.topology.Topology.ensureEdgeLabelExist(Topology.java:836) at org.umlg.sqlg.structure.topology.Topology.ensureEdgeLabelExist(Topology.java:809) at org.umlg.sqlg.structure.SqlgVertex.addEdgeInternal(SqlgVertex.java:171) at org.umlg.sqlg.structure.SqlgVertex.addEdge(SqlgVertex.java:137) at com.nextgate.test.sqlgtest.CreateData.run(CreateData.java:47) at java.lang.Thread.run(Thread.java:748)

Followed by the following for each other thread.

Exception in thread "Thread-9" java.lang.RuntimeException: Timeout lapsed to acquire write lock for notification. at org.umlg.sqlg.structure.topology.Topology.z_internalSqlWriteLock(Topology.java:619) at org.umlg.sqlg.structure.topology.Topology.lock(Topology.java:584) at org.umlg.sqlg.structure.topology.VertexLabel.ensurePropertiesExist(VertexLabel.java:411) at org.umlg.sqlg.structure.topology.Schema.ensureVertexColumnsExist(Schema.java:488) at org.umlg.sqlg.structure.topology.Topology.ensureVertexLabelPropertiesExist(Topology.java:902) at org.umlg.sqlg.structure.SqlgElement.property(SqlgElement.java:239) at org.umlg.sqlg.structure.SqlgVertex.property(SqlgVertex.java:211) at com.nextgate.test.sqlgtest.CreateData.run(CreateData.java:43) at java.lang.Thread.run(Thread.java:748)

I'm not sure why the first problem occurs, but once it does the other threads all deadlock. So perhaps two separate issues.

The above is somewhat repeatable by just running the client, and is run from a completely clean database, e.g. no sqlg_schema or gui_schema and an empty public database.

I also ran the client with PostgreSQL and the same problems don't occur, but an odd NullPointerException did occur:

Exception in thread "Thread-9" java.lang.NullPointerException at org.umlg.sqlg.structure.SqlgEdge.internalAddEdge(SqlgEdge.java:169) at org.umlg.sqlg.structure.SqlgEdge.insertEdge(SqlgEdge.java:134) at org.umlg.sqlg.structure.SqlgEdge.(SqlgEdge.java:57) at org.umlg.sqlg.structure.SqlgVertex.addEdgeInternal(SqlgVertex.java:175) at org.umlg.sqlg.structure.SqlgVertex.addEdge(SqlgVertex.java:137)

pietermartin commented 5 years ago

Thanks, I'll have a look at your test program. Are there any write transactions that take longer than 2 minutes? Sqlg's lock mechanism is more conservative since #329 . Sqlg does not allow making topology changes while a write thread is in progress as it could lead to dead locks. If a write thread is in progress it will block for 2 minutes before throwing the java.lang.RuntimeException: Timeout lapsed to acquire write lock for notification. So if this is the case I'll advice commit more frequently to allow other threads to get the lock or increase the timeout using lock.timeout.minutes property in sqlg.properties. Still another way is to have retry logic, but alas this is a pain.

ktschmidt commented 5 years ago

No, all write transactions should be quick, nothing complicated, either in our real app or this test client. It is just adding a vertex or edge and a property or two.

We actually are committing fairly frequently, you can see the test client is not batching up any of the adds.

pietermartin commented 5 years ago

Had a look. I can not get postgresql to not work. I ported your test to Sqlg so perhaps have a look if it fails by you on postgresql. However MariaDB hangs everytime, even with only 3 threads. Not that familiar with MariaDB but will keep looking to see what's going on.

pietermartin commented 5 years ago

Ok, made some progress but no solutions yet.

java.lang.IllegalStateException: Edge vertex 'prior' does not exist in schema 'PUBLIC' happens I think because of MVCC semantics. 2 transactions run at the same time. transaction 1 created the edge 'prior' and commits which puts it into Sqlg's global cache. Transaction 2 now wants to do something to edge 'prior'. Sqlg says it exist but due to MVCC transaction 1 can not see it as it was not there when the transaction started. Secretly committing and starting a new transaction after taking the topology lock resolves the issue but is kinda unsatisfactory as it violates the users transaction boundaries. This is a rather serious issue. My first concern is why I have never seen this happening before on Postgresql as it too has MVCC semantics. Kinda late to only realize this now.

For the part where MariaDB hangs, its because it tried to execute an ALTER STATEMENT which is waiting for another write thread that executed an UPDATE to commit . The write thread is also trying to do a schema modification and can not get the topology lock as its taken by the ALTER STATEMENT's thread. Boom bam dead lock.

There is code to throw a deadLockDetected exception but its only implemented for posgresql at present. I'll try to implement the same on MariaDB.

As an aside the ALTER STATEMENT does not block for the UPDATE if you disable foreign keys. I imagine this is why its not happening on postgresql as its locking is probably finer grained with respect to foreign keys.

ktschmidt commented 5 years ago

I did discover that due to another issue of our own, on-demand schema creation was occurring that shouldn't have, so fixing that which should at least mitigate the effects of this issue, but thanks for the update. I'll share more as we observe anything new/interesting, but it seems the safest approach is to try to get the schema created up front using Topology so trying to do locks for the CREATE/ALTER statements isn't necessary.

On Sun, Mar 17, 2019 at 12:15 PM pietermartin notifications@github.com wrote:

Ok, made some progress but no solutions yet.

java.lang.IllegalStateException: Edge vertex 'prior' does not exist in schema 'PUBLIC' happens I think because of MVCC semantics. 2 transactions run at the same time. transaction 1 created the edge 'prior' and commits which puts it into Sqlg's global cache. Transaction 2 now wants to do something to edge 'prior'. Sqlg says it exist but due to MVCC transaction 1 can not see it as it was not there when the transaction started. Secretly committing and starting a new transaction after taking the topology lock resolves the issue but is kinda unsatisfactory as it violates the users transaction boundaries. This is a rather serious issue. My first concern is why I have never seen this happening before on Postgresql as it too has MVCC semantics. Kinda late to only realize this now.

For the part where MariaDB hangs, its because it tried to execute an ALTER STATEMENT which is waiting for another write thread that executed an UPDATE to commit . The write thread is also trying to do a schema modification and can not get the topology lock as its taken by the ALTER STATEMENT's thread. Boom bam dead lock.

There is code to throw a deadLockDetected exception but its only implemented for posgresql at present. I'll try to implement the same on MariaDB.

As an aside the ALTER STATEMENT does not block for the UPDATE if you disable foreign keys. I imagine this is why its not happening on postgresql as its locking is probably finer grained with respect to foreign keys.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/pietermartin/sqlg/issues/338#issuecomment-473705477, or mute the thread https://github.com/notifications/unsubscribe-auth/ABHUDlEQrlxJRpjM9ea4CvJTrm34KNYUks5vXpRFgaJpZM4bxb4K .

pietermartin commented 3 years ago

I am planning on removing all lock mechanisms and tricks from Sqlg. To get it to work the granularity is to course grained. Schema creation can only proceed in one thread at a time and when no other thread is writing.

So the idea to protect the user from dead locks did not really work. The pattern going forward is to use one thread for lazy creation or the schema creation api.

I am going to close this for now but open it again if you want to.

ktschmidt commented 3 years ago

@pietermartin are you removing all locks entirely? Or just those related to schema creation?

I want to make sure I understand the behavior when Sqlg is used from multiple threads or even multiple nodes.

And we do use the schema creation API and try to avoid lazy creation, but what would happen if multiple threads try to do lazy creation now?

pietermartin commented 3 years ago

Yes, the plan to remove all attempts by Sqlg to take locks.

In my day job where we manage many large Sqlg databases the locking strategy has not really worked out. Its not that there are bugs but rather that its very slow, forcing everything into one thread, and often occasionally throws Timeout exceptions if it can't get the lock.

To overcome this we now pre-create the schema object using the topology api. This works fine but because of Sqlg's locking strategy this still only happens in one thread even is you are creating unrelated schema objects.

Without any locking strategy the user will be free to create schema objects in many threads but is responsible for understanding postgresql/rdbms locks and will have to manually handle dead lock exceptions.

When I get to this I'll see if there is any deadlock detection that Sqlg can maybe do.

I you have ideas coming from your experience then please let me know as I have not started on this yet. Want to squash the outstanding bugs first.

ktschmidt commented 3 years ago

We use the schema creation API because we do want to avoid any complications of locks or contention for lazy creation as we have high volume transactions at times that may be stepping on each other trying to add new properties to the schema. Thus, the locking is just a failsafe should something slip through our schema creation.

What would be useful is perhaps an option to disable lazy creation so we have to use the schema creation API and can avoid the issue of locking/contention of multiple threads modifying the schema. Is this something reasonably straight-forward to add?

pietermartin commented 3 years ago

It should be fairly easy to disable lazy creation.

Just to be clear it will be up to the client to ensure that using the schema creation api does not lock. So either just force it to be in one thread or as in our case we have disparate parts of the system which can run in parallel as there is no risk of locking.

ktschmidt commented 3 years ago

Yep, that is. fine. We do the schema creation from a single node/thread as part of config/setup.