Causal Clusters Failing

creatzor commented 7 years ago

We're using uuid-3.1.0.44.13.

Plugin Config:

dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware
com.graphaware.runtime.enabled=true
com.graphaware.module.UIDM.1=com.graphaware.module.uuid.UuidBootstrapper

We're unable to get a causal cluster to run with UUID enabled; with it disabled, it works fine. It seems like UUID is trying to write to the database through a follower before the database is ready.

The leader starts fine, but the followers fail with a write error before the remote interface (localhost:7474) is enabled. This happens with a completely empty database too.

I also notice that CALL ga.uuid.findNode("uuid") yield node as n also causes a write.

[Attached screenshots: leader, follower]

ikwattro commented 7 years ago

Thanks for the report, @albert-the-creator. We will investigate, but please expect the fix to take a bit longer than usual, not because of the holidays but because we are under a heavy workload right now.

We'll address this issue as soon as we can.

Thanks and happy end 2016/new 2017

creatzor commented 7 years ago

@ikwattro I see, it's that time of the year for your team.

Would we be able to use UUID with HA instead of causal clusters, or would you expect the same issue to persist? The UUIDs play an integral part in our app, so any workable solution is a good solution for us at this point (as I'm sure it will be for everyone else who wants to scale their Neo4j app).

Happy New Year!!

bachmanm commented 7 years ago

Hi Albert, with HA it works just fine. We'll keep you updated with regards to the fix for causal clusters.

creatzor commented 7 years ago

@omarlarus any idea when a new release with this fix will be out?

creatzor commented 7 years ago

log.txt

@bachmanm was this fix deployed in the latest release? We still get the same issue.

creatzor commented 7 years ago

dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware
com.graphaware.runtime.enabled=true
com.graphaware.module.UIDM.1=com.graphaware.module.uuid.UuidBootstrapper

Given 3 nodes:
node-1: Looking for other members of the cluster
node-2: Starts correctly
node-3: Gives us that log ^^ and starts

bachmanm commented 7 years ago

Yes, should be fixed in the 3.1.3-compatible .jars.

creatzor commented 7 years ago

@bachmanm hmm. We're still experiencing that issue. Maybe we're using a bad config, but we're currently out of ideas on what it could be. We can successfully run the cluster without the GraphAware plugins; with them, we get that issue ^^ https://github.com/graphaware/neo4j-uuid/files/900664/log.txt

ikwattro commented 7 years ago

Thx @albert-the-creator .

cc @omarlarus @albertodelazzari Can you guys check this, please?

ikwattro commented 7 years ago

@albert-the-creator please provide versions of all your plugins and config

creatzor commented 7 years ago

Using Neo4j Enterprise 3.1.3 from Docker. We use the default config plus:

dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware
com.graphaware.runtime.enabled=true
com.graphaware.module.UIDM.1=com.graphaware.module.uuid.UuidBootstrapper

We use the following plugins:

plugins/graphaware-server-enterprise-all-3.1.3.45.jar
plugins/graphaware-uuid-3.1.3.45.14.jar

@ikwattro
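For anyone comparing plugin versions, here is a minimal Python sketch that splits jar names like the two above into their parts. It assumes the GraphAware naming convention in which the first three numbers are the Neo4j version, the fourth is the framework version, and an optional fifth is the module version. The names `JarVersion`, `parse_jar`, and `same_framework` are hypothetical helpers written for this illustration, not part of any GraphAware tooling.

```python
import re
from typing import NamedTuple, Optional


class JarVersion(NamedTuple):
    artifact: str           # e.g. "graphaware-uuid"
    neo4j: str              # e.g. "3.1.3"
    framework: int          # e.g. 45
    module: Optional[int]   # e.g. 14, or None for the framework jar itself


# Assumed layout: <artifact>-<neo4j x.y.z>.<framework>[.<module>].jar
_JAR_RE = re.compile(
    r"^(?P<artifact>[a-z0-9-]+?)-"
    r"(?P<neo4j>\d+\.\d+\.\d+)\."
    r"(?P<framework>\d+)"
    r"(?:\.(?P<module>\d+))?"
    r"\.jar$"
)


def parse_jar(path: str) -> JarVersion:
    """Parse a plugin jar path into its version components."""
    name = path.rsplit("/", 1)[-1]  # drop any leading directory such as "plugins/"
    match = _JAR_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised jar name: {name}")
    module = match.group("module")
    return JarVersion(
        artifact=match.group("artifact"),
        neo4j=match.group("neo4j"),
        framework=int(match.group("framework")),
        module=int(module) if module is not None else None,
    )


def same_framework(a: JarVersion, b: JarVersion) -> bool:
    """True when both jars target the same Neo4j and framework versions."""
    return (a.neo4j, a.framework) == (b.neo4j, b.framework)
```

Calling `same_framework(parse_jar(...), parse_jar(...))` on the two jars listed above only compares version numbers; it says nothing about whether a given release actually contains the fix.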

ikwattro commented 7 years ago

Thanks, can you please upgrade framework version to .46 and use .45 for graphaware-uuid

There was a bug in .45 and we removed it, thanks

https://products.graphaware.com/download/framework-server-enterprise/graphaware-server-enterprise-all-3.1.3.46.jar

omarlarus commented 7 years ago

@albert-the-creator when you start the cluster with the plugin, is the database empty?

creatzor commented 7 years ago

@omarlarus the db is not empty when we start the cluster. We'll try the new plugins and report back, @ikwattro.

creatzor commented 7 years ago

When we actually enable the graphaware plugin it fails to start

com.graphaware.runtime.enabled=true

With the new plugins (TimeTree and NodeRank aren't in use, btw): [screenshot: screen shot 2017-04-11 at 11 36 32 am]

We tried the new plugins and they cause the leader not to start. Logs: 1.new-log.txt, 2.leader-log.txt

When we have an empty db, the cluster doesn't start. @ikwattro @omarlarus

Update: after waiting a long while, the leader started, but the Neo4j browser doesn't let us sign in, so there may still be something wrong there. No additional logs. Without the plugin, everything works as normal.

creatzor commented 7 years ago

Any updates on this? Are we doing something wrong?

omarlarus commented 7 years ago

@albert-the-creator try with this jar: https://wetransfer.com/downloads/a3ba7e984b32c665e32f67701dc98dea20170414073416/56722ec96cd2b789a83df32692c7bd3c20170414073416/5ac3ae

creatzor commented 7 years ago

@omarlarus the servers don't start; they get stuck at "Attempting to connect to the other cluster members before continuing..."

We're still trying to figure out what's causing it to fail. I was able to run it successfully on my Mac after delaying the startup of each cluster member, but it seems to be failing in our production environment.

4/14/2017 10:29:14 PM Exception in thread "GraphAware Starter" org.neo4j.graphdb.TransactionFailureException: Transaction was marked as successful, but unable to commit transaction so rolled back.

Server_0.txt Server_1.txt Server_3.txt

UPDATE: they started after we used a 1-minute delay between starting each server.

Sometimes it starts, other times it doesn't (same behavior on my Mac).
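The delayed startup described above can be sketched in Python roughly as follows. This is only an illustration of the workaround, not a fix: `start_staggered` is a hypothetical helper, and the commands passed to it (for example `docker start` with your own container names) are assumptions about how each cluster member is launched.

```python
import subprocess
import time
from typing import Any, Callable, List, Sequence


def start_staggered(
    commands: Sequence[Sequence[str]],
    delay_seconds: float = 60.0,
    launch: Callable[[Sequence[str]], Any] = subprocess.Popen,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Any]:
    """Launch each command in order, pausing between consecutive launches.

    `launch` and `sleep` are injectable so the ordering can be checked
    without starting real processes or actually waiting.
    """
    handles = []
    for index, command in enumerate(commands):
        if index > 0:
            # Give the previously started member time to come up first.
            sleep(delay_seconds)
        handles.append(launch(command))
    return handles
```

A call such as `start_staggered([["docker", "start", "core-1"], ["docker", "start", "core-2"], ["docker", "start", "core-3"]])` would start three members one minute apart; the container names here are placeholders. As noted above, even with the delay the cluster only starts some of the time.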

omarlarus commented 7 years ago

You should test it on a brand new cluster, with a new, empty database and without any previously installed GraphAware module. It seems that a previous configuration is found in the metadata and something goes wrong.

creatzor commented 7 years ago

@omarlarus how can we delete that metadata for existing dbs? We've been running our app in production, so we can't start over.

creatzor commented 7 years ago

Trying it with a fresh db locally, I get:

2017-04-18 17:09:58.458+0000 INFO Discovering cluster with initial members: [localhost:5000, localhost:5001, localhost:5002]
2017-04-18 17:09:58.458+0000 INFO Attempting to connect to the other cluster members before continuing...
2017-04-18 17:10:14.147+0000 ERROR Failed to start Neo4j: Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@29cee68' was successfully initialized, but failed to start. Please see attached cause exception. Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@29cee68' was successfully initialized, but failed to start. Please see attached cause exception.
org.neo4j.server.ServerStartupException: Starting Neo4j failed: Component 'org.neo4j.server.database.LifecycleManagingDatabase@29cee68' was successfully initialized, but failed to start. Please see attached cause exception.
    at org.neo4j.server.exception.ServerStartupErrors.translateToServerStartupError(ServerStartupErrors.java:68)
    at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:230)
    at org.neo4j.server.ServerBootstrapper.start(ServerBootstrapper.java:91)
    at org.neo4j.server.ServerBootstrapper.start(ServerBootstrapper.java:68)
    at org.neo4j.server.enterprise.EnterpriseEntryPoint.main(EnterpriseEntryPoint.java:32)
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.server.database.LifecycleManagingDatabase@29cee68' was successfully initialized, but failed to start. Please see attached cause exception.
    at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:443)
    at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:107)
    at org.neo4j.server.AbstractNeoServer.start(AbstractNeoServer.java:202)
    ... 3 more
Caused by: java.lang.RuntimeException: Error starting org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory, /Users/alfrimpong/Documents/Neo Cores/core-01/data/databases/graph.db
    at org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory.initFacade(GraphDatabaseFacadeFactory.java:199)
    at org.neo4j.causalclustering.core.CoreGraphDatabase.<init>(CoreGraphDatabase.java:56)
    at org.neo4j.causalclustering.core.CoreGraphDatabase.<init>(CoreGraphDatabase.java:47)
    at org.neo4j.server.enterprise.EnterpriseNeoServer.lambda$static$2(EnterpriseNeoServer.java:96)
    at org.neo4j.server.database.LifecycleManagingDatabase.start(LifecycleManagingDatabase.java:89)
    at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:433)
    ... 5 more
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.causalclustering.core.state.CoreState@5f0dfedb' was successfully initialized, but failed to start. Please see attached cause exception.
    at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:443)
    at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:107)
    at org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory.initFacade(GraphDatabaseFacadeFactory.java:195)
    ... 10 more
Caused by: java.lang.RuntimeException: org.neo4j.kernel.impl.transaction.log.NoSuchTransactionException: Unable to find transaction 1 in any of my logical logs: Couldn't find any log containing 1
    at org.neo4j.causalclustering.core.state.machines.tx.LastCommittedIndexFinder.getLastCommittedIndex(LastCommittedIndexFinder.java:67)
    at org.neo4j.causalclustering.core.state.machines.tx.RecoverConsensusLogIndex.findLastAppliedIndex(RecoverConsensusLogIndex.java:48)
    at org.neo4j.causalclustering.core.state.machines.CoreStateMachines.installCommitProcess(CoreStateMachines.java:132)
    at org.neo4j.causalclustering.core.state.CoreState.start(CoreState.java:187)
    at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:433)
    ... 12 more
Caused by: org.neo4j.kernel.impl.transaction.log.NoSuchTransactionException: Unable to find transaction 1 in any of my logical logs: Couldn't find any log containing 1
    at org.neo4j.kernel.impl.transaction.log.PhysicalLogicalTransactionStore$LogVersionLocator.getLogPosition(PhysicalLogicalTransactionStore.java:223)
    at org.neo4j.kernel.impl.transaction.log.PhysicalLogicalTransactionStore.getTransactions(PhysicalLogicalTransactionStore.java:83)
    at org.neo4j.causalclustering.core.state.machines.tx.LastCommittedIndexFinder.getLastCommittedIndex(LastCommittedIndexFinder.java:57)
    ... 16 more

UPDATE:

After I renamed the cluster state folders, it started working

fairy3 commented 7 years ago

@albert-the-creator, can you give details about renaming the state folders, please? We began to work with 3.2.2 and still face the problem: when the db is empty on all three cores, the cluster comes up; when we try to seed it with a db from another single instance, it fails to start and we get the same exception.

creatzor commented 7 years ago

@fairy3 it's actually not recommended to mess with those folders, so I'd recommend not renaming them. Try exporting your seed db with the neo4j backup tool; maybe something went wrong. (If you already do this, then I'm not sure, hah, sorry.)

But as of late, we haven't had issues with GraphAware itself. The issues we've had have been network issues causing the db not to start sometimes.

fairy3 commented 7 years ago

It's a pity, @albert-the-creator. I don't think there are network problems: there is no firewall and all cores can reach each other. So I should probably open an issue. Thanks.
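One quick way to double-check reachability between cores is a plain TCP connect to each member's discovery address, in the `host:port` form used in the `initial members` log line earlier in this thread. The Python sketch below is an assumption-laden illustration: `parse_member` and `check_members` are hypothetical helpers, and a successful connect only shows that the port accepts connections, not that the cluster can form.

```python
import socket
from typing import Dict, Iterable, Tuple


def parse_member(member: str) -> Tuple[str, int]:
    """Split a "host:port" string into its host and integer port."""
    host, _, port = member.rpartition(":")
    return host, int(port)


def check_members(members: Iterable[str], timeout: float = 2.0) -> Dict[str, bool]:
    """Return, per member, whether a TCP connection could be opened."""
    results = {}
    for member in members:
        host, port = parse_member(member)
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure.
            with socket.create_connection((host, port), timeout=timeout):
                results[member] = True
        except OSError:
            results[member] = False
    return results
```

Running `check_members(["localhost:5000", "localhost:5001", "localhost:5002"])` from each core, with your own addresses substituted, gives a rough picture of which members can be reached from where.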

graphaware / neo4j-uuid

Causal Clusters Failing #34