jhipster / generator-jhipster

JHipster is a development platform to quickly generate, develop, & deploy modern web applications & microservice architectures.
https://www.jhipster.tech
Apache License 2.0

Hazelcast Discovery on Startup Not Adequate #10916

Closed: AlexandreCassagne closed this issue 1 year ago

AlexandreCassagne commented 4 years ago
Overview of the issue

Sometimes, when instances are started simultaneously, services fail to discover each other, so several Hazelcast clusters are created and never merge. One piece of evidence is that every instance's Hazelcast log includes:

Members {size:1, ver:1} [
    Member [10.68.6.109]:5701 - 693df0fa-63a9-4015-97ec-48feea1cf483 this
]

All of the other instances show a similar cluster size of 1, and it never gets updated. Another hint that this happens at discovery time is that if I restart 3 out of 4 instances, after about a minute they all suddenly join the older instance (the cluster size goes to 4).

I have done a bit of investigation, and the critical code path seems to be in CacheConfiguration, where all the discovery happens at startup:

config.getNetworkConfig().setPort(5701);
config.getNetworkConfig().getJoin().getTcpIpConfig().setEnabled(true);
for (ServiceInstance instance : discoveryClient.getInstances(serviceId)) {
    String clusterMember = instance.getHost() + ":5701";
    log.debug("Adding Hazelcast (prod) cluster member {}", clusterMember);
    config.getNetworkConfig().getJoin().getTcpIpConfig().addMember(clusterMember);
}

This code sets the instance's own Hazelcast network configuration, then iterates over the instances currently visible to the discovery client. Unfortunately, because this runs quite early in the microservice's lifecycle, other services may not have registered yet, so the clusters never join.
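
To make the race concrete, here is a minimal sketch (not JHipster's actual code) that waits with a bounded retry loop until the registry reports at least one peer before the member list is built. The class name, retry count, and delay are illustrative assumptions:

import java.util.List;
import org.springframework.cloud.client.ServiceInstance;
import org.springframework.cloud.client.discovery.DiscoveryClient;

// Sketch only (hypothetical helper, not generated by JHipster): block briefly until
// the service registry reports at least one peer, so the TCP/IP member list is not
// built before the other replicas have registered.
class PeerWaiter {

    List<ServiceInstance> waitForPeers(DiscoveryClient discoveryClient, String serviceId) {
        for (int attempt = 0; attempt < 30; attempt++) {
            List<ServiceInstance> instances = discoveryClient.getInstances(serviceId);
            if (instances.size() > 1) { // more than just this instance (assumes self-registration is visible)
                return instances;
            }
            try {
                Thread.sleep(2000); // give the other replicas time to register with Eureka
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        // Fall back to whatever is visible, i.e. the original single-shot behaviour.
        return discoveryClient.getInstances(serviceId);
    }
}

This only mitigates the race for the initial join; it does nothing for instances that come up later, which is why the plugin-based discovery approaches discussed further down the thread are more robust.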

Motivation for or Use Case

For those using a second-level cache, instances that do not join mean cache inconsistencies, possibly even security-critical ones. For example, a user's role may be removed, but the eviction of the user's authorities only happens on the 'cluster' that processed the change.

Reproduce the error

For me (perhaps my configuration is specific), this happens any time I launch many instances simultaneously. I am using Kubernetes, so starting a deployment with several replicas is simple.

If just one instance is visible when the other instances are started, the problem goes away (e.g. start one instance, wait for it to be properly started, then increase replica count).

Related issues
Suggest a Fix

If others agree that this is an issue, we should decide on a way forward; there are several considerations to weigh.

Once a decision is made, I'll be happy to help implement any changes.

JHipster Version(s)

v6.5.1

JHipster configuration
.yo-rc.json file
{
  "generator-jhipster": {
    "promptValues": {
      "packageName": "redacted"
    },
    "jhipsterVersion": "6.5.1",
    "applicationType": "microservice",
    "baseName": "simplifier",
    "packageName": "redacted",
    "packageFolder": "redacted",
    "serverPort": "9050",
    "authenticationType": "uaa",
    "uaaBaseName": "uaa",
    "cacheProvider": "hazelcast",
    "enableHibernateCache": true,
    "websocket": false,
    "databaseType": "no",
    "devDatabaseType": "no",
    "prodDatabaseType": "no",
    "searchEngine": false,
    "messageBroker": "kafka",
    "serviceDiscoveryType": "eureka",
    "buildTool": "gradle",
    "enableSwaggerCodegen": false,
    "testFrameworks": [],
    "jhiPrefix": "jhi",
    "entitySuffix": "",
    "dtoSuffix": "DTO",
    "otherModules": [],
    "enableTranslation": false,
    "clientPackageManager": "npm",
    "skipClient": true,
    "skipUserManagement": true,
    "blueprints": [],
    "embeddableLaunchScript": false,
    "directoryPath": "../",
    "dockerRepositoryName": "",
    "dockerPushCommand": "docker push",
    "kubernetesNamespace": "test",
    "kubernetesServiceType": "LoadBalancer",
    "ingressDomain": "",
    "monitoring": "no",
    "istio": false
  }
}
ruddell commented 4 years ago

There is a related issue, https://github.com/jhipster/generator-jhipster/issues/9521, but not enough info was provided there.

Here are steps to reproduce locally through Docker with just the Registry and a monolith (no database or client). I wrote these up a while ago but then suddenly couldn't reproduce the issue. It seems to happen more often with a freshly created Registry container.

.yo-rc.json file
{
  "generator-jhipster": {
    "promptValues": {
      "packageName": "com.mycompany.myapp"
    },
    "jhipsterVersion": "6.5.1",
    "applicationType": "monolith",
    "baseName": "jhipster",
    "packageName": "com.mycompany.myapp",
    "packageFolder": "com/mycompany/myapp",
    "serverPort": "8080",
    "authenticationType": "jwt",
    "cacheProvider": "hazelcast",
    "enableHibernateCache": false,
    "websocket": false,
    "databaseType": "no",
    "devDatabaseType": "no",
    "prodDatabaseType": "no",
    "searchEngine": false,
    "messageBroker": false,
    "serviceDiscoveryType": "eureka",
    "buildTool": "maven",
    "enableSwaggerCodegen": false,
    "jwtSecretKey": "Y2I3MzA3ZjIyM2MzOTI1YzZjNGUwODYxY2U5NjkzNGQ3NDc5YTUyNjBiYzlmZTY3YmRmMDVkMWVhNjg1NjM3ZGMzZWY1MWJhYmU2YzEzZjg3ZTgyZmI3NDYzYTEwMWRiZGQyN2JiYWUxMTE3MThhMDUxODM5YmU2NTU1MDY1ZGE=",
    "embeddableLaunchScript": false,
    "useSass": true,
    "clientPackageManager": "npm",
    "skipClient": true,
    "creationTimestamp": 1576104597610,
    "testFrameworks": [],
    "jhiPrefix": "jhi",
    "entitySuffix": "",
    "dtoSuffix": "DTO",
    "otherModules": [],
    "enableTranslation": false,
    "blueprints": []
  }
}

AlexandreCassagne commented 4 years ago

@ruddell Thanks for the minimal example. When you say it happens more often with a new Registry container, what do you mean?

ruddell commented 4 years ago

If I recreate the entire stack, this issue happens more consistently. When restarting just the monolith containers, they sometimes find each other on startup.

AlexandreCassagne commented 4 years ago

Interesting. I haven't noticed that; for me it's consistent even without touching the registries.

gmarziou commented 4 years ago

I experienced a similar issue with Hazelcast and Eureka service discovery. I got it working by replacing JHipster's code, which uses Eureka to join the Hazelcast cluster, with code based on the hazelcast-eureka-one plugin.

So using hazelcast-kubernetes might help in your case, too (see the sketch below).
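
For illustration, a minimal sketch of what that could look like with the hazelcast-kubernetes discovery plugin (Hazelcast 3.x discovery SPI); the class name, service name, and namespace are placeholder assumptions:

import com.hazelcast.config.Config;
import com.hazelcast.config.DiscoveryStrategyConfig;
import com.hazelcast.kubernetes.HazelcastKubernetesDiscoveryStrategyFactory;

// Sketch only: let the Kubernetes discovery strategy resolve members instead of
// the static TCP/IP member list built from the Spring DiscoveryClient at startup.
class KubernetesDiscoverySketch {

    Config configure(Config config) {
        // The discovery SPI requires the other join mechanisms to be disabled.
        config.setProperty("hazelcast.discovery.enabled", "true");
        config.getNetworkConfig().getJoin().getMulticastConfig().setEnabled(false);
        config.getNetworkConfig().getJoin().getTcpIpConfig().setEnabled(false);

        DiscoveryStrategyConfig strategy =
            new DiscoveryStrategyConfig(new HazelcastKubernetesDiscoveryStrategyFactory());
        // Placeholder values: a (headless) service exposing the Hazelcast port.
        strategy.addProperty("service-name", "my-app-hazelcast");
        strategy.addProperty("namespace", "test");
        config.getNetworkConfig().getJoin().getDiscoveryConfig().addDiscoveryStrategyConfig(strategy);
        return config;
    }
}

Because members are resolved by querying the Kubernetes API rather than by a one-shot loop at startup, pods that register late can still be found.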

pascalgrimaud commented 4 years ago

I'm adding a bounty on this, as it seems to be a real bug.

jdubois commented 4 years ago

Yes, this is important, and it's logical that we have this kind of issue. I'm currently working with the Hazelcast team (for my "real" work!). For the record, I'm having a look at https://github.com/hazelcast/hazelcast-azure, so maybe I can find some ideas there.

I'm not sure we can easily find a generic solution.

gmarziou commented 4 years ago

In the case of Eureka, I am using metadata for the host and port, and I think the main difference is in the way the cluster join is hooked into Spring events. I followed the Hazelcast sample that injects an EurekaClient, while JHipster's code uses a generic DiscoveryClient; it's supposed to make no difference, but maybe the issue comes from Spring Cloud bootstrapping. My project is still on Spring Boot 1.5, so maybe this no longer applies.

I tend to think that JHipster's configuration code should follow the code samples provided by Hazelcast, even though that requires adding a specific dependency on hazelcast-eureka, hazelcast-azure, or hazelcast-kubernetes. This configuration could easily be isolated in dedicated classes (a sketch follows below).
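
As a rough illustration of such an isolated class, a hedged sketch of the Eureka variant, assuming the hazelcast-eureka-one plugin is on the classpath; the wrapper class is hypothetical and the property values are examples:

import com.hazelcast.config.Config;
import com.hazelcast.config.DiscoveryStrategyConfig;
import com.hazelcast.eureka.one.EurekaOneDiscoveryStrategyFactory;

// Sketch only: let the hazelcast-eureka-one plugin resolve cluster members from
// Eureka (including host/port published as instance metadata) instead of the
// static TCP/IP member list.
class EurekaOneDiscoverySketch {

    Config configure(Config config) {
        config.setProperty("hazelcast.discovery.enabled", "true");
        config.getNetworkConfig().getJoin().getMulticastConfig().setEnabled(false);
        config.getNetworkConfig().getJoin().getTcpIpConfig().setEnabled(false);

        DiscoveryStrategyConfig strategy =
            new DiscoveryStrategyConfig(new EurekaOneDiscoveryStrategyFactory());
        strategy.addProperty("self-registration", "true");               // register this member itself
        strategy.addProperty("use-metadata-for-host-and-port", "true");  // read host/port from Eureka metadata
        strategy.addProperty("namespace", "hazelcast");                  // example namespace
        config.getNetworkConfig().getJoin().getDiscoveryConfig().addDiscoveryStrategyConfig(strategy);
        return config;
    }
}

Depending on how the Eureka client is configured, the plugin may also need its Eureka client properties supplied (for example via a classpath eureka-client.properties); the plugin's own documentation covers those options.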

PierreBesson commented 4 years ago

For me this looks like a potentially huge bug that thankfully happens rarely (I have never seen it). Indeed, there is no failover for cluster discovery if the initial registration fails at startup time. We should definitely change our code to use the Hazelcast-recommended way, even if it uses technology-specific dependencies. Those can be conditionally enabled in the application by combining Maven/Gradle with a Spring profile, as we did for Zipkin support (see the sketch below).
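
A hedged sketch of that conditional wiring, assuming a hypothetical "k8s" Spring profile and reusing the Kubernetes discovery sketch shown earlier in the thread; this is not how JHipster currently generates CacheConfiguration:

import com.hazelcast.config.Config;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Profile;

// Sketch only: keep the platform-specific Hazelcast discovery in its own
// configuration class, activated by a dedicated (hypothetical) Spring profile,
// with the matching plugin dependency added by the Maven/Gradle build.
@Configuration
@Profile("k8s")
class KubernetesCacheDiscoveryConfiguration {

    @Bean
    Config hazelcastConfig() {
        Config config = new Config();
        config.getNetworkConfig().setPort(5701);
        // Delegate member resolution to the Kubernetes discovery strategy
        // (see the earlier sketch) instead of a startup-time TCP/IP list.
        return new KubernetesDiscoverySketch().configure(config);
    }
}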

AlexandreCassagne commented 4 years ago

Sadly, it is easy to reproduce for me; in fact, it happens any time I launch all instances simultaneously (e.g. kill all instances, wait, and start them again).

Perhaps my configuration is slightly different, though.

antarus commented 4 years ago

I have the same problem with Kubernetes, even after adding the hazelcast-kubernetes dependency and configuring it. My current 'solution': I launch one replica and, once it has started, I scale up...

I plan to test with Redis...

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 30 days with no activity. Our core developers tend to be more verbose on denying. If there is no negative comment, possibly this feature will be accepted. We are accepting PRs :smiley:. Comment or this will be closed in 7 days

AlexandreCassagne commented 4 years ago

I believe the issue is related to the readiness probes. Let's revisit this when the Spring Boot 2.3 release of JHipster is published.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 30 days with no activity. Our core developers tend to be more verbose on denying. If there is no negative comment, possibly this feature will be accepted. We are accepting PRs :smiley:. Comment or this will be closed in 7 days

Tcharl commented 2 years ago

@AlexandreCassagne time to dig into this again ;-)

AlexandreCassagne commented 2 years ago

Haha, perhaps so. I will look into it; I think this should be a rather quick issue to resolve.

DanielFran commented 2 years ago

@AlexandreCassagne do you have any news? Does the issue still persist?

AlexandreCassagne commented 2 years ago

Sorry @DanielFran, I dropped the ball on this one.

@jdubois did you get inspiration from working with Azure and the Hazelcast team?

AlexandreCassagne commented 2 years ago

@DanielFran I have some work going on, but I want to close this once and for all ;-)

@jdubois is it OK to completely rewrite the cache configuration portion and build a Kubernetes-only version? I want to migrate away from the JHipster Registry; I think Eureka is no longer needed, but that requires a lot of configuration changes (config maps? the Kubernetes discovery API? cache configuration? etc.).

jdubois commented 2 years ago

@AlexandreCassagne for Azure + Hazelcast it's a very specific implementation that uses something called Azure Ping, so it would only work on Azure VMs (quite a limited use case). Concerning Eureka, I'm afraid it's the only way to get the nodes to know about each other if you follow the usual JHipster/Spring architecture. You can probably rewrite this in a k8s-specific way, but then it would only work on k8s, and I'm afraid that would be another limitation, as many people don't use k8s.

DanielFran commented 1 year ago

@jdubois @gmarziou @PierreBesson @AlexandreCassagne This issue has been stale for a long time. Closing for now. If the problem persists, please open a new issue.