Issue when scaling ms to more than one - vertx timeout

roeP commented 4 years ago

Hi,

I've set up Hazelcast API discovery:

`<discovery-strategy enabled="true" class="com.hazelcast.kubernetes.HazelcastKubernetesDiscoveryStrategy">
    cluster01
</discovery-strategy>
</discovery-strategies>`

and a service for each ms:

`apiVersion: v1
kind: Service
metadata:
  name: test-ms
  namespace: {{ .Values.nameSpace }}
  labels:
    cluster01: "true"
    environment: {{ .Values.environment }}
spec:
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: hazelcast
      port: 5701
    - name: vertex
      port: 15701
      targetPort: 15701
  selector:
    app: test-ms
    branch: {{ .Values.branch }}
    environment: {{ .Values.environment }}`

and the ms deployment looks like this:

`containers:
  - name: test-ms
    image: "{{ .Values.image.repository }}/test-ms:{{ .Chart.AppVersion }}"
    args: ["-conf", "/usr/share/config.json", "-cluster", "-cluster-port", "15701", "-cluster-host", "account-ms"]`

When I run a single instance of the ms, everything works as expected and each ms joins the cluster. However, when I scale the ms to 2 or more, I get timeout errors from vert.x (all ms's still join the cluster):

io.vertx.core.eventbus.ReplyException: Timed out after waiting 30000(ms) for a reply. address: __vertx.reply.1ea6d9e7-b99b-4144-9f62-b01fceb0249b, repliedAddress: backend.user.read.verify.authenticated
    at io.vertx.core.eventbus.impl.HandlerRegistration.lambda$new$0(HandlerRegistration.java:78)
    at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:923)
    at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:887)
    at io.vertx.core.impl.ContextImpl.executeTask(ContextImpl.java:369)
    at io.vertx.core.impl.EventLoopContext.execute(EventLoopContext.java:43)
    at io.vertx.core.impl.ContextImpl.executeFromIO(ContextImpl.java:232)
    at io.vertx.core.impl.ContextImpl.executeFromIO(ContextImpl.java:224)
    at io.vertx.core.impl.VertxImpl$InternalTimerHandler.run(VertxImpl.java:913)
    at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
    at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:139)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518)
    at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:832)

It looks like some of the calls go through test-ms A and some go through test-ms B, and for some reason the vert.x event bus times out. When the calls happen to use only one instance of the ms, they work. The app logic is not the problem, since each ms is independent and isn't missing any information (the error is a timeout, not logic-related).
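To make the intermittent pattern concrete, here is a toy model in plain Java (no Vert.x or Hazelcast involved). It relies on one fact about the event bus: a point-to-point send or request is delivered to a single consumer, chosen round-robin among those registered on the address. Everything else is an assumption made for illustration: the `Address` class, the 50 ms timeout, and above all the premise that replies from any replica other than the first never arrive. It is not a diagnosis of this setup, only a sketch of why a problem affecting one replica would show up as timeouts on some calls and not on others.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

// Toy model only: plain Java (9+), no Vert.x or Hazelcast involved.
class RoundRobinTimeoutSketch {

    // Hypothetical stand-in for an event bus address with several registered consumers.
    static final class Address {
        private final List<Function<String, CompletableFuture<String>>> consumers = new ArrayList<>();
        private int next = 0;

        void register(Function<String, CompletableFuture<String>> consumer) {
            consumers.add(consumer);
        }

        // Point-to-point delivery: each request goes to exactly one consumer, chosen round-robin.
        CompletableFuture<String> request(String body, long timeoutMs) {
            Function<String, CompletableFuture<String>> target = consumers.get(next);
            next = (next + 1) % consumers.size();
            return target.apply(body).orTimeout(timeoutMs, TimeUnit.MILLISECONDS);
        }
    }

    // Sends `count` requests to `replicas` consumers and records the outcome of each request.
    static List<String> run(int replicas, int count) {
        Address address = new Address();
        // The first replica replies immediately.
        address.register(body -> CompletableFuture.completedFuture("A:" + body));
        // ASSUMPTION: every additional replica is unreachable, so its reply never arrives.
        for (int i = 1; i < replicas; i++) {
            address.register(body -> new CompletableFuture<>());
        }
        List<String> outcomes = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            try {
                address.request("req-" + i, 50).get();
                outcomes.add("ok");
            } catch (ExecutionException e) {
                outcomes.add(e.getCause() instanceof TimeoutException ? "timeout" : "error");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                outcomes.add("interrupted");
            }
        }
        return outcomes;
    }

    public static void main(String[] args) {
        System.out.println("1 replica:  " + run(1, 4)); // [ok, ok, ok, ok]
        System.out.println("2 replicas: " + run(2, 4)); // [ok, timeout, ok, timeout]
    }
}
```

With one replica every request succeeds; with two, every second request times out. That matches the "works with 1 ms, fails when scaled" pattern only under the stated assumption that something keeps the second replica's replies from getting back to the sender.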

How can I tackle this issue? Is it a Hazelcast plugin configuration issue or a vert.x discovery issue? Do I need to configure each call to stay within the same set of ms instances, or something like that?

leszko commented 4 years ago

@roeP Are you able to reproduce the issue with Hazelcast alone (without vert.x)?

roeP commented 4 years ago

@leszko No, I'm afraid I can't do that without breaking the code. We're currently running on Docker Swarm without ms scaling, and I'm working on the k8s migration, so we are pretty deep with vert.x in our code.

hazelcast / hazelcast-kubernetes

Issue when scaling ms to more than one - vertx timeout #206