vert-x3 / vertx-hazelcast

Hazelcast Cluster Manager for Vert.x
Apache License 2.0

Timed out waiting to get lock in Kubernetes #106

Closed ufoscout closed 5 years ago

ufoscout commented 5 years ago

I have a clustered Vert.x deployment in Kubernetes. Everything works perfectly except that it is impossible to acquire any lock. I can get/set data in shared data and I can even send events over the event bus, but any attempt to use sharedData.getLock() or sharedData.getLockWithTimeout() fails with a timeout exception. This odd behavior happens even with a single node in the cluster.

My code:

    var lock: Lock? = null
    try {
        lock = awaitResult {
            sharedData.getLockWithTimeout("BATCH_STATUS_LOCK", 2500, it)
        }
    } catch (e: RuntimeException) {
        logger.error("Batch [{}] Execution failed", name, e)
    } finally {
        lock?.release()
    }

The exception:

io.vertx.core.VertxException: Timed out waiting to get lock BATCH_STATUS_LOCK
        at io.vertx.spi.cluster.hazelcast.HazelcastClusterManager.lambda$getLockWithTimeout$3(HazelcastClusterManager.java:217)
        at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$2(ContextImpl.java:272)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)

I am using Vert.x 3.6.2 and hazelcast-kubernetes 1.2.2.

The Vert.x cluster configuration:

    suspend fun getVertx(): Vertx {
        val hazelcastConfig = com.hazelcast.config.Config()

        hazelcastConfig.setProperty("hazelcast.logging.type", "slf4j")
        hazelcastConfig.setProperty("hazelcast.discovery.enabled", "true")

        hazelcastConfig.networkConfig.join.multicastConfig.isEnabled = false
        hazelcastConfig.networkConfig.join.tcpIpConfig.isEnabled = false

        val strategyConfig = DiscoveryStrategyConfig(HazelcastKubernetesDiscoveryStrategyFactory())
        strategyConfig.addProperty("namespace", "default")
        strategyConfig.addProperty("service-name", "vertx-cluster-service")
        hazelcastConfig.networkConfig.join.discoveryConfig.addDiscoveryStrategyConfig(strategyConfig)

        val mgr = HazelcastClusterManager(hazelcastConfig)
        val options = VertxOptions().setClusterManager(mgr)

        return awaitResult<Vertx> {
            Vertx.clusteredVertx(options, it)
        }
    }
tsegismont commented 5 years ago

@ufoscout thanks for reporting.

I don't believe the Hazelcast instance config is the issue here: 1/ your nodes discover each other, 2/ the other shared data APIs work fine.

Can you reproduce on your development machine? Can you provide a full reproducer?

ufoscout commented 5 years ago

@tsegismont As a workaround, I used a counter instead of the lock. I could try to create a reproducer, but this is probably the busiest moment of my life... :(

tsegismont commented 5 years ago

Thanks for letting me know. I'll go ahead and close. Please reopen if/when you can provide a reproducer.

adhesivee commented 5 years ago

I encountered the same issue. The cause was using the wrong configuration. So instead of:

    val hazelcastConfig = com.hazelcast.config.Config()

use:

    val hazelcastConfig = ConfigUtil.loadConfig()
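A likely explanation, assuming the stock vertx-hazelcast defaults: the cluster manager backs each Vert.x lock with a Hazelcast semaphore whose name carries the __vertx. prefix, and the default-cluster.xml bundled in the vertx-hazelcast jar grants those semaphores one initial permit. A bare com.hazelcast.config.Config() ignores that file, so the lock semaphores start with zero permits and every acquire times out. The relevant fragment of the bundled default configuration looks roughly like this:

    <semaphore name="__vertx.*">
        <initial-permits>1</initial-permits>
    </semaphore>

ConfigUtil.loadConfig() falls back to the bundled default-cluster.xml when no custom cluster.xml is found on the classpath, which is why switching to it makes getLockWithTimeout work.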
tsegismont commented 5 years ago

Thanks for sharing @adhesivee !

flyGetHu commented 8 months ago

This also depends on the cluster configuration file:

<?xml version="1.0" encoding="UTF-8"?>

<hazelcast xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.hazelcast.com/schema/config" xsi:schemaLocation="http://www.hazelcast.com/schema/config https://www.hazelcast.com/schema/config/hazelcast-config-4.2.xsd">

<multimap name="__vertx.subs">
    <backup-count>1</backup-count>
    <value-collection-type>SET</value-collection-type>
</multimap>

<map name="__vertx.haInfo">
    <backup-count>1</backup-count>
</map>

<map name="__vertx.nodeInfo">
    <backup-count>1</backup-count>
</map>

<cp-subsystem>
    <cp-member-count>0</cp-member-count>
    <semaphores>
        <semaphore>
            <name>__vertx.*</name>
            <jdk-compatible>false</jdk-compatible>
            <initial-permits>1</initial-permits>
        </semaphore>
    </semaphores>
</cp-subsystem>

<properties>
    <property name="hazelcast.logging.type">log4j2</property>
</properties>
</hazelcast>

The configuration above works; the one below does not. It seems to be a problem with the CP subsystem:

<?xml version="1.0" encoding="UTF-8"?>

<hazelcast xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.hazelcast.com/schema/config" xsi:schemaLocation="http://www.hazelcast.com/schema/config https://www.hazelcast.com/schema/config/hazelcast-config-4.2.xsd">

<!-- Configuration docs: https://docs.hazelcast.com/imdg/latest/ -->

<multimap name="__vertx.subs">
    <backup-count>1</backup-count>
    <value-collection-type>SET</value-collection-type>
</multimap>

<map name="__vertx.haInfo">
    <backup-count>1</backup-count>
</map>

<map name="__vertx.nodeInfo">
    <backup-count>1</backup-count>
</map>
<!-- The CP subsystem guarantees data consistency for distributed locks, distributed counters, shared maps, and similar operations -->
<!--    <cp-subsystem>-->
<!--        &lt;!&ndash;   To keep the CP subsystem available, you can run cp-member-count services unrelated to business logic to maintain the minimum CP membership    &ndash;&gt;-->
<!--        <cp-member-count>0</cp-member-count>-->
<!--        <group-size>0</group-size>-->
<!--        <session-time-to-live-seconds>30</session-time-to-live-seconds>-->
<!--        <session-heartbeat-interval-seconds>3</session-heartbeat-interval-seconds>-->
<!--        <missing-cp-member-auto-removal-seconds>360</missing-cp-member-auto-removal-seconds>-->
<!--        <fail-on-indeterminate-operation-state>true</fail-on-indeterminate-operation-state>-->
<!--        <semaphores>-->
<!--            <semaphore>-->
<!--                <name>__vertx.*</name>-->
<!--                <jdk-compatible>false</jdk-compatible>-->
<!--                <initial-permits>1</initial-permits>-->
<!--            </semaphore>-->
<!--        </semaphores>-->
<!--    </cp-subsystem>-->

<properties>
    <property name="hazelcast.logging.type">log4j2</property>
</properties>
</hazelcast>