pravega / pravega

Pravega - Streaming as a new software defined storage primitive
http://pravega.io
Apache License 2.0
1.99k stars 406 forks source link

lab deployment: Tier1 should retry acquiring DL lock for the container #277

Closed arvindkandhare closed 7 years ago

arvindkandhare commented 7 years ago

There is a chance the lock may not be available for the first time. Tier1 should backoff and try again.

21:29:01.680 [pool-1-thread-4] ERROR com.emc.pravega.service.server.logs.DurableLog - DurableLog[1] Recovery FAILED. {} com.emc.pravega.service.storage.DataLogWriterNotPrimaryException: Unable to acquire exclusive Writer for log 'container_1'. at com.emc.pravega.service.storage.impl.distributedlog.LogHandle.initialize(LogHandle.java:173) at com.emc.pravega.service.storage.impl.distributedlog.LogClient.getLogHandle(LogClient.java:177) at com.emc.pravega.service.storage.impl.distributedlog.DistributedLogDataLog.initialize(DistributedLogDataLog.java:80) at com.emc.pravega.service.server.logs.DurableLog.performRecovery(DurableLog.java:305) at com.emc.pravega.service.server.logs.DurableLog.lambda$doStart$0(DurableLog.java:144) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: com.twitter.distributedlog.exceptions.OwnershipAcquireFailedException: LockPath - /messaging/distributedlog/mynamespace/container_1//lock: Lock acquisition failed, the current owner is interactive-console at com.twitter.distributedlog.lock.ZKSessionLock$8.apply(ZKSessionLock.java:587) at com.twitter.distributedlog.lock.ZKSessionLock$8.apply(ZKSessionLock.java:584) at com.twitter.util.Future$$anonfun$map$1$$anonfun$apply$8.apply(Future.scala:1042) at com.twitter.util.Try$.apply(Try.scala:13) at com.twitter.util.Future$.apply(Future.scala:132) at com.twitter.util.Future$$anonfun$map$1.apply(Future.scala:1042) at com.twitter.util.Future$$anonfun$map$1.apply(Future.scala:1041) at com.twitter.util.Promise$Transformer.liftedTree1$1(Promise.scala:112) at com.twitter.util.Promise$Transformer.k(Promise.scala:112) at com.twitter.util.Promise$Transformer.apply(Promise.scala:122) at com.twitter.util.Promise$Transformer.apply(Promise.scala:103) at com.twitter.util.Promise$$anon$1.run(Promise.scala:357) at com.twitter.concurrent.LocalScheduler$Activation.run(Scheduler.scala:178) at com.twitter.concurrent.LocalScheduler$Activation.submit(Scheduler.scala:136) at com.twitter.concurrent.LocalScheduler.submit(Scheduler.scala:207) at com.twitter.concurrent.Scheduler$.submit(Scheduler.scala:92) at com.twitter.util.Promise.runq(Promise.scala:350) at com.twitter.util.Promise.updateIfEmpty(Promise.scala:716) at com.twitter.util.Promise.update(Promise.scala:694) at com.twitter.util.Promise.setValue(Promise.scala:670) at com.twitter.distributedlog.lock.ZKSessionLock$9.safeRun(ZKSessionLock.java:639) at org.apache.bookkeeper.util.SafeRunnable.run(SafeRunnable.java:31) ... 7 common frames omitted

tkaitchuck commented 7 years ago

The line

OwnershipAcquireFailedException: LockPath - /messaging/distributedlog/mynamespace/container_1//lock: Lock acquisition failed, the current owner is interactive-console

Makes me think that this is not a retryable thing. There is another owner. Am I missing something?

arvindkandhare commented 7 years ago

The scenario is where ownership change is triggered by the controller. The new owner is supposed to take over the container and hits this exception. If we do not retry, this container will basically be unserviced indefinitely.

arvindkandhare commented 7 years ago

Here are the logs and more details.

As the target host tries only once to acquire the lock because of the issue shown in the logs below, the acquirelock is tried only once. If both the instances are restarted again, the right instance acquires the lock sometimes. The restart can be avoided if the target host keeps retrying locks for the containers it knows it owns.

  1. On host 1, when it loses the ownership of the container, it hits this exception:

java.util.concurrent.TimeoutException: Timeout on waiting for ReadAhead worker to stop container_1: at com.twitter.distributedlog.readahead.ReadAheadWorker.asyncClose(ReadAheadWorker.java:377) at com.twitter.distributedlog.io.AsyncCloseable$1.apply(AsyncCloseable.java:34) at com.twitter.distributedlog.io.AsyncCloseable$1.apply(AsyncCloseable.java:31) at com.twitter.distributedlog.util.FutureUtils$ListFutureProcessor.run(FutureUtils.java:201) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)

  1. At host 2, the acquire fails.: com.emc.pravega.service.storage.DataLogWriterNotPrimaryException: Unable to acquire exclusive Writer for log 'container_1'. at com.emc.pravega.service.storage.impl.distributedlog.LogHandle.initialize(LogHandle.java:173) at com.emc.pravega.service.storage.impl.distributedlog.LogClient.getLogHandle(LogClient.java:177) at com.emc.pravega.service.storage.impl.distributedlog.DistributedLogDataLog.initialize(DistributedLogDataLog.java:80) at com.emc.pravega.service.server.logs.DurableLog.performRecovery(DurableLog.java:305) at com.emc.pravega.service.server.logs.DurableLog.lambda$doStart$0(DurableLog.java:144) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: com.twitter.distributedlog.exceptions.OwnershipAcquireFailedException: LockPath - /messaging/distributedlog/mynamespace/container_1//lock: Lock acquisition failed, the current owner is lviprbig250155.lss.emc.com-12345
arvindkandhare commented 7 years ago

This issue is no more observed with the latest code. Will reopen if this is observed again.