qubole / rubix

Cache File System optimized for columnar formats and object stores
Apache License 2.0
183 stars, 74 forks

EMR and RubiX issue #255

Open deema0 opened 5 years ago

deema0 commented 5 years ago

18/12/28 13:29:50 INFO RetryingBookkeeperClient: Error while connecting : org.apache.thrift.shaded.TApplicationException: getCacheStatus failed: unknown result
...
18/12/28 13:29:50 INFO CachingInputStream: Could not get cache status from server org.apache.thrift.shaded.TException
    at com.qubole.rubix.spi.RetryingBookkeeperClient.retryConnection(RetryingBookkeeperClient.java:95)
    at com.qubole.rubix.spi.RetryingBookkeeperClient.getCacheStatus(RetryingBookkeeperClient.java:47)
    at com.qubole.rubix.core.CachingInputStream.setupReadRequestChains(CachingInputStream.java:304)
    at com.qubole.rubix.core.CachingInputStream.readInternal(CachingInputStream.java:230)
    at com.qubole.rubix.core.CachingInputStream.read(CachingInputStream.java:184)

And I can see the following in /var/log/rubix/bks.log:

18/12/28 13:46:25,393 ERROR pool-6-thread-5 bookkeeper.BookKeeper: Could not initialize cluster nodes=[ip-172-31-21-10.us-west-2.compute.internal, ip-172-31-23-67.us-west-2.compute.internal] nodeHostName=ip-172-31-31-168.us-west-2.compute.internal nodeHostAddress=172.31.31.168 currentNodeIndex=-1
18/12/28 13:46:25,393 ERROR pool-6-thread-5 bookkeeper.BookKeeper: Node name is null for Cluster TypeHADOOP2_CLUSTER_MANAGER
18/12/28 13:46:25,394 ERROR pool-6-thread-5 bookkeeper.BookKeeper: Could not initialize cluster nodes=[ip-172-31-21-10.us-west-2.compute.internal, ip-172-31-23-67.us-west-2.compute.internal] nodeHostName=ip-172-31-31-168.us-west-2.compute.internal nodeHostAddress=172.31.31.168 currentNodeIndex=-1
18/12/28 13:46:25,394 ERROR pool-6-thread-5 bookkeeper.BookKeeper: Node name is null for Cluster TypeHADOOP2_CLUSTER_MANAGER
18/12/28 13:46:25,394 ERROR pool-6-thread-5 bookkeeper.BookKeeper: Could not initialize cluster nodes=[ip-172-31-21-10.us-west-2.compute.internal, ip-172-31-23-67.us-west-2.compute.internal] nodeHostName=ip-172-31-31-168.us-west-2.compute.internal nodeHostAddress=172.31.31.168 currentNodeIndex=-1
18/12/28 13:46:25,394 ERROR pool-6-thread-5 bookkeeper.BookKeeper: Node name is null for Cluster TypeHADOOP2_CLUSTER_MANAGER

Hi,

We are aware of this issue. The errors occur because nodeName and currentNodeIndex are not being set. The main reason is that the list of nodes the cluster manager provides doesn't include the master node, so the bookkeeper on the master cannot find its own host in that list (hence currentNodeIndex=-1 in bks.log). The driver running on the master node tries to get the cache status of a file from its local bookkeeper, and that call throws the exception. The exception is not going to cause any job failure, and the executors should still be able to read data from the RubiX cache properly.
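
To illustrate why currentNodeIndex ends up as -1, here is a minimal sketch in Java. It is not RubiX's actual code: the class NodeIndexSketch and the method indexOfCurrentNode are hypothetical names made up for this example, and the lookup-by-name-then-address logic is an assumption. The host names and the address are the ones from the bks.log output above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, NOT RubiX's implementation: shows how looking up the
// local host in a node list that omits the master yields an index of -1.
class NodeIndexSketch {

    // Assumed lookup: try the host name first, then the IP address.
    // Returns -1 when the local node is not in the list at all.
    static int indexOfCurrentNode(List<String> clusterNodes,
                                  String localHostName,
                                  String localHostAddress) {
        int byName = clusterNodes.indexOf(localHostName);
        return byName >= 0 ? byName : clusterNodes.indexOf(localHostAddress);
    }

    public static void main(String[] args) {
        // Node list as reported in bks.log: only the two worker nodes.
        List<String> nodes = Arrays.asList(
            "ip-172-31-21-10.us-west-2.compute.internal",
            "ip-172-31-23-67.us-west-2.compute.internal");

        // The master node is missing from the list, so the lookup fails.
        int masterIndex = indexOfCurrentNode(
            nodes, "ip-172-31-31-168.us-west-2.compute.internal", "172.31.31.168");
        System.out.println("master currentNodeIndex=" + masterIndex);

        // A worker node is present in the list, so the lookup succeeds.
        int workerIndex = indexOfCurrentNode(
            nodes, "ip-172-31-21-10.us-west-2.compute.internal", "172.31.21.10");
        System.out.println("worker currentNodeIndex=" + workerIndex);
    }
}
```

Under these assumptions, the lookup for the master prints -1 and the lookup for the first worker prints 0, which matches the pattern in the logs: only the bookkeeper on the master fails to initialize.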

You can file an issue on GitHub and we will treat it as a priority. Please let us know if your main job is failing because of this exception.

Regards, Abhishek