hortonworks-spark / shc

The Apache Spark - Apache HBase Connector is a library that lets Spark access HBase tables as an external data source or sink.
Apache License 2.0

Added Region caching #306

Open Kamil-Krynicki opened 5 years ago

Kamil-Krynicki commented 5 years ago

What changes were proposed in this pull request?

My team and I have been testing the shc connector at high and very high throughput. We found that its performance dropped significantly at around 400k reads per second and that, prior to these changes, we could never sustain more than 500k reads per second.

We managed to pinpoint and patch the problem: repeated region queries were saturating the cluster.

Our changes are controlled by parameters and are disabled by default.
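For readers unfamiliar with the idea, here is a minimal sketch of opt-in caching for region lookups. It is not the code in this pull request and does not use the HBase or shc APIs. The class `RegionCacheSketch`, the method `regionsFor`, the `cachingEnabled` flag, and the fake lookup function are all hypothetical names chosen for illustration. It only shows the general technique: with the flag off, every call goes to the cluster; with it on, repeated calls for the same table are served from memory.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: memoizes per-table region lookups behind an opt-in flag.
class RegionCacheSketch {
    private final boolean cachingEnabled;                         // assumed parameter, off by default in the PR
    private final Function<String, List<String>> regionLookup;    // stands in for the expensive cluster query
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();

    RegionCacheSketch(boolean cachingEnabled, Function<String, List<String>> regionLookup) {
        this.cachingEnabled = cachingEnabled;
        this.regionLookup = regionLookup;
    }

    // Returns the regions of a table, querying the cluster at most once per table when caching is on.
    List<String> regionsFor(String table) {
        if (!cachingEnabled) {
            return regionLookup.apply(table);
        }
        return cache.computeIfAbsent(table, regionLookup);
    }

    public static void main(String[] args) {
        AtomicInteger clusterCalls = new AtomicInteger();
        // Fake lookup that counts how often the "cluster" is asked.
        Function<String, List<String>> lookup = table -> {
            clusterCalls.incrementAndGet();
            return List.of(table + ",region-0", table + ",region-1");
        };

        RegionCacheSketch uncached = new RegionCacheSketch(false, lookup);
        for (int i = 0; i < 3; i++) {
            uncached.regionsFor("events");
        }
        System.out.println("lookups without cache: " + clusterCalls.get()); // prints 3

        clusterCalls.set(0);
        RegionCacheSketch cached = new RegionCacheSketch(true, lookup);
        for (int i = 0; i < 3; i++) {
            cached.regionsFor("events");
        }
        System.out.println("lookups with cache: " + clusterCalls.get()); // prints 1
    }
}
```

A cache like this trades freshness for throughput: if regions split or move, cached entries go stale, so a real implementation would also need some form of invalidation or expiry.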

How was this patch tested?

This patch was tested manually. It has also been used extensively in our tests in CERN's NxCals project for the past 3 weeks. We have deemed it stable and efficient enough to move to our production environment.