googleapis / java-bigtable-hbase

Java libraries and HBase client extensions for accessing Google Cloud Bigtable
https://cloud.google.com/bigtable/
Apache License 2.0

Beam + HBase 2.x support #2485

Open clehene opened 4 years ago

clehene commented 4 years ago

It seems that the Beam library (https://mvnrepository.com/artifact/com.google.cloud.bigtable/bigtable-hbase-beam/1.14.0) depends on hbase 1.x.

Is there a way to use Beam with the HBase 2.x client?

igorbernstein2 commented 4 years ago

We would need to publish a different artifact. hbase 2.x broke binary compatibility with hbase 1.x (which is why we needed to create bigtable-hbase-1.x & bigtable-hbase-2.x). To support hbase 2.x for beam, we would need to create hbase-2.x-beam, or define shims that abstract the compatibility differences.

I'd like to gain a better understanding of your use case. What is the motivation for wanting hbase 2.x apis? What are the features that you are interested in? The primary hbase surface that we expose in beam is the entities (Scans, Mutations, Puts, Results, etc.) and as far as I know that hasn't changed much in hbase 2.x.

Looking forward to your answers.

Best, Igor

clehene commented 4 years ago

Hi Igor.

We were already using HBase 2.x and Beam. We switched to BigTable and bigtable-hbase-2.x works great, but the Beam jobs started failing. I don't fully understand why bigtable-hbase-2.x doesn't do it out of the box and why an additional 3rd-party bigtable-hbase-and-beam is necessary, but I'm guessing it has to do with how Beam uses HBase?

In either case, our jobs have stopped working and we were planning on switching to BigTable so this blocks us.

Is having a bigtable-hbase-2x-beam hard? Would it perhaps be easier to just switch to BigtableIO in our code and side-step this completely?

We've been back and forth with Google Support over this, but their solution didn't seem to actually work, so we decided to ask here directly.

igorbernstein2 commented 4 years ago

I don't fully understand why bigtable-hbase-2.x doesn't do it out of the box and why an additional 3-party bigtable-hbase-and-beam is necessary, but I'm guessing it has to do with how Beam uses HBase?

I don't quite understand what you mean here. Are you asking why someone can't just use Beam's HBaseIO with bigtable-hbase-2.x? If so, the problem is that the HBaseIO tries to connect to zookeeper to get splits, which is hard to shim in bigtable-hbase.

Can you clarify what you mean by "switched to BigTable and bigtable-hbase-2.x works great but the Beam jobs started failing". Did you have pre-existing Beam jobs that used HBaseIO?

I think adding HBase 2x support for Beam makes sense, but it needs to be researched a bit first. The current development efforts are geared to refactoring the internals of bigtable-hbase to use http://www.github.com/googleapis/java-bigtable. So adding hbase 2x support to beam would have to come after that.

clehene commented 4 years ago

If so, the problem is that the HBaseIO tries to connect to zookeeper to get splits, which is hard to shim in bigtable-hbase.

Is anything preventing the HBaseIO impl from copying how TableInputFormat works using RegionLocator? Does it need to set watchers or something like that?

It would work with bigtable-hbase out of the box https://github.com/googleapis/java-bigtable-hbase/blob/master/bigtable-client-core-parent/bigtable-hbase/src/main/java/com/google/cloud/bigtable/hbase/BigtableRegionLocator.java ?

Can you clarify what you mean by "switched to BigTable and bigtable-hbase-2.x works great

Our jobs use HBaseIO and the rest of the (non Beam) code uses the regular HBase 2.x client.

Did you have pre-existing Beam jobs that used HBaseIO?

Yes. We were using HBase, switched to BigTable and Beam broke.

Since we need something now, I'm looking at any alternative and perhaps the easiest thing to change is to use BigtableIO for the Beam jobs when running in GCP.

igorbernstein2 commented 4 years ago

We've had discussions about fixing the compatibility issues between HBaseIO and bigtable-hbase, but it just hasn't been prioritized yet.

If I recall correctly, this was the problematic bit: https://github.com/apache/beam/blob/8239ff/sdks/java/io/hbase/src/main/java/org/apache/beam/sdk/io/hbase/HBaseUtils.java#L61-L73

In the longer term, fixing this compatibility seems like a viable approach to getting hbase 2.x compatibility in beam.

In the shorter term, have you tried using bigtable-hbase-beam, and forcing hbase-shaded-client 1.x using dependencyManagement in your pipelines? I believe the binary incompatibilities in hbase2x were mostly in the Table & Connection interfaces. The data classes (Put, Result, etc) remained backwards compatible. So downgrading to hbase-shaded-client 1x in your pipelines shouldn't cause issues.

clehene commented 4 years ago

We transitively get bigtable-hbase-2.x (from our "common") and 1.x from Beam. By forcing, you mean excluding the 2.x dependency?

[INFO] |  +- com.google.cloud.bigtable:bigtable-hbase-2.x:jar:1.14.0:compile
[INFO] |  |  +- com.google.cloud.bigtable:bigtable-hbase:jar:1.14.0:compile

and 

[INFO] +- com.google.cloud.bigtable:bigtable-hbase-beam:jar:1.14.0:compile
[INFO] |  +- com.google.cloud.bigtable:bigtable-hbase-1.x-shaded:jar:1.14.0:compile

With the above combo we got into this:

Exception in thread "main" java.io.IOException: java.lang.NoSuchMethodException: com.google.cloud.bigtable.hbase1_x.BigtableConnection.<init>(org.apache.hadoop.conf.Configuration, java.util.concurrent.ExecutorService, org.apache.hadoop.hbase.security.User)
    at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:232)
    at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:128)
    at io.tapwater.common.HBaseUtils.<init>(HBaseUtils.java:60)
    at io.tapwater.processor.pipeline.Pipeline.main(Pipeline.java:86)
Caused by: java.lang.NoSuchMethodException: com.google.cloud.bigtable.hbase1_x.BigtableConnection.<init>(org.apache.hadoop.conf.Configuration, java.util.concurrent.ExecutorService, org.apache.hadoop.hbase.security.User)
    at java.lang.Class.getConstructor0(Class.java:3082)
    at java.lang.Class.getDeclaredConstructor(Class.java:2178)
    at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:225)

igorbernstein2 commented 4 years ago

Yes, but I think you also have to make sure to force hbase-shaded-client to 1.4.12.
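
For readers following along, a minimal sketch of what forcing the client version through dependencyManagement could look like. The 1.4.12 version comes from this comment; where the block lives (a parent POM or the pipeline module's own POM) is an assumption, and it has not been verified against a real pipeline build.

```xml
<dependencyManagement>
  <dependencies>
    <!-- Pin hbase-shaded-client to the 1.x line so that any transitive
         HBase 2.x client resolves to this version instead. -->
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-shaded-client</artifactId>
      <version>1.4.12</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

A managed version applies to transitive dependencies as well as direct ones, so this pins the client version without per-dependency exclusions. It does not remove other 2.x-only artifacts such as bigtable-hbase-2.x, which would still need to be excluded separately.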

clehene commented 4 years ago

In case this helps others trying to get this working.

I used mvn dependency:tree to figure out which dependencies transitively include hbase-1.x / hbase-2.x.

Added exclusions for those:

      <exclusions>
        <!-- Drops the transitive HBase 2.x client; an exclusion takes only groupId and artifactId, not a version. -->
        <exclusion>
          <groupId>org.apache.hbase</groupId>
          <artifactId>hbase-shaded-client</artifactId>
        </exclusion>
      </exclusions>

Explicitly added hbase-1.x API deps:

    <dependency>
      <groupId>com.google.cloud.bigtable</groupId>
      <artifactId>bigtable-hbase-beam</artifactId>
    </dependency>

    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-shaded-client</artifactId>
      <version>${hbase-1.x-shaded-client.version}</version>
    </dependency>

    <dependency>
      <groupId>com.google.cloud.bigtable</groupId>
      <artifactId>bigtable-hbase-1.x</artifactId>
    </dependency>

clehene commented 4 years ago

@igorbernstein2 was there any progress on the refactoring you mentioned? In other words, is there a chance we'd get beam with hbase 2.x?

igorbernstein2 commented 4 years ago

Not yet

rravi-sift commented 8 months ago

@igorbernstein2, is there still a plan to get beam working with bigtable-hbase 2.x? We use a mono repo, and it's quite a hack to override the version down to 1.x just for beam support.