googleapis / java-bigtable-hbase

Java libraries and HBase client extensions for accessing Google Cloud Bigtable
https://cloud.google.com/bigtable/
Apache License 2.0

HBaseResultCoder fails to serialize KeyValue instances #1227

Closed gsgalloway closed 7 years ago

gsgalloway commented 7 years ago

HBaseResultCoder (in bigtable-hbase-dataflow) works well for Result instances built from Bigtable's RowCell implementation of HBase's Cell interface, but when given a Result built from HBase's KeyValue instead, it corrupts the output Result.

I couldn't find exactly where the breakdown occurs, but it seems related to the fact that calling RowCell.toString() throws an ArrayIndexOutOfBoundsException.

A simple test for RowCell.toString():

import com.google.bigtable.repackaged.com.google.cloud.hbase.adapters.read.RowCell;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.KeyValue;
import org.junit.Test;

@Test
public void testStringifyRowCell() throws Exception {
    Cell keyValue = CellUtil.createCell("key".getBytes(), "family".getBytes(), "value".getBytes());
    Cell rowCell = new RowCell("key".getBytes(), "family".getBytes(), "qualifier".getBytes(), System.currentTimeMillis(), "value".getBytes());

    keyValue.toString();
    rowCell.toString();  // ArrayIndexOutOfBoundsException
}

Stack trace from test failure:

java.lang.ArrayIndexOutOfBoundsException: 27495

    at org.apache.hadoop.hbase.KeyValue.keyToString(KeyValue.java:1231)
    at org.apache.hadoop.hbase.KeyValue.keyToString(KeyValue.java:1190)
    at com.google.bigtable.repackaged.com.google.cloud.hbase.adapters.read.RowCell.toString(RowCell.java:234)

Another test illustrating HBaseResultCoder's incompatibility with KeyValue:

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellComparator;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.junit.Test;
import com.google.cloud.bigtable.dataflow.coders.HBaseResultCoder;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import static org.junit.Assert.assertTrue;

@Test
public void testHBaseResultCoderWithKeyValue() throws Exception {
    // Given
    // -----
    Cell inputKeyValue = CellUtil.createCell("key".getBytes(), "family".getBytes(), "value".getBytes());
    Result inputResult = Result.create(new Cell[]{inputKeyValue});

    HBaseResultCoder coder = HBaseResultCoder.getInstance();

    // When
    // -----
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    coder.encode(inputResult, outputStream, null);
    ByteArrayInputStream inputStream = new ByteArrayInputStream(outputStream.toByteArray());
    Result outputResult = coder.decode(inputStream, null);
    Cell outputRowCell = outputResult.listCells().get(0);

    // Constructor for KeyValue(Cell c) located here https://hbase.apache.org/1.2/xref/org/apache/hadoop/hbase/KeyValue.html#747
    Cell keyValueFromOutputRowCell = new KeyValue(outputRowCell);

    // Then
    // -----
    // Print statements to show unusual byte string in the decoded Cell
    System.out.println(inputKeyValue.toString());              // prints: key/family:value/LATEST_TIMESTAMP/Maximum/vlen=0/seqid=0
    System.out.println(keyValueFromOutputRowCell.toString());  // prints: key/\x00\x00\x00\x1A\x00\x00\x00\x00\x00\x03key\x06familyvalue\x7F\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD\xEF\xBF\xBD:\x00\x00\x00\x1A\x00\x00\x00\x00\x00\x03key\x06familyvalue\x7F\xFF\xFF\xFF\xFF\xFF\xFF\xFF\xFF/-1/Put/vlen=34/seqid=0

    // Output doesn't equal the input, regardless of whether output and input are the same Cell implementation
    assertTrue(CellComparator.equals(inputKeyValue, outputRowCell));  // AssertionError
    assertTrue(CellComparator.equals(inputKeyValue, keyValueFromOutputRowCell));  // AssertionError

    // Input and output RowArrays are unequal
    assertTrue(Arrays.equals(inputKeyValue.getRowArray(), outputRowCell.getRowArray()));  // AssertionError

    // Input and output Result instances are unequal
    Result.compareResults(inputResult, outputResult);   // ArrayIndexOutOfBoundsException caused by attempt to call RowCell.toString()
}

The relevant dependencies in our pom.xml:

        <dependency>
            <groupId>com.google.cloud.dataflow</groupId>
            <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
            <version>1.9.0</version>
        </dependency>

        <dependency>
            <groupId>com.google.api-client</groupId>
            <artifactId>google-api-client</artifactId>
            <version>1.22.0</version>
            <exclusions>
                <!-- Exclude an old version of guava that is being pulled
                     in by a transitive dependency of google-api-client -->
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava-jdk5</artifactId>
                </exclusion>
            </exclusions>
        </dependency>

        <!-- This must be at least v0.9.3, due to a bug in the HBaseMutationCoder that
             breaks serialization of Put objects -->
        <dependency>
            <groupId>com.google.cloud.bigtable</groupId>
            <artifactId>bigtable-hbase-dataflow</artifactId>
            <version>0.9.6-SNAPSHOT</version>
        </dependency>

        <!-- bigtable-hbase-dataflow and bigtable-hbase-1.2 are incompatible-->
        <dependency>
            <groupId>com.google.cloud.bigtable</groupId>
            <artifactId>bigtable-hbase-shaded-for-dataflow</artifactId>
            <version>[0.9.5.1, 1.0.0)</version>
        </dependency>
plelevier commented 7 years ago

I've got the same problem here:

FlatRowAdapter instance = new FlatRowAdapter();
FlatRow row = FlatRow.newBuilder().withRowKey(ByteString.copyFromUtf8("key1"))
    .addCell("family1", ByteString.copyFrom("qualifier1".getBytes()), 54321L, ByteString.copyFrom("value1".getBytes()))
    .build();
Result result = instance.adaptResponse(row);
result.toString(); // throws java.lang.ArrayIndexOutOfBoundsException

Here is the stack trace:

java.lang.ArrayIndexOutOfBoundsException: 27495
        at org.apache.hadoop.hbase.KeyValue.keyToString(KeyValue.java:1236)
        at org.apache.hadoop.hbase.KeyValue.keyToString(KeyValue.java:1195)
        at com.google.cloud.bigtable.hbase.adapters.read.RowCell.toString(RowCell.java:234)
        at org.apache.hadoop.hbase.client.Result.toString(Result.java:824)
sduskis commented 7 years ago

Thanks for finding the problems. We'll take a look at this. FWIW, Pull Requests are heartily welcome.

sduskis commented 7 years ago

RowCell.toString() turns out to be easy to fix. I'll get a new -SNAPSHOT out today with the fix.

HBaseResultCoder has two problems:

  1. It's fundamentally broken when converting a KeyValue to a FlatRow.Cell in HBaseResultCoder.encode().
  2. Even if a KeyValue is correctly translated to a FlatRow.Cell, HBaseResultCoder.decode() will create a RowCell instead of a KeyValue, so the input and output Result instances will still compare unequal.

I use a Bigtable FlatRow to encode/decode HBase Results, which might be the wrong approach. We'll fix FlatRowAdapter for case 1, but that won't fix case 2. I'm going to have to mull over these two issues.
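
For case 2, tests could compare cells by their logical contents instead of relying on the implementation class. A minimal sketch of such a comparison, using plain HBase API calls (the helper name `cellsFieldEqual` is ours, not part of any library):

```java
import java.util.Arrays;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;

public class CellComparison {
    // Compare two Cells by their five logical fields (row, family,
    // qualifier, timestamp, value) rather than by their backing-array
    // layout, so a KeyValue and a RowCell holding the same data
    // compare equal.
    static boolean cellsFieldEqual(Cell a, Cell b) {
        return Arrays.equals(CellUtil.cloneRow(a), CellUtil.cloneRow(b))
            && Arrays.equals(CellUtil.cloneFamily(a), CellUtil.cloneFamily(b))
            && Arrays.equals(CellUtil.cloneQualifier(a), CellUtil.cloneQualifier(b))
            && a.getTimestamp() == b.getTimestamp()
            && Arrays.equals(CellUtil.cloneValue(a), CellUtil.cloneValue(b));
    }
}
```

That wouldn't make the coder round-trip faithful, but it would keep assertions from failing purely because decode() returns a different Cell implementation than the one that was encoded.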

igorbernstein2 commented 7 years ago

Here is a fix for 1: https://github.com/GoogleCloudPlatform/cloud-bigtable-client/pull/1232

sduskis commented 7 years ago

@gsgalloway and @plelevier, I'm curious about your use case. The Cloud Bigtable version of HBaseResultCoder was intended to be used with CloudBigtableIO, which creates RowCells. Can you elaborate on why you need this?

FWIW, there's a new HBase Apache Beam connector that pretty much copied what we did. Here's their version of HBaseResultCoder.

gsgalloway commented 7 years ago

Sure thing. In short, we were attempting to unit test a DoFn that operates on Result instances, and found that the tooling we use to generate Results for testing produces KeyValues, not RowCells.

In more detail: we ran into this issue while unit testing a DoFn that operates on Result instances fetched from CloudBigtableIO and uses the hbase-object-mapper library to map HBase Results to POJOs.

This library readily converts Results backed by either RowCells or KeyValues into POJOs, but when converting from POJOs back to Results it only produces KeyValues.

The test that highlighted this issue:

@Test
public void testDoFn() throws Exception {
    // Given
    // -----
    HBaseObjectMapper objectMapper = new HBaseObjectMapper();
    OurPOJO givenPojo = new OurPOJO(. . .);

    // This Result is created using `KeyValue` and not `RowCell`
    Result inputResult = objectMapper.writeValueAsResult(givenPojo);

    Pipeline p = TestPipeline.create();

    PCollection<Result> inputPCollection = p.apply(Create.of(inputResult).withCoder(HBaseResultCoder.getInstance()));

    // When
    // -----
    PCollection<OurPOJO> output = inputPCollection.apply(ParDo.of(new ConvertFromHBaseResultToOurPOJO()));

    // Then
    // -----
    DataflowAssert.thatSingleton(output).isEqualTo(givenPojo);
    p.run();
}

We were able to work around this issue by coercing the KeyValues into RowCells:

@Test
public void testDoFn() throws Exception {
    // Given
    // -----
    HBaseObjectMapper objectMapper = new HBaseObjectMapper();
    OurPOJO givenPojo = new OurPOJO(. . .);

    // Coerce KeyValues into RowCells for compatibility with HBaseResultCoder
    Result inputResult = objectMapper.writeValueAsResult(givenPojo);
    Cell inputKeyValue = inputResult.listCells().get(0);
    Cell convertedToRowCell = new RowCell(CellUtil.cloneRow(inputKeyValue), CellUtil.cloneFamily(inputKeyValue), CellUtil.cloneQualifier(inputKeyValue), inputKeyValue.getTimestamp(), CellUtil.cloneValue(inputKeyValue));
    Result inputResultUsingRowCells = Result.create(new Cell[]{convertedToRowCell});

    Pipeline p = TestPipeline.create();

    PCollection<Result> inputPCollection = p.apply(Create.of(inputResultUsingRowCells).withCoder(HBaseResultCoder.getInstance()));

    // When
    // -----
    PCollection<OurPOJO> output = inputPCollection.apply(ParDo.of(new ConvertFromHBaseResultToOurPOJO()));

    // Then
    // -----
    DataflowAssert.thatSingleton(output).isEqualTo(givenPojo);
    p.run();
}

To clarify, HBaseResultCoder works just fine in production when it is encoding Results that come from CloudBigtableIO, but fails when we attempt to run unit tests with Results created otherwise.

While it makes sense that this issue should never affect users of the CloudBigtableIO connector, the coder's name and the fact that it only works for certain Cell implementations are a bit surprising.

At the least, it might warrant a mention in the HBaseResultCoder documentation that it is only compatible with Results generated by the CloudBigtableIO connector.
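
For anyone else hitting this: the coercion in our workaround above can be generalized into a small helper that rewrites every cell of a Result as a RowCell before encoding. A sketch, with a helper name (`toRowCellResult`) of our own invention; the import path for RowCell depends on whether you use the shaded or unshaded artifact:

```java
import com.google.cloud.bigtable.hbase.adapters.read.RowCell;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Result;

public class RowCellCoercion {
    // Rebuild a Result so that every cell is a RowCell, whatever Cell
    // implementation (e.g. KeyValue) the original Result was built from.
    static Result toRowCellResult(Result input) {
        Cell[] source = input.rawCells();
        Cell[] converted = new Cell[source.length];
        for (int i = 0; i < source.length; i++) {
            Cell c = source[i];
            converted[i] = new RowCell(
                CellUtil.cloneRow(c),
                CellUtil.cloneFamily(c),
                CellUtil.cloneQualifier(c),
                c.getTimestamp(),
                CellUtil.cloneValue(c));
        }
        return Result.create(converted);
    }
}
```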

gsgalloway commented 7 years ago

Thanks for the tip about Beam's HBaseResultCoder; that certainly removes the need for the kludgy Cell-conversion code.

sduskis commented 7 years ago

The next release will have this fix.