jhorstmann / compact-thrift

Thrift IDL parser and code generator for the compact protocol
Apache License 2.0

Investigate boolean encoding in vectors #7

Open jhorstmann opened 4 months ago

jhorstmann commented 4 months ago

The spec says

Field values are encoded directly in the field header. Element values of type bool are sent as an int8; true as 1 and false as 0.

and

The following element-types are used (see note below):

BOOL, encoded as 2

However, the ColumnIndex::null_pages field in alltypes_tiny_pages.parquet, written by parquet-mr version 1.12.0-SNAPSHOT (build 6901a2040848c6b37fa61f4b0a76246445f396db), appears to encode the element type as 1 and to contain elements with value 2.

We probably need to be lenient and support both: accept element type 1 as well as 2, and decode element values as byte_value == 1, so that both 0 and 2 read as false.
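
A minimal sketch of that lenient rule, written in Java only for illustration. The class, method, and constant names are hypothetical and not part of this crate or of any Thrift library. It assumes the element type nibble and the element bytes have already been read from the input.

```java
final class LenientBoolElements {
    // Element type for bool as documented in the spec (hypothetical constant name).
    static final int ELEMENT_TYPE_BOOL_SPEC = 2;
    // Element type observed in the parquet-mr file (hypothetical constant name).
    static final int ELEMENT_TYPE_BOOL_OBSERVED = 1;

    /** Returns true if a list/set element type should be treated as bool. */
    static boolean isBoolElementType(int elementType) {
        return elementType == ELEMENT_TYPE_BOOL_SPEC
            || elementType == ELEMENT_TYPE_BOOL_OBSERVED;
    }

    /** Decodes one element byte: 1 is true; 0 (spec) and 2 (observed) are false. */
    static boolean decodeBoolElement(byte value) {
        return value == 1;
    }

    /** Decodes a run of bool elements from bytes that were already read. */
    static boolean[] decodeBoolElements(byte[] payload) {
        boolean[] out = new boolean[payload.length];
        for (int i = 0; i < payload.length; i++) {
            out[i] = decodeBoolElement(payload[i]);
        }
        return out;
    }
}
```

Note that this sketch also maps any other byte value to false. A stricter variant could reject values outside 0, 1, and 2 instead.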

jhorstmann commented 4 months ago

The Java implementation seems to clearly contradict the spec here:

  public void writeBool(boolean b) throws TException {
    if (booleanField_ != null) {
      // we haven't written the field header yet
      writeFieldBeginInternal(booleanField_, b ? Types.BOOLEAN_TRUE : Types.BOOLEAN_FALSE);
      booleanField_ = null;
    } else {
      // we're not part of a field, so just write the value.
      writeByteDirect(b ? Types.BOOLEAN_TRUE : Types.BOOLEAN_FALSE);
    }
  }

And here, in the lookup table that is read by getCompactType and reached through writeCollectionBegin:

  private static final byte[] ttypeToCompactType = new byte[18];

  static {
    ttypeToCompactType[TType.STOP] = TType.STOP;
    ttypeToCompactType[TType.BOOL] = Types.BOOLEAN_TRUE;
    ...
  }
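
To make the mismatch concrete, here is a rough sketch comparing the bytes each convention would produce for a short list<bool>. It assumes the short-form list header from the compact protocol (size in the high nibble, element type in the low nibble, for sizes up to 14), and it assumes Types.BOOLEAN_TRUE is 1 and Types.BOOLEAN_FALSE is 2, which matches the values observed in the file. The class and method names are hypothetical.

```java
final class BoolListEncodingSketch {
    /** Short-form list header: size (0..14) in the high nibble, element type in the low nibble. */
    static byte shortListHeader(int size, int elementType) {
        if (size < 0 || size > 14) {
            throw new IllegalArgumentException("short form only covers sizes 0..14");
        }
        return (byte) ((size << 4) | (elementType & 0x0f));
    }

    /** Encoding as the spec text describes it: element type 2, true as 1, false as 0. */
    static byte[] encodeAsSpec(boolean[] values) {
        byte[] out = new byte[values.length + 1];
        out[0] = shortListHeader(values.length, 2);
        for (int i = 0; i < values.length; i++) {
            out[i + 1] = (byte) (values[i] ? 1 : 0);
        }
        return out;
    }

    /** Encoding as the quoted Java writer appears to do it: element type 1, true as 1, false as 2. */
    static byte[] encodeAsJavaWriter(boolean[] values) {
        byte[] out = new byte[values.length + 1];
        out[0] = shortListHeader(values.length, 1);
        for (int i = 0; i < values.length; i++) {
            out[i + 1] = (byte) (values[i] ? 1 : 2);
        }
        return out;
    }
}
```

For the list [true, false], the spec-style encoding would be 0x22 0x01 0x00, while the Java-style encoding would be 0x21 0x01 0x02. The two agree only on the byte used for true.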