ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

Potential discrepancy in shred/materialize #48

Closed ZJONSSON closed 6 years ago

ZJONSSON commented 6 years ago

Reading the spec carefully, I see the following paragraphs:

One important thing to remember to understand the examples is that not every level of the tree needs a new definition or repetition level. Only repeated fields increment the repetition level, only non-required fields increment the definition level. As those levels are very small bounded values they can be encoded efficiently using a few bits.

Required fields are always defined and do not need a definition level. Non repeated fields do not need a repetition level.
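The rule quoted above can be sketched as a small function. This is only an illustration: `maxLevels` and the `repetition` property are hypothetical names, not part of the parquetjs API, and the sketch assumes each path element carries one of the three Parquet repetition types (`required`, `optional`, `repeated`). Read literally, "non-required" covers both `optional` and `repeated`.

```javascript
// Hypothetical helper for illustration; not part of the parquetjs API.
// Each path element is assumed to carry a Parquet repetition type:
// 'required', 'optional' or 'repeated'.
function maxLevels(path) {
  let rLevelMax = 0;
  let dLevelMax = 0;
  for (const field of path) {
    // Only repeated fields increment the repetition level.
    if (field.repetition === 'repeated') rLevelMax += 1;
    // Only non-required fields increment the definition level.
    if (field.repetition !== 'required') dLevelMax += 1;
  }
  return { rLevelMax, dLevelMax };
}

// A path of required fields needs neither level.
const allRequired = maxLevels([
  { name: 'a', repetition: 'required' },
  { name: 'b', repetition: 'required' },
]);
// allRequired is { rLevelMax: 0, dLevelMax: 0 }

// required -> optional -> repeated
const mixed = maxLevels([
  { name: 'a', repetition: 'required' },
  { name: 'b', repetition: 'optional' },
  { name: 'c', repetition: 'repeated' },
]);
// mixed is { rLevelMax: 1, dLevelMax: 2 }
```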

This means that for any path to a leaf node in which every path element has optional: false, the definition level can only be zero (each additional definition level requires an optional: true somewhere in the path).

However when I look at one of the materialize tests in parquetjs I see:

    var schema = new parquet.ParquetSchema({
      name: { type: 'UTF8' },
      stock: {
        repeated: true,
        fields: {
          quantity: { type: 'INT64', repeated: true },
          warehouse: { type: 'UTF8' },
        }
      },
      price: { type: 'DOUBLE' },
    });

    buffer.columnData[['stock', 'quantity']] = {
      dlevels: [2, 2, 2, 2, 0, 1],
      rlevels: [0, 1, 0, 2, 0, 0],
      values: [10, 20, 50, 75],
      count: 6
    };

Nothing in the path is optional, yet many of the dlevels are non-zero. If I change the dlevels to all zeros, the quantity data is not populated in the resulting records.

Is it possible that there is a discrepancy in the implementation?

ZJONSSON commented 6 years ago

I think I got it. A repeated field also counts toward the definition level, because the list can be empty. So a repeated field is essentially always optional, in the sense that it may have no elements.
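To make that reading concrete, here is a minimal sketch of how the stock.quantity levels from the test could be decoded. It is not the parquetjs materialize implementation: `materializeRepeatedColumn` is a hypothetical helper that assumes every field on the column path is repeated (as stock and quantity are), and it returns plain nested arrays rather than named records.

```javascript
// Illustrative sketch only; not the parquetjs materialize implementation.
// Assumes every field on the column path is repeated, so a path of
// `depth` fields has max definition level = max repetition level = depth.
function materializeRepeatedColumn(column, depth) {
  const records = [];
  // open[k] is the list currently being filled at nesting depth k;
  // open[0] is the list of records itself.
  const open = [records];
  let valueIndex = 0;

  for (let i = 0; i < column.count; i++) {
    const r = column.rlevels[i];
    const d = column.dlevels[i];
    // A new element starts in the list at depth r (r = 0 starts a new record).
    for (let k = r; k <= depth; k++) {
      // Records always exist; a deeper list only gets an element
      // if the definition level reaches that depth.
      if (k > 0 && d < k) break;
      if (k < depth) {
        const child = [];
        open[k].push(child);
        open[k + 1] = child;
      } else {
        open[k].push(column.values[valueIndex++]);
      }
    }
  }
  return records;
}

const quantityColumn = {
  dlevels: [2, 2, 2, 2, 0, 1],
  rlevels: [0, 1, 0, 2, 0, 0],
  values: [10, 20, 50, 75],
  count: 6
};

// Outer array: records; middle: stock entries; inner: quantity values.
const decoded = materializeRepeatedColumn(quantityColumn, 2);
// decoded is [[[10], [20]], [[50, 75]], [], [[]]]

// With all dlevels set to zero, no stock entry is ever defined.
const zeroed = materializeRepeatedColumn(
  Object.assign({}, quantityColumn, { dlevels: [0, 0, 0, 0, 0, 0] }), 2);
// zeroed is [[], [], [], []]
```

Under this interpretation, a dlevel of 0 means the stock list is empty, 1 means a stock entry exists but its quantity list is empty, and 2 means a quantity value is present. That matches the observation that setting every dlevel to zero leaves quantity unpopulated.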