ironSource / parquetjs

fully asynchronous, pure JavaScript implementation of the Parquet file format
MIT License

Adding repeated properties to schema results in corrupt parquet file. #67

Closed by dylandepass 6 years ago

dylandepass commented 6 years ago

Version 0.8.0

I'm having issues with `repeated` fields. The resulting Parquet file appears to be corrupt; reading it fails with:

org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/PATHTOFILE/profile.parquet

Here is the code I'm testing with. The `identities` object is what causes the problem.



```js
const parquet = require('parquetjs');

let schema = new parquet.ParquetSchema({
    person: {
        repeated: false,
        fields: {
            firstName: {
                type: 'UTF8'
            },
            lastName: {
                type: 'UTF8'
            }
        }
    },
    identities: {
        repeated: true,
        fields: {
            id: {
                type: 'UTF8'
            },
            xid: {
                type: 'UTF8'
            }
        }
    }
});

async function writeToParquet(schema) {
    // create a new ParquetWriter that writes to 'profile.parquet'
    var writer = await parquet.ParquetWriter.openFile(schema, 'profile.parquet');

    // appendRow is asynchronous, so wait for it before closing the writer
    await writer.appendRow({
        person: {
            firstName: "Test",
            lastName: "User"
        },
        identities: [{
            id: "ID",
            xid: "XID"
        },{
            id: "ID",
            xid: "XID"
        }]
    });

    await writer.close();
}

// report write failures instead of leaving the promise rejection unhandled
writeToParquet(schema).catch(console.error);
```

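
For context, here is a minimal sketch of how a repeated group such as `identities` is flattened into a column of values plus repetition and definition levels, following the Parquet (Dremel) record shredding model. The helper `shredRepeatedField` is hypothetical and not part of parquetjs. It assumes a single top-level repeated group whose fields are all required, so both levels have a maximum of 1.

```javascript
// Hypothetical helper, not part of parquetjs: shreds one field of a
// top-level repeated group into values plus repetition/definition levels.
// Assumes the field itself is required (no `optional: true`).
function shredRepeatedField(rows, listName, fieldName) {
    const values = [];
    const rLevels = [];
    const dLevels = [];
    for (const row of rows) {
        const items = row[listName] || [];
        if (items.length === 0) {
            // Empty list: one placeholder entry that carries no value.
            rLevels.push(0);
            dLevels.push(0);
            continue;
        }
        items.forEach((item, i) => {
            values.push(item[fieldName]);
            rLevels.push(i === 0 ? 0 : 1); // 0 starts a new row, 1 continues the list
            dLevels.push(1);               // the repeated group is present
        });
    }
    return { values, rLevels, dLevels };
}

// The row from the report above.
const row = {
    person: { firstName: "Test", lastName: "User" },
    identities: [{ id: "ID", xid: "XID" }, { id: "ID", xid: "XID" }]
};

const shredded = shredRepeatedField([row], 'identities', 'id');
console.log(shredded);
```

A non-repeated, required column such as `person.firstName` has a maximum level of 0 for both, so it needs no level data at all. That is one reason a bug in level encoding would show up only once a repeated field is added.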
ZJONSSON commented 6 years ago

There is a bug in the RLE encoding that has probably been fixed in https://github.com/ironSource/parquetjs/pull/57, which is not merged yet. See also the parquet-mr tests (rebased onto the fix) in https://github.com/ironSource/parquetjs/pull/56

You can check out the PR branch by installing the last commit in the PR:

npm install zjonsson/parquetjs#07fb2fd8fc03bf2b57243531eaf91f2d60f5e460
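
To illustrate where RLE comes into play: in Parquet, repetition and definition levels are stored with the RLE/bit-packing hybrid encoding. The thread does not say what exactly was wrong in the parquetjs encoder, so the sketch below only shows the format itself, not the fix. `encodeRleRuns` is a hypothetical function that emits RLE runs only (no bit-packed runs) and omits the length prefix that a data page puts in front of the level bytes.

```javascript
// Unsigned LEB128 varint, used for run headers in the hybrid encoding.
function encodeVarint(n) {
    const bytes = [];
    while (n > 0x7f) {
        bytes.push((n & 0x7f) | 0x80);
        n >>>= 7;
    }
    bytes.push(n);
    return bytes;
}

// Hypothetical encoder, not the parquetjs implementation: writes each run of
// equal levels as <varint(runLength << 1)> <value in ceil(bitWidth / 8) bytes>.
function encodeRleRuns(levels, bitWidth) {
    const valueBytes = Math.ceil(bitWidth / 8);
    const out = [];
    let i = 0;
    while (i < levels.length) {
        let j = i;
        while (j < levels.length && levels[j] === levels[i]) j++;
        const runLength = j - i;
        out.push(...encodeVarint(runLength << 1)); // low bit 0 marks an RLE run
        for (let b = 0; b < valueBytes; b++) {
            out.push((levels[i] >>> (8 * b)) & 0xff); // value bytes, little-endian
        }
        i = j;
    }
    return Uint8Array.from(out);
}

// Levels of `identities.id` for a row with two identities (bit width 1).
const repetitionBytes = encodeRleRuns([0, 1], 1);
const definitionBytes = encodeRleRuns([1, 1], 1);
console.log(repetitionBytes, definitionBytes);
```

A reader such as parquet-mr decodes these bytes before it can assemble any value, so malformed level data can make a file fail on the very first value.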

ZJONSSON commented 6 years ago

See also https://github.com/ironSource/parquetjs/pull/43

dylandepass commented 6 years ago

Appreciate the help, that fixed my issue!