jorgecarleitao / arrow2

Transmute-free Rust library to work with the Arrow format
Apache License 2.0

Cannot read parquet file with arrow2 but can with pyarrow #1473

Closed twitu closed 1 year ago

twitu commented 1 year ago

I have a parquet file for which the arrow2-0.17.0 parquet reader returns no data.

I created the file using pyarrow, and I have double-checked that both pyarrow and datafusion can read it. I've also verified that the arrow2 reader loads the metadata and schema correctly, but the reader returns no chunks.

Unlike #1370, the schema for my file is pretty simple.

Schema {
    fields: [
        Field {
            name: "bid",
            data_type: Int64,
            is_nullable: true,
            metadata: {},
        },
        Field {
            name: "ask",
            data_type: Int64,
            is_nullable: true,
            metadata: {},
        },
        Field {
            name: "bid_size",
            data_type: UInt64,
            is_nullable: true,
            metadata: {},
        },
        Field {
            name: "ask_size",
            data_type: UInt64,
            is_nullable: true,
            metadata: {},
        },
        Field {
            name: "ts_event",
            data_type: UInt64,
            is_nullable: true,
            metadata: {},
        },
        Field {
            name: "ts_init",
            data_type: UInt64,
            is_nullable: true,
            metadata: {},
        },
    ],
}

And even the row groups are read correctly.

&row_groups = [
    RowGroupMetaData {
        columns: [
            ColumnChunkMetaData {
                column_chunk: ColumnChunk {
                    file_path: None,
                    file_offset: 6590314,
                    meta_data: Some(
                        ColumnMetaData {
                            type_: Type(
                                2,
                            ),
                            encodings: [
                                Encoding(
                                    8,
                                ),
                                Encoding(
                                    0,
                                ),
                                Encoding(
                                    3,
                                ),
                            ],
                            path_in_schema: [
                                "bid",
                            ],
                            codec: CompressionCodec(
                                1,
                            ),
                            num_values: 9689614,
                            total_uncompressed_size: 7215695,
                            total_compressed_size: 6590310,
                            key_value_metadata: None,
                            data_page_offset: 227,
                            index_page_offset: None,
                            dictionary_page_offset: Some(
                                4,
                            ),

I'm not sure what is going wrong here. Do you have any suggestions?
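One clue in the debug output above is `codec: CompressionCodec(1)`: in the parquet-format thrift definition, codec id 1 is Snappy, so the column chunks are Snappy-compressed. A std-only sketch of that mapping (the `codec_name` helper is hypothetical, for illustration only):

```rust
// Map parquet-format thrift CompressionCodec ids to codec names,
// per the parquet-format specification's enum values.
fn codec_name(id: i32) -> &'static str {
    match id {
        0 => "UNCOMPRESSED",
        1 => "SNAPPY",
        2 => "GZIP",
        3 => "LZO",
        4 => "BROTLI",
        5 => "LZ4",
        6 => "ZSTD",
        7 => "LZ4_RAW",
        _ => "UNKNOWN",
    }
}

fn main() {
    // The issue's debug output shows `codec: CompressionCodec(1)`.
    println!("{}", codec_name(1)); // prints "SNAPPY"
}
```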

twitu commented 1 year ago

Here's the test data file with the same schema and 10 records.

test_data.parquet.zip

twitu commented 1 year ago

This test fails.

use std::fs::File;

use arrow2::io::parquet::read::{self, FileReader};

#[test]
fn arrow2_test() {
    let mut reader = File::open("test_data.parquet").expect("Unable to open given file");
    let metadata = read::read_metadata(&mut reader).expect("Unable to read metadata");
    let schema = read::infer_schema(&metadata).expect("Unable to infer schema");
    let mut fr = FileReader::new(
        reader,
        metadata.row_groups,
        schema,
        Some(1000),
        None,
        None,
    );
    assert!(fr.next().is_some())
}
twitu commented 1 year ago

This is a non-issue. I had to enable the io_parquet_compression feature to get this working.
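For anyone hitting the same thing: in arrow2, parquet decompression codecs such as Snappy sit behind the `io_parquet_compression` feature flag, so the dependency needs roughly this in Cargo.toml (exact feature list is an assumption; check the crate's own Cargo.toml for your version):

```toml
[dependencies]
arrow2 = { version = "0.17", features = ["io_parquet", "io_parquet_compression"] }
```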