manojkarthick / pqrs

Command line tool for inspecting Parquet files
Apache License 2.0

Timestamp CSV conversion issue #47

Open CalderWhite opened 1 year ago

CalderWhite commented 1 year ago

There is an issue when running pqrs cat --csv [infile] without timestamp objects. For me they are all set to 1970-01-01. But when I cat with json, it is fine. parquet-tools can convert it to csv fine.

I suspect something with an integer overflow?

SteveLauC commented 8 months ago

> There is an issue when running pqrs cat --csv [infile] without timestamp objects

Do you mean with timestamps?

> For me they are all set to 1970-01-01. But when I cat with json, it is fine. parquet-tools can convert it to csv fine.

Could you provide the data that causes this bug? Specifically, the type of the timestamp field and its value.


I tried to reproduce this issue by creating a Parquet file with the following code:

use datafusion::{
    arrow::{
        array::TimestampSecondArray,
        datatypes::{DataType, Field, Schema, TimeUnit},
        record_batch::RecordBatch,
    },
    parquet::arrow::ArrowWriter,
};
use std::{fs::OpenOptions, sync::Arc};

// Nothing here is awaited, so a plain synchronous `main` is enough.
fn main() {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "timestamp",
        DataType::Timestamp(TimeUnit::Second, None),
        true,
    )]));
    // One value: 1709090622 seconds since the Unix epoch (2024-02-28T03:23:42 UTC).
    let timestamp_column = Arc::new(TimestampSecondArray::from(vec![1709090622]));
    let batch = RecordBatch::try_new(Arc::clone(&schema), vec![timestamp_column]).unwrap();

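    // Truncate so that stale bytes from an older, longer file cannot trail the new data.
    let file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(true)
        .open("test.parquet")
        .unwrap();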
    let mut writer = ArrowWriter::try_new(file, Arc::clone(&schema), None).unwrap();
    writer.write(&batch).unwrap();

    writer.close().unwrap();
}

But as you can see, the timestamp is printed correctly in both the default and the CSV output:

$ cargo r -q
$ l test.parquet
Permissions Links Size User  Group Date Modified Name
.rw-r--r--@     1  580 steve steve 28 Feb 11:24  test.parquet

$ pqrs --version
pqrs 0.3.1

$ pqrs cat test.parquet

##################
File: test.parquet
##################

{timestamp: 1709090622}

$ pqrs cat --csv test.parquet

##################
File: test.parquet
##################

timestamp
2024-02-28T03:23:42.000000000
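
Since the CSV output above formats the raw value 1709090622 as a date, it may help to see how much the result depends on the unit the raw integer is read in. The following is a minimal, std-only sketch and not pqrs's actual implementation; `Unit`, `format_raw`, `format_epoch_nanos`, and `civil_from_days` are hypothetical names made up for this illustration. It only shows that reading the same integer in a finer unit than it was written in collapses the date to 1970-01-01. Whether that is what happens in the reporter's file is an unconfirmed assumption; the thread does not establish the cause.

```rust
/// Hypothetical unit tag, loosely mirroring the usual timestamp time units.
#[derive(Clone, Copy)]
enum Unit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Convert days since 1970-01-01 into (year, month, day) in the
/// proleptic Gregorian calendar (standard days-to-civil algorithm).
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // 0..=399
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, March-based
    let mp = (5 * doy + 2) / 153; // month index, March = 0
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}

/// Format nanoseconds since the Unix epoch as a UTC timestamp string
/// with nine fractional digits.
fn format_epoch_nanos(nanos: i128) -> String {
    let secs = nanos.div_euclid(1_000_000_000) as i64;
    let frac = nanos.rem_euclid(1_000_000_000) as u32;
    let (year, month, day) = civil_from_days(secs.div_euclid(86_400));
    let sod = secs.rem_euclid(86_400); // second of day
    format!(
        "{:04}-{:02}-{:02}T{:02}:{:02}:{:02}.{:09}",
        year,
        month,
        day,
        sod / 3_600,
        sod % 3_600 / 60,
        sod % 60,
        frac
    )
}

/// Interpret `raw` in the given unit and format it as a UTC timestamp.
fn format_raw(raw: i64, unit: Unit) -> String {
    let scale: i128 = match unit {
        Unit::Second => 1_000_000_000,
        Unit::Millisecond => 1_000_000,
        Unit::Microsecond => 1_000,
        Unit::Nanosecond => 1,
    };
    format_epoch_nanos(raw as i128 * scale)
}
```

With these helpers, `format_raw(1709090622, Unit::Second)` yields `2024-02-28T03:23:42.000000000`, matching the CSV output above, while `format_raw(1709090622, Unit::Nanosecond)` yields `1970-01-01T00:00:01.709090622`. Knowing the exact type and unit of the timestamp column in the failing file would show whether a mismatch of this kind is even possible there.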