kylebarron / parquet-wasm

Rust-based WebAssembly bindings to read and write Apache Parquet data
https://kylebarron.dev/parquet-wasm/
Apache License 2.0

Writing a Date column drops associated time information #413

Closed · mbostock closed this issue 6 months ago

mbostock commented 6 months ago

I think this is the same bug as https://github.com/duckdb/duckdb-wasm/issues/1231

Consider this test case:

import * as Arrow from "apache-arrow";
import * as Parquet from "parquet-wasm/node/arrow1.js";

const table = Arrow.tableFromArrays({test: [new Date("2012-01-01T12:34:56.789Z")]});
process.stdout.write(Parquet.writeParquet(Parquet.Table.fromIPCStream(Arrow.tableToIPC(table, "stream"))));

The resulting test column erroneously contains the value 2012-01-01 instead of 2012-01-01T12:34:56.789Z, dropping the associated time information.
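To make the loss concrete: both a millisecond timestamp and a `date64` value are stored as milliseconds since the Unix epoch, but date semantics keep only the midnight-UTC boundary. A stdlib-only sketch of the truncation (exact integer arithmetic, no Arrow involved):

```python
from datetime import datetime, timezone

# The instant from the test case above
instant = datetime(2012, 1, 1, 12, 34, 56, 789000, tzinfo=timezone.utc)
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Exact integer milliseconds since the Unix epoch
delta = instant - epoch
ms = delta.days * 86_400_000 + delta.seconds * 1_000 + delta.microseconds // 1_000
print(ms)  # 1325421296789

# A date-only (day-resolution) column keeps just the midnight-UTC boundary
day_ms = ms - ms % 86_400_000
print(day_ms)  # 1325376000000, i.e. 2012-01-01T00:00:00Z
```

The 45,296,789 ms of time-of-day information is exactly what gets dropped when the column is typed as a date rather than a timestamp.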

kylebarron commented 6 months ago

This is actually an upstream arrow JS bug. Here's a repro case independent of parquet:

const arrow = require('apache-arrow');
const {writeFileSync} = require('fs');

const table = arrow.tableFromArrays({
  test: [new Date("2012-01-01T12:34:56.789Z")],
});
const buffer = arrow.tableToIPC(table, 'file');
writeFileSync('table.arrow', buffer);

and then in Python:

import pyarrow.feather as feather
table = feather.read_table('table.arrow')

table.schema
# test: date64[ms] not null

table.to_pandas()
#          test
# 0  2012-01-01

Also, if you look at the field info in JS before exporting to Python, you'll see it's defined as a DateMillisecond type, which doesn't carry any time-of-day information.

> table.schema.fields[0]
Field {
  name: 'test',
  type: DateMillisecond [Date] { unit: 1 },
  nullable: false,
  metadata: Map(0) {}
}

Closing as I don't think this is related to parquet-wasm, but happy to discuss further.