kylebarron / parquet-wasm

Rust-based WebAssembly bindings to read and write Apache Parquet data
https://kylebarron.dev/parquet-wasm/
Apache License 2.0

Writing a Date column drops associated time information #413

Closed · mbostock closed this issue 6 months ago

mbostock commented 6 months ago

I think this is the same bug as https://github.com/duckdb/duckdb-wasm/issues/1231

Consider this test case:

import * as Arrow from "apache-arrow";
import * as Parquet from "parquet-wasm/node/arrow1.js";

const table = Arrow.tableFromArrays({test: [new Date("2012-01-01T12:34:56.789Z")]});
process.stdout.write(Parquet.writeParquet(Parquet.Table.fromIPCStream(Arrow.tableToIPC(table, "stream"))));

The resulting test column erroneously contains the value 2012-01-01 instead of 2012-01-01T12:34:56.789Z, dropping the associated time information.
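To make the loss concrete: both a millisecond timestamp and a `date64` value are stored as milliseconds since the Unix epoch, but date semantics keep only the midnight-UTC boundary. A stdlib-only sketch of the truncation (exact integer arithmetic, no Arrow involved):

```python
from datetime import datetime, timezone

# The instant from the test case above
instant = datetime(2012, 1, 1, 12, 34, 56, 789000, tzinfo=timezone.utc)
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Exact integer milliseconds since the Unix epoch
delta = instant - epoch
ms = delta.days * 86_400_000 + delta.seconds * 1_000 + delta.microseconds // 1_000
print(ms)  # 1325421296789

# A date-only (day-resolution) column keeps just the midnight-UTC boundary
day_ms = ms - ms % 86_400_000
print(day_ms)  # 1325376000000, i.e. 2012-01-01T00:00:00Z
```

The 45,296,789 ms of time-of-day information is exactly what gets dropped when the column is typed as a date rather than a timestamp.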

kylebarron commented 6 months ago

This is actually an upstream arrow JS bug. Here's a repro case independent of parquet:

const arrow = require('apache-arrow');
const {writeFileSync} = require('fs');

const table = arrow.tableFromArrays({
  test: [new Date("2012-01-01T12:34:56.789Z")],
});
const buffer = arrow.tableToIPC(table, 'file');
writeFileSync('table.arrow', buffer);

and then in Python:

import pyarrow.feather as feather
table = feather.read_table('table.arrow')

table.schema
# test: date64[ms] not null

table.to_pandas()
#          test
# 0  2012-01-01

Also, if you look at the field info in JS before exporting to Python, you'll see it's defined as a DateMillisecond type, which doesn't carry any time-of-day information.

> table.schema.fields[0]
Field {
  name: 'test',
  type: DateMillisecond [Date] { unit: 1 },
  nullable: false,
  metadata: Map(0) {}
}

Closing as I don't think this is related to parquet-wasm, but happy to discuss further.