mcaceresb / stata-parquet

Read and write parquet files from Stata
MIT License

Support all parquet types? #9

Open mcaceresb opened 6 years ago

mcaceresb commented 6 years ago

See arrow/cpp/src/arrow/builder.cc. For instance,

      BUILDER_CASE(UINT8, UInt8Builder);
      BUILDER_CASE(INT8, Int8Builder);
      BUILDER_CASE(UINT16, UInt16Builder);
      BUILDER_CASE(INT16, Int16Builder);
      BUILDER_CASE(UINT32, UInt32Builder);
      BUILDER_CASE(INT32, Int32Builder);
      BUILDER_CASE(UINT64, UInt64Builder);
      BUILDER_CASE(INT64, Int64Builder);
      BUILDER_CASE(DATE32, Date32Builder);
      BUILDER_CASE(DATE64, Date64Builder);
      BUILDER_CASE(TIME32, Time32Builder);
      BUILDER_CASE(TIME64, Time64Builder);
      BUILDER_CASE(TIMESTAMP, TimestampBuilder);
      BUILDER_CASE(BOOL, BooleanBuilder);
      BUILDER_CASE(HALF_FLOAT, HalfFloatBuilder);
      BUILDER_CASE(FLOAT, FloatBuilder);
      BUILDER_CASE(DOUBLE, DoubleBuilder);
      BUILDER_CASE(STRING, StringBuilder);
      BUILDER_CASE(BINARY, BinaryBuilder);
      BUILDER_CASE(FIXED_SIZE_BINARY, FixedSizeBinaryBuilder);
      BUILDER_CASE(DECIMAL, Decimal128Builder);
kylebarron commented 6 years ago

The question is what to do with data types that Stata doesn't support natively. These include:

The options are

  1. Raise an error. Not great because some of these types are written by default from, say, pandas. Datetimes are written as INT64 by default:
In [32]: df['time'] = pd.to_datetime('2018-01-02')

In [33]: df
Out[33]: 
   a       time
0  a 2018-01-02
1  b 2018-01-02
2  c 2018-01-02
3  d 2018-01-02

In [34]: df.to_parquet('test.parquet')

In [35]: pf = pq.ParquetFile('test.parquet')

In [36]: pf.schema
Out[36]: 
<pyarrow._parquet.ParquetSchema object at 0x7f824ac50a08>
a: BYTE_ARRAY UTF8
time: INT64 TIMESTAMP_MILLIS
__index_level_0__: INT64
  2. Try to coerce to a Stata-capable format (double?). Doubles still represent integers exactly up to 2^53, so the precision loss would be small.
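To make the precision point concrete, here is a minimal sketch (not code from the plugin) that checks whether an integer survives a round trip through a 64-bit float. The helper name `fits_in_double` is made up for illustration; the timestamp is the millisecond value for 2018-01-02 from the pandas example above.

```python
# Every integer with magnitude up to 2**53 is exactly representable as a double.
MAX_EXACT_INT = 2 ** 53


def fits_in_double(value: int) -> bool:
    """Return True if the integer is unchanged after conversion to float64."""
    return int(float(value)) == value


# Milliseconds since 1970-01-01 for 2018-01-02 00:00:00 UTC.
ts_millis = 1_514_851_200_000

print(fits_in_double(ts_millis))          # well below 2**53
print(fits_in_double(MAX_EXACT_INT))      # the boundary itself is exact
print(fits_in_double(MAX_EXACT_INT + 1))  # first integer that gets rounded
```

So millisecond timestamps are safe as doubles, but arbitrary INT64 values above 2^53 could be silently rounded.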
mcaceresb commented 6 years ago

I think that making them doubles is the way to go.

mcaceresb commented 6 years ago

Brownie points if it parses dates into a stata data format.

kylebarron commented 6 years ago

Brownie points if it parses dates into a stata data format.

I think the dates are from January 1, 1970, whereas in Stata they're compared to January 1, 1960, so a recomputation might be needed...
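A rough sketch of the recomputation, assuming the parquet values count from 1970-01-01 (days for dates, milliseconds for TIMESTAMP_MILLIS) and the target is Stata's %td (days since 01jan1960) or %tc (milliseconds since 01jan1960). The function names are hypothetical, not part of the plugin.

```python
from datetime import date

UNIX_EPOCH = date(1970, 1, 1)
STATA_EPOCH = date(1960, 1, 1)

# Days between the two epochs (ten years, three of them leap years).
EPOCH_OFFSET_DAYS = (UNIX_EPOCH - STATA_EPOCH).days
MS_PER_DAY = 24 * 60 * 60 * 1000
EPOCH_OFFSET_MS = EPOCH_OFFSET_DAYS * MS_PER_DAY


def unix_days_to_stata_td(unix_days: int) -> int:
    """Shift a day count from the 1970 epoch to the 1960 epoch."""
    return unix_days + EPOCH_OFFSET_DAYS


def unix_millis_to_stata_tc(unix_millis: int) -> int:
    """Shift a millisecond count from the 1970 epoch to the 1960 epoch."""
    return unix_millis + EPOCH_OFFSET_MS


print(EPOCH_OFFSET_DAYS)           # 3653
print(unix_millis_to_stata_tc(0))  # 315619200000
```

The shift is a constant, so it could just as well be applied in Stata after the raw values are loaded.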

mcaceresb commented 6 years ago

It seems that unix time is 1970, but for whatever reason Stata does 1960 (SAS?)

kylebarron commented 6 years ago

I suppose so. I didn't know SAS also had 1960 as epoch.

mcaceresb commented 6 years ago

It seems that parquet has 8 data primitives; the above are built on those primitives, so the plugin should already be able to read all of these.
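For reference, a sketch of what a primitive-to-Stata mapping could look like. The eight physical type names come from the Parquet format; the Stata side of the table is only an assumption for illustration, not what the plugin actually does (for one thing, the very top of the INT32 range does not fit in a Stata long, and INT64 values above 2^53 would lose precision as doubles).

```python
# The eight physical (primitive) types defined by the Parquet format.
PARQUET_PHYSICAL_TYPES = (
    "BOOLEAN",
    "INT32",
    "INT64",
    "INT96",
    "FLOAT",
    "DOUBLE",
    "BYTE_ARRAY",
    "FIXED_LEN_BYTE_ARRAY",
)

# Hypothetical target storage types in Stata (an assumption, not the plugin's table).
HYPOTHETICAL_STATA_TYPE = {
    "BOOLEAN": "byte",
    "INT32": "long",
    "INT64": "double",
    "INT96": "double",
    "FLOAT": "float",
    "DOUBLE": "double",
    "BYTE_ARRAY": "strL",
    "FIXED_LEN_BYTE_ARRAY": "str#",
}


def stata_type_for(physical_type: str) -> str:
    """Look up the assumed Stata storage type for a parquet physical type."""
    if physical_type not in HYPOTHETICAL_STATA_TYPE:
        raise ValueError(f"unknown parquet physical type: {physical_type}")
    return HYPOTHETICAL_STATA_TYPE[physical_type]


print(stata_type_for("INT64"))
```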

What ought to happen is that we should keep the formats somehow...

kylebarron commented 6 years ago

keep the formats somehow

What do you mean?

mcaceresb commented 6 years ago

I mean that it would be ideal to keep the display format.

kylebarron commented 6 years ago

When writing or reading?

mcaceresb commented 6 years ago

Both. Not sure if it's automagic when writing if the date type is declared, but for sure that is not the case when reading. Atm it's treated as long or double.

kylebarron commented 6 years ago

I don't know how you would keep the display format in either.

mcaceresb commented 6 years ago

I think when reading it would be more relevant than when writing. For the latter if the target is a date type I suppose that's enough.

When reading, Stata could format the variable after the data is in memory. Maybe not super crucial the more I think about it, but possibly convenient.
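Something like the following is what I have in mind, as a sketch only: look at the logical type annotation reported in the parquet schema (e.g. TIMESTAMP_MILLIS in the pandas example above) and emit a Stata `format` command for each date-like column. The mapping and the helper `format_commands` are hypothetical, and the values would still need to be shifted to the 1960 epoch for the formats to display correctly.

```python
# Assumed mapping from parquet logical annotations to Stata display formats.
HYPOTHETICAL_DISPLAY_FORMAT = {
    "DATE": "%td",              # days -> Stata daily date
    "TIMESTAMP_MILLIS": "%tc",  # milliseconds -> Stata datetime
}


def format_commands(schema: dict) -> list:
    """Build Stata `format` commands for columns with a known date-like type.

    `schema` maps column names to their logical annotation (or None).
    """
    commands = []
    for name, logical_type in schema.items():
        fmt = HYPOTHETICAL_DISPLAY_FORMAT.get(logical_type)
        if fmt is not None:
            commands.append(f"format {name} {fmt}")
    return commands


# Schema from the pandas example earlier in the thread.
example_schema = {"a": "UTF8", "time": "TIMESTAMP_MILLIS", "__index_level_0__": None}
print(format_commands(example_schema))
```

Columns without a date-like annotation are left alone, so they would keep whatever default format Stata assigns.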

kylebarron commented 6 years ago

Well dates should be possible because there's a date type in Parquet... But you won't be able to retrieve like %8.0g or whatever