Open mcaceresb opened 6 years ago
The question is what to do with data types that Stata doesn't support natively. These include:
The options are
In [32]: df['time'] = pd.to_datetime('2018-01-02')
In [33]: df
Out[33]:
a time
0 a 2018-01-02
1 b 2018-01-02
2 c 2018-01-02
3 d 2018-01-02
In [34]: df.to_parquet('test.parquet')
In [35]: pf = pq.ParquetFile('test.parquet')
In [36]: pf.schema
Out[36]:
<pyarrow._parquet.ParquetSchema object at 0x7f824ac50a08>
a: BYTE_ARRAY UTF8
time: INT64 TIMESTAMP_MILLIS
__index_level_0__: INT64
I think that making them doubles is the way to go.
Brownie points if it parses dates into a stata data format.
I think the dates are from January 1, 1970, whereas in Stata they're compared to January 1, 1960, so a recomputation might be needed...
It seems that unix time is 1970, but for whatever reason Stata does 1960 (SAS?)
I suppose so. I didn't know SAS also had 1960 as epoch.
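The recomputation mentioned above amounts to shifting by the gap between the two epochs. A minimal sketch, assuming the Parquet column holds milliseconds since 1970-01-01 (as TIMESTAMP_MILLIS does) and that the target is Stata's millisecond datetime (%tc) or daily date (%td), both counted from 1960-01-01. The function names are made up for illustration and are not part of the plugin:

```python
from datetime import datetime

MS_PER_DAY = 86_400_000

# Gap between the Stata epoch (1960-01-01) and the Unix epoch (1970-01-01).
# Ten years with three leap days (1960, 1964, 1968) gives 3653 days.
EPOCH_GAP_DAYS = (datetime(1970, 1, 1) - datetime(1960, 1, 1)).days
EPOCH_GAP_MS = EPOCH_GAP_DAYS * MS_PER_DAY


def unix_ms_to_stata_tc(unix_ms):
    """Milliseconds since 1970-01-01 -> milliseconds since 1960-01-01 (%tc)."""
    return unix_ms + EPOCH_GAP_MS


def unix_ms_to_stata_td(unix_ms):
    """Milliseconds since 1970-01-01 -> whole days since 1960-01-01 (%td)."""
    # Floor division keeps pre-1970 timestamps on the correct calendar day.
    return unix_ms // MS_PER_DAY + EPOCH_GAP_DAYS


# The timestamp used in the example above, 2018-01-02 00:00:00.
unix_ms = (datetime(2018, 1, 2) - datetime(1970, 1, 1)).days * MS_PER_DAY
stata_tc = unix_ms_to_stata_tc(unix_ms)
stata_td = unix_ms_to_stata_td(unix_ms)
```

Since a Stata double represents integers exactly up to 2^53, millisecond values of this size survive the conversion to double without loss.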
It seems that parquet has 8 data primitives; the above are built on those primitives, so the plugin should already be able to read all of these.
What ought to happen is that we should keep the formats somehow...
keep the formats somehow
What do you mean?
I mean that it would be ideal to keep the display format.
When writing or reading?
Both. Not sure if it's automagic when writing if the date type is declared, but for sure that is not the case when reading. Atm it's treated as long or double.
I don't know how you would keep the display format in either.
I think it would be more relevant when reading than when writing. For the latter, if the target is a date type I suppose that's enough.
When reading, Stata could format the variable after the data is in memory. Maybe not super crucial the more I think about it, but possibly convenient.
Well, dates should be possible because there's a date type in Parquet... But you won't be able to retrieve like %8.0g or whatever
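One way to picture the "format after the data is in memory" idea: the reader could look at a column's Parquet logical type and pick a Stata date format plus the matching epoch shift. This is only a sketch under assumptions. The function `suggest_stata_format` and its return shape are hypothetical, and it covers only DATE, TIMESTAMP_MILLIS and TIMESTAMP_MICROS. Numeric formats such as %8.0g have no counterpart in the Parquet schema, so anything else falls through to `None`:

```python
MS_PER_DAY = 86_400_000
EPOCH_GAP_DAYS = 3653  # days from 1960-01-01 (Stata) to 1970-01-01 (Unix)
EPOCH_GAP_MS = EPOCH_GAP_DAYS * MS_PER_DAY


def suggest_stata_format(logical_type):
    """Return (stata_format, converter) for a Parquet logical type name.

    Hypothetical helper: returns None when no date format applies, in which
    case the column would keep Stata's default numeric display format.
    """
    if logical_type == "DATE":
        # Parquet DATE is days since the Unix epoch; %td is days since 1960.
        return "%td", lambda days: days + EPOCH_GAP_DAYS
    if logical_type == "TIMESTAMP_MILLIS":
        # %tc is milliseconds since 1960-01-01 00:00:00.000.
        return "%tc", lambda ms: ms + EPOCH_GAP_MS
    if logical_type == "TIMESTAMP_MICROS":
        # %tc has millisecond resolution, so sub-millisecond detail is dropped.
        return "%tc", lambda us: us // 1000 + EPOCH_GAP_MS
    return None


fmt, convert = suggest_stata_format("TIMESTAMP_MILLIS")
converted = convert(0)  # the Unix epoch expressed as a %tc value
```

After loading, the Stata side would then only need to apply the suggested format to the variable, for example with `format time %tc`.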
See arrow/cpp/src/arrow/builder.cc. For instance,