jcrobak / parquet-python

python implementation of the parquet columnar file format.
Apache License 2.0

Hive-partitioned parquet files are broken #66

Closed: Spacerat closed this issue 6 years ago

Spacerat commented 6 years ago

Due to this line (I think!): https://github.com/dask/fastparquet/blob/master/fastparquet/core.py#L347

The following code:

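import fastparquet
import pandas as pd

# Write a hive-partitioned dataset, partitioned on a column of awkward strings.
fastparquet.write('test.parquet', pd.DataFrame({
    'literal': ['40+2', '1e-10', '"5"', "2018-10-09", "2018-10-10"],
    'idx': [1, 2, 3, 4, 5]
}), partition_on=['literal'], file_scheme='hive')

# Read it back and inspect the 'literal' partition column.
fastparquet.ParquetFile('test.parquet').to_pandas()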

produces the following output:

[screenshot of the resulting DataFrame, taken 2018-05-15 at 9:56:13 pm]

I would make a PR to fix this, but I can't really fathom what the intention was here. Do you need fastparquet to parse certain partition values as literals for some reason, or can I just remove the function call?
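To make the concern concrete, here is a minimal sketch of how literal-style parsing can alter partition strings. It assumes the conversion behaves roughly like Python's `ast.literal_eval`, which this thread does not confirm; `parse_as_literal` is a hypothetical helper, not fastparquet's code.

```python
import ast

def parse_as_literal(value):
    """Hypothetical helper: try to read a partition string as a Python literal."""
    try:
        return ast.literal_eval(value)
    except (ValueError, TypeError, SyntaxError):
        # Not a valid literal: keep the original string untouched.
        return value

# The partition values from the reproduction above.
for raw in ['40+2', '1e-10', '"5"', '2018-10-09', '2018-10-10']:
    parsed = parse_as_literal(raw)
    print(repr(raw), '->', repr(parsed), type(parsed).__name__)
```

Under this assumption, '1e-10' comes back as a float and '"5"' loses its quotes, so neither round-trips as the string that was written. How the arithmetic-looking strings such as '40+2' are handled varies across Python versions.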

Spacerat commented 6 years ago

Oh, actually, I can see now why it is this way: it rematerializes the dataframe so that the underlying values of the categorical column generated for the partition key are (hopefully) of the type you expect.

I guess what I’m wondering is: since the column ends up as a categorical anyway, does it really matter if it’s backed by strings or integers/dates/etc? And since the data doesn’t actually exist in the parquet files themselves, is it right to try to pretend it does?

My answer to both questions would be no, but if you disagree then I think the val_to_num function needs to be made more robust, e.g. by parsing values as dates first and parsing literals in a more deliberate way (or, at the very least, whitelisting or blacklisting certain characters before parsing).
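One way the "dates first, then deliberate literals" idea could look is sketched below. This is only an illustration under stated assumptions: `parse_partition_value` and the three patterns are hypothetical names, and the choice to accept only ISO dates, plain integers, and plain floats is mine, not something fastparquet defines.

```python
import re
from datetime import datetime

# Whitelist patterns (assumed for illustration): anything else stays a string.
_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
_INT = re.compile(r"[+-]?[0-9]+")
_FLOAT = re.compile(r"[+-]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][+-]?[0-9]+)?")

def parse_partition_value(value):
    """Hypothetical stricter converter for a partition-key string."""
    # 1. Dates first, so '2018-10-09' is never read as arithmetic.
    if _ISO_DATE.fullmatch(value):
        try:
            return datetime.strptime(value, "%Y-%m-%d").date()
        except ValueError:
            pass  # Looks like a date but is not a valid one; fall through.
    # 2. Numbers only when the whole string is a plain numeric literal.
    if _INT.fullmatch(value):
        return int(value)
    if _FLOAT.fullmatch(value):
        return float(value)
    # 3. Everything else (expressions, quoted text, ...) is returned unchanged.
    return value

for raw in ['40+2', '1e-10', '"5"', '2018-10-09', '2018-10-10']:
    print(repr(raw), '->', repr(parse_partition_value(raw)))
```

With this sketch, '40+2' and '"5"' stay as the strings they were written as, while the two date strings become `datetime.date` objects. Whether '1e-10' should become a float at all is exactly the judgment call raised above; dropping the float branch would keep it as a string.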

jcrobak commented 6 years ago

@Spacerat I think you meant to file this issue at https://github.com/dask/fastparquet/ ?

Spacerat commented 6 years ago

Oh gosh, yeah I did...

Spacerat commented 6 years ago

Re-opened here: https://github.com/dask/fastparquet/issues/335