Closed: Spacerat closed this issue 6 years ago
Oh, actually, I can see why it works this way now: the partition values are parsed when the dataframe is rematerialized, so that the underlying values of the categorical column generated for the partition key are (hopefully) the type you expect.
I guess what I’m wondering is: since the column ends up as a categorical anyway, does it really matter if it’s backed by strings or integers/dates/etc? And since the data doesn’t actually exist in the parquet files themselves, is it right to try to pretend it does?
My answer to both of the questions above would be no, but if you disagree then I think the val_to_num function needs to be made a bit more robust, e.g. by parsing things as dates first and parsing literals in a more deliberate way (or, at the very least, whitelisting or blacklisting certain characters before parsing).
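A minimal sketch of the kind of more deliberate parsing suggested above. This is not fastparquet's code: parse_partition_value is a hypothetical helper, and the accepted formats (ISO dates, plain integers without leading zeros, plain decimal floats) are assumptions chosen for illustration. Anything that does not match one of these shapes exactly is left as a string.

```python
import re
from datetime import date, datetime

# Assumed whitelist of shapes; everything else stays a string.
_DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
_DATETIME_RE = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")
_INT_RE = re.compile(r"[+-]?(0|[1-9]\d*)")
_FLOAT_RE = re.compile(r"[+-]?(0|[1-9]\d*)\.\d+([eE][+-]?\d+)?")


def parse_partition_value(raw):
    """Hypothetical stand-in for a stricter partition-value parser."""
    # Try dates first, so "2018-01-01" is never evaluated as arithmetic.
    try:
        if _DATE_RE.fullmatch(raw):
            return date.fromisoformat(raw)
        if _DATETIME_RE.fullmatch(raw):
            return datetime.fromisoformat(raw)
    except ValueError:
        # Right shape but not a real date (e.g. month 13): keep the string.
        return raw
    # Only accept numbers that round-trip cleanly; no underscores,
    # no leading zeros, no containers or other Python literals.
    if _INT_RE.fullmatch(raw):
        return int(raw)
    if _FLOAT_RE.fullmatch(raw):
        return float(raw)
    return raw


if __name__ == "__main__":
    for value in ["2018-01-01", "42", "1.5", "007", "1_000", "[1, 2]"]:
        print(repr(value), "->", repr(parse_partition_value(value)))
```

The design choice here is to match each value against an explicit pattern before converting it, rather than evaluating it as a general literal, so values such as "007" or "[1, 2]" keep their original string form.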
@Spacerat I think you meant to file this issue at https://github.com/dask/fastparquet/ ?
Oh gosh, yeah I did...
Re-opened here https://github.com/dask/fastparquet/issues/335
Due to this line (I think!): https://github.com/dask/fastparquet/blob/master/fastparquet/core.py#L347
The following code:
produces the following output
I would make a PR to fix this but I can't really fathom what the intention was here. Do you need fastparquet to parse certain partition values as literals for some reason or can I just remove the function call?