Open oby1 opened 2 years ago
Thanks for bringing up the issue.
I tried forcing a strict schema type when converting back from pandas to a pyarrow table here. Unfortunately, the round trip to pandas and back is not transparent. One type that turned out to be tricky is pa.timestamp: I was not able to make it work without some awkward conversion code, and I am not sure that code would be robust enough.
Another approach I tried is to pass a pyarrow.Table to the TransformSpec function (instead of a pandas dataframe). However, working with pa.Table inside the transform function is inconvenient: pa.Table is immutable, and the pandas API is much more convenient for implementing a transformation.
So after all this, I would suggest sticking with the current implementation. While it is not perfect, I was not able to find a better alternative that would not require potentially fragile code.
I would appreciate your thoughts and suggestions on this matter.
Thanks for looking into this! What was the issue with pa.timestamp? I'm not seeing the timestamp-specific conversion code in https://github.com/uber/petastorm/pull/750.
For our purposes, using the proposed workaround is not a big deal, as we only use a single TransformSpec to implement nested array support as described here. The nested array support via TransformSpec is itself a bit of a hack. Has any thought been given to natively supporting nested arrays?
The issue with timestamps I ran into was the automatic conversion of the timestamp into a datetime object - it would not be automatically converted back into pa.timestamp. However, I just noticed that there is a date_as_object=False argument (*.to_pandas(date_as_object=False)) that lets me keep dates from being converted to datetime objects. Reopened #750 - let's see if I can get all tests to pass now.
The conversion back to arrow from pandas in ArrowReaderWorker._load_rows() loses type information when all rows in the loaded row group are missing values for a given column. From the pyarrow.Table.from_pandas documentation:

The result is the following error when reading the corresponding batch in the TensorFlow Dataset:

Example

Workaround

Modify all TransformSpec funcs to replace string columns missing all values with None strings.

Full Trace