TomAugspurger opened this issue 6 years ago
We wouldn’t support msgpack generally.
For pickling, ensuring your ExtensionArray is picklable should be enough (the rest already works; the blocks pickle the values). For geopandas we have this:
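The pickling claim above can be checked with the built-in nullable `Int64` extension dtype (a minimal sketch of my own, not the geopandas code referenced above):

```python
import pickle

import pandas as pd

# A DataFrame holding an ExtensionArray column (nullable Int64).
df = pd.DataFrame({"a": pd.array([1, 2, None], dtype="Int64")})

# The blocks pickle the ExtensionArray values directly, so a plain
# pickle roundtrip preserves the extension dtype.
restored = pickle.loads(pickle.dumps(df))

assert str(restored["a"].dtype) == "Int64"
assert df.equals(restored)
```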
I am interested in discussing the parquet / arrow aspect of this issue (see eg also https://github.com/apache/arrow/issues/4168), and I would also like to potentially use this in GeoPandas.
In general, I think it would be useful to have an interface to convert an ExtensionArray to an arrow array, and the other way around.
Eg our IntegerArray EA would map perfectly to the arrow integer arrays. Having such a well-defined conversion can help in supporting EAs in `to_parquet` (at least for the pyarrow engine), and might also be useful for converting to arrow data structures in general (eg fletcher might find this useful).
Personally, I think this responsibility should be on the ExtensionArray implementation itself (pyarrow should not need to know about all EAs, but only about a specific interface).
So eg we could add `_to_arrow` / `_from_arrow` methods to the EA interface:
```python
def _to_arrow(self, type=None) -> pyarrow.Array:
    # default implementation (currently already works for basic cases)
    return pyarrow.array(self, type=type)

@classmethod
def _from_arrow(cls, array: pyarrow.Array) -> ExtensionArray:
    return ...
```
That would allow pyarrow to call those methods when receiving an ExtensionArray, and in principle (if they want that) even restore it when converting back to pandas, if they store the extension dtype name in the pandas metadata (with something like `pandas_dtype(stored_dtype_name).construct_array_type()._from_arrow(...)`).
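The `pandas_dtype(...).construct_array_type()` lookup chain already works today for registered extension dtypes; here is a small sketch using the built-in `Int64` dtype and `_from_sequence` (the constructor that exists on the interface today, since `_from_arrow` is only proposed):

```python
from pandas.api.types import pandas_dtype

# Resolve a stored dtype name back to its dtype object and array class,
# as pyarrow could do from the "pandas" metadata it stores.
dtype = pandas_dtype("Int64")             # a registered extension dtype
array_cls = dtype.construct_array_type()  # the matching ExtensionArray class

# `_from_arrow` is only proposed in this thread; `_from_sequence` is the
# analogous constructor that already exists on the ExtensionArray interface.
arr = array_cls._from_sequence([1, None, 3], dtype=dtype)

assert arr.dtype == dtype
assert arr.isna().tolist() == [False, True, False]
```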
Seems reasonable to reuse arrow where it has readers / writers.
Do we foresee problems with overloading `to_arrow` for both conversion to arrow and parquet IO? I don’t really see any problems right now.
> Do we foresee problems with overloading `to_arrow` for both conversion to arrow and parquet IO? I don’t really see any problems right now.
One case I can think of: a custom data type that maps nicely to a certain arrow type (eg ragged arrays / multipolygons as lists of lists) but is not supported by a specific IO option (eg the current arrow <-> parquet support for nested data is limited), so you might want to choose a suboptimal arrow type for that IO option (eg a binary blob instead).

But, 1) I am not sure we need to care about such a case right now, and 2) that might be solvable by being able to specify the resulting type in the conversion (note the `type` argument I have in `_to_arrow` above, just like `__array__(self, dtype=None)` in numpy's coercion); you could then specify the required resulting schema when doing the conversion to arrow (eg we might want to add a `schema` keyword to `to_parquet` anyway, regardless of this issue).
To have it less pandas-EA-specific, arrow could also look for an `__arrow_array__` method instead of `_to_arrow`.
To update on the pyarrow aspect of this issue:

- There is now a `__arrow_array__` protocol in pyarrow to convert objects passed to `pyarrow.array(..)` (https://github.com/apache/arrow/pull/5106).
- This is implemented for `IntegerArray`, so DataFrames with nullable integer columns convert to arrow nicely (and write to parquet). They don't yet come back as nullable integer, though; see below.
- The same is being done for `PeriodDtype` and `IntervalDtype` when converting those to a native pyarrow type. See https://github.com/pandas-dev/pandas/pull/28371 for a WIP PR on that.

Those improvements are about pandas -> arrow conversion. To also do arrow -> pandas (to roundtrip those pandas ExtensionArray columns), there is discussion in https://issues.apache.org/jira/browse/ARROW-2428.
A current proposal I did is to add a classmethod to `pandas.ExtensionDtype` that knows how to create a pandas ExtensionArray from a pyarrow array (eg `ExtensionDtype.__from_arrow__` or similar).
Are we OK with adding such a method in pandas?
> Are we OK with adding such a method in pandas?
Sorry, I'm a bit confused. This would be a method on `pandas.ExtensionDtype` to create a pandas `ExtensionArray` from an arrow `Array`? Why wouldn't it be a method on `pandas.ExtensionArray` (i.e. `ExtensionArray.__from_arrow__`)?
That's also possible, but that's just another step of indirection (`ExtensionDtype.construct_array_type().__from_arrow__`). From pyarrow, we will typically have the dtype object available, not the array class. But in the end, that's not fundamentally different.
OK, either works. I just typically think of `from_...` methods as classmethods returning an instance of the class, but this would be a regular method.
Another reason to maybe do it on the dtype, is if you have multiple dtypes mapping to the same array class. Like we do with IntegerArray, or fletcher's FletcherArray (although in those two cases, there is a direct mapping of the arrow type to the pandas dtype, so that would also not be a problem)
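That many-to-one mapping is easy to see with the built-in nullable integer dtypes (a quick check of my own, not part of the proposal itself):

```python
import pandas as pd

# Two distinct extension dtypes...
int8 = pd.Int8Dtype()
int64 = pd.Int64Dtype()

# ...resolve to the very same ExtensionArray class (IntegerArray), so the
# array class alone does not identify the dtype; the dtype object does.
assert int8.construct_array_type() is int64.construct_array_type()
assert int8 != int64
```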
Hello everyone, what is the status on this issue? I am currently trying to properly handle `datetime.date` and `datetime.time` serialization to JSON. I've built an extension type, but how can I hook into the JSON (de)serialization?
Some of the formats listed in the top post are handled (pickle, parquet), but I don't think we already have a mechanism in place for hooking into JSON (de)serialization. And I don't think anybody already looked into this.
The actual conversion from a DataFrame into the JSON string is implemented in C:

Exactly which values are extracted from the DataFrame and handled in the C code is determined at:

My quick guess: if your ExtensionArray converts to a numpy array of "known" objects (eg `datetime.date`), I would expect it to already work.

Long term, I don't know what the best solution would be if you want to do something more custom than a numpy array of JSON-serializable objects.
My extension divides the date into three numpy arrays: year, month and day. As you would expect, these are plain numbers and thus JSON serializable; however, they're being serialized as strings. Something seems to happen between the internal storage and the serialization: it appears the arrays are converted back into dates before being serialized.
I am open to suggestions, especially for the names of the methods I need to implement to ensure that my `dtype` gets placed in the schema and is used to deserialize when reading the JSON.
My PoC is available in https://gist.github.com/jmg-duarte/4a518f3c9ff484a575f336b71f62b0e1. The code is fairly simple, I based it on https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/array/span.py and reviewed the pandas code it points to.
> Hello everyone, what is the status on this issue? I am currently trying to properly handle `datetime.date` and `datetime.time` serialization to JSON. I've built an extension type, but how can I hook into the JSON (de)serialization?
There are a few issues here:

2. For the newer extension dtypes (e.g. `String`), one can use `convert_dtypes()` after doing a read to convert to the newer dtypes.
3. For your case of handling `datetime.date` and `datetime.time`, I'm not sure why you are using an extension dtype, because they do (de)serialize with JSON using `orient='table'`:

```python
In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.3.4'

In [3]: df = pd.DataFrame({"adate" : pd.to_datetime(["11/29/2021", "11/30/2021"])})

In [4]: df
Out[4]:
       adate
0 2021-11-29
1 2021-11-30

In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   adate   2 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 144.0 bytes

In [6]: rb = pd.read_json(df.to_json(orient="table"), orient="table")

In [7]: rb
Out[7]:
       adate
0 2021-11-29
1 2021-11-30

In [8]: rb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   adate   2 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 32.0 bytes
```
> 3. For your case of handling `datetime.date` and `datetime.time`, I'm not sure why you are using an extension dtype, because they do (de)serialize with JSON using `orient='table'`:
@Dr-Irv your example is using a datetime64 column, not `datetime.date` or `datetime.time`. And those are not properly handled at the moment. Starting with an object column of dates:
```python
In [59]: import datetime

In [60]: df = pd.DataFrame({"date": [datetime.date.today()]})

In [61]: df.to_json(orient='table')
Out[61]: '{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"date","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"date":"2021-12-01T00:00:00.000Z"}]}'
```
So there are 2 problems with this:

1. The date is serialized as a timestamp string (`"2021-12-01T00:00:00.000Z"`) and not as a date.
2. The schema uses the `"string"` type for the date column. That is because we currently simply use "string" for any object dtype column, which is not really correct (this is also being discussed in https://github.com/pandas-dev/pandas/issues/32037).

Addressing point 3: What you did is side-stepping the actual problem. Put simply, I have actual `datetime.date` and `datetime.time` objects inside the DataFrame.
The reason why I am using an `ExtensionDtype` should be irrelevant (serialization should ideally work for any `ExtensionDtype` as long as it adheres to the right interface), but anyway: I started because I wanted to see if I could make my `ExtensionDtype` (de)serialize as I wanted (as one would expect) and thus fix my problem. During development I found that my custom type's memory usage is up to 10x smaller. So even without serialization it already helps with another problem 🤷♂️; regardless, without proper serialization it doesn't help much.
I will be more explicit about what happens and what I expected to happen, and provide some more context: I work for a company which handles timeseries data, so we have DataFrames with columns whose dtype is `object` and whose underlying Python type is `datetime.date` or `datetime.time`.

Currently, neither `datetime.date` nor `datetime.time` makes a roundtrip when serializing to JSON; example below:
```python
import datetime as dt

import pandas as pd

date_df = pd.DataFrame({"d": [dt.date.today()]})
date_df_json = date_df.to_json(orient='table')
date_df_back = pd.read_json(date_df_json, orient='table')
assert date_df.equals(date_df_back)  # fails

time_df = pd.DataFrame({"t": [dt.time()]})
time_df_json = time_df.to_json(orient='table')
time_df_back = pd.read_json(time_df_json, orient='table')
assert time_df.equals(time_df_back)  # fails
```
`date_df_json`:

```python
{'data': [{'d': '2021-12-01T00:00:00.000Z', 'index': 0}],
 'schema': {'fields': [{'name': 'index', 'type': 'integer'},
                       {'name': 'd', 'type': 'string'}],
            'pandas_version': '0.20.0',
            'primaryKey': ['index']}}
```
`time_df_json`:

```python
{'data': [{'index': 0, 't': '00:00:00'}],
 'schema': {'fields': [{'name': 'index', 'type': 'integer'},
                       {'name': 't', 'type': 'string'}],
            'pandas_version': '0.20.0',
            'primaryKey': ['index']}}
```
What I was expecting was that `type` would be equal to `date` and `time`, respectively.
The comment from @jorisvandenbossche in https://github.com/pandas-dev/pandas/issues/32037#issuecomment-588101173 makes a lot of sense and it is in fact, backwards compatible (I tested it, pandas ignores "unknown" fields and thus we can extend the metadata as we wish).
Furthermore, I agree with the comment from @WillAyd (https://github.com/pandas-dev/pandas/issues/32037#issuecomment-588248570), the extra metadata should solve the problem (at least for dates it could).
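That backwards-compatibility observation is easy to reproduce: the Table Schema reader only consumes the keys it knows, so an extra field key survives a write/patch/read cycle without breaking the roundtrip. A minimal sketch (where `x-date-kind` is a made-up key purely for demonstration):

```python
import json
from io import StringIO

import pandas as pd

df = pd.DataFrame({"d": [1, 2]})
payload = json.loads(df.to_json(orient="table"))

# Attach extra, unknown metadata to every field in the schema.
# "x-date-kind" is a hypothetical key, not anything pandas defines.
for field in payload["schema"]["fields"]:
    field["x-date-kind"] = "date"

# pandas ignores the unknown keys when reading the table back.
back = pd.read_json(StringIO(json.dumps(payload)), orient="table")

assert back["d"].tolist() == [1, 2]
```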
I think this precise topic (dates) should either be moved to a new issue or to #32037, whichever is best; I'm more than happy to move the discussion there (or not move it at all).
(just getting an issue number to link to. Will update later)
What hooks do we want to provide for ExtensionArrays stored inside our containers?
Currently, `to_csv` works. Haven't really tried the others.
xref https://github.com/pandas-dev/pandas/pull/20611/files