pandas-dev / pandas


Serialization / Deserialization of ExtensionArrays #20612

Open TomAugspurger opened 6 years ago

TomAugspurger commented 6 years ago

(just getting an issue number to link to. Will update later)

What hooks do we want to provide for ExtensionArrays stored inside our containers?

Currently, to_csv works; I haven't really tried the others.

xref https://github.com/pandas-dev/pandas/pull/20611/files

jreback commented 6 years ago

wouldn’t support msgpack generally

jorisvandenbossche commented 6 years ago

For pickling, ensuring your ExtensionArray is picklable should be enough (the rest already works; the blocks pickle the values). For geopandas we have this:

https://github.com/geopandas/geopandas/blob/9cb93940bc2ac94c5d02651fe8a4fa50220827f2/geopandas/array.py#L135-L141
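
For illustration, a minimal sketch of the common case (not the geopandas code; the class and attribute names are hypothetical): an array-like whose only state is a plain numpy array pickles fine with the default machinery, and only arrays wrapping non-picklable objects need __getstate__/__setstate__ as in the link above.

import pickle

import numpy as np


class MyExtensionArray:
    # Hypothetical EA-like class; its only state is a numpy array, so the
    # default pickling of the instance __dict__ already works and the
    # pandas blocks can pickle the stored values directly.
    def __init__(self, data):
        self._data = np.asarray(data)

    # If the array wrapped a non-picklable (eg C-level) object instead,
    # __getstate__/__setstate__ would be the place to convert the state
    # to and from something picklable, as geopandas does.


arr = MyExtensionArray([1, 2, 3])
roundtripped = pickle.loads(pickle.dumps(arr))
assert (roundtripped._data == arr._data).all()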

jorisvandenbossche commented 5 years ago

I am interested in discussing the parquet / arrow aspect of this issue (see eg also https://github.com/apache/arrow/issues/4168), and I would also like to potentially use this in GeoPandas.

In general, I think it would be useful to have an interface to convert an ExtensionArray to an arrow array, and the other way around. Eg our IntegerArray EA would map perfectly to the arrow integer arrays. Having such a well-defined conversion can help in supporting EAs in to_parquet (at least for the pyarrow engine), and might also be useful for converting to arrow data structures in general (eg fletcher might find this useful).

Personally, I think this responsibility should be on the ExtensionArray implementation itself (pyarrow should not need to know about all EAs, but only about a specific interface).

So eg we could add _to_arrow / _from_arrow methods to the EA interface.

def _to_arrow(self, type=None) -> pyarrow.Array:
    # default implementation (currently already works for basic cases)
    return pyarrow.array(self, type=type)

@classmethod
def _from_arrow(cls, array: pyarrow.Array) -> ExtensionArray:
    # to be implemented by each ExtensionArray subclass
    return ...

That would allow pyarrow to call those methods when receiving an ExtensionArray, and in principle (if they want that) even restore it when converting back to pandas, if they store the extension dtype name in the pandas metadata (with something like pandas_dtype(stored_dtype_name).construct_array_type()._from_arrow(...)).
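
A short sketch of what that restore path could look like on the consuming side, assuming the dtype name stored in the pandas metadata is available (note that _from_arrow here is the proposed, hypothetical method from above, not an existing pandas API):

import pandas as pd
import pyarrow as pa


def restore_extension_column(arrow_column: pa.Array, stored_dtype_name: str):
    # Look up the registered pandas extension dtype by the name stored in
    # the "pandas" metadata of the arrow/parquet file ...
    dtype = pd.api.types.pandas_dtype(stored_dtype_name)
    # ... and ask its array class to rebuild itself from the arrow data
    # using the proposed (hypothetical) hook.
    return dtype.construct_array_type()._from_arrow(arrow_column)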

TomAugspurger commented 5 years ago

Seems reasonable to reuse arrow where it has readers / writers.

Do we foresee problems with overloading to_arrow for both conversion to arrow and parquet IO? I don’t really see any problems right now.

jorisvandenbossche commented 5 years ago

Do we foresee problems with overloading to_arrow for both conversion to arrow and parquet IO? I don’t really see any problems right now.

One case I can think of: you might have a custom data type that maps nicely to a certain arrow type (eg ragged arrays / multipolygons as lists of lists), but which is not supported by a specific IO option (eg the current arrow <-> parquet support for nested data is limited), so you might want to choose a suboptimal arrow type for that IO option (eg a binary blob instead). But 1) I am not sure we need to care about such a case right now, and 2) it might be solvable by being able to specify the resulting type in the conversion (note the type argument I have in _to_arrow above, just like __array__(self, dtype=None) in numpy's coercion), and you could then specify the required resulting schema when doing the conversion to arrow (eg we might want to add a schema keyword to to_parquet anyway, regardless of this issue).
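
As a small illustration of that escape hatch with plain pyarrow (the schema keyword on to_parquet itself is only an idea at this point): passing an explicit schema lets the caller override the arrow type that would otherwise be inferred for a column.

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3]})

# default inference gives an int64 arrow column ...
print(pa.Table.from_pandas(df, preserve_index=False).schema)

# ... but an explicit schema lets the caller choose a different arrow type,
# which is the kind of control a schema keyword on to_parquet (or the type
# argument in the _to_arrow sketch above) would expose
schema = pa.schema([pa.field("a", pa.int32())])
print(pa.Table.from_pandas(df, schema=schema, preserve_index=False).schema)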

jorisvandenbossche commented 5 years ago

To make it less pandas-EA-specific, arrow could also look for a __arrow_array__ method instead of _to_arrow.
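
That is indeed the protocol recent pyarrow versions look for: if an object defines __arrow_array__, pa.array() will call it. A minimal sketch of such an array-like (the class and its _data attribute are made up for illustration):

import numpy as np
import pyarrow as pa


class MyArray:
    # Hypothetical array-like backed by a plain numpy array.
    def __init__(self, data):
        self._data = np.asarray(data)

    def __arrow_array__(self, type=None):
        # called by pyarrow when converting this object, eg in pa.array(...)
        # or when a DataFrame column holding it is converted to a Table
        return pa.array(self._data, type=type)


arr = pa.array(MyArray([1, 2, 3]))  # -> pyarrow int64 array
print(arr.type)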

jorisvandenbossche commented 5 years ago

To give an update on the pyarrow aspect of this issue:

The pyarrow-side improvements (such as support for the __arrow_array__ protocol mentioned above) are about the pandas -> arrow conversion. To also do arrow -> pandas (to round-trip those pandas ExtensionArray columns), there is discussion in https://issues.apache.org/jira/browse/ARROW-2428. My current proposal is to add a classmethod to pandas.ExtensionDtype that knows how to create a pandas ExtensionArray from a pyarrow array (eg ExtensionDtype.__from_arrow__ or something alike). Are we OK with adding such a method in pandas?
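
For concreteness, a rough sketch of what such a hook on the dtype could look like (this mirrors the __from_arrow__ method that was later added to pandas' ExtensionDtype interface; the MyDtype class and its reuse of the nullable Int64 array below are purely illustrative):

import pandas as pd
import pyarrow as pa
from pandas.api.extensions import ExtensionDtype


class MyDtype(ExtensionDtype):
    # Illustrative dtype that simply reuses pandas' nullable Int64 array.
    name = "my_int64"
    type = int

    @classmethod
    def construct_array_type(cls):
        from pandas.arrays import IntegerArray
        return IntegerArray

    def __from_arrow__(self, array):
        # `array` is a pyarrow.Array or pyarrow.ChunkedArray coming from
        # eg a parquet file; rebuild the pandas ExtensionArray from it.
        if isinstance(array, pa.ChunkedArray):
            array = array.combine_chunks()
        return pd.array(array.to_pandas(), dtype="Int64")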

TomAugspurger commented 5 years ago

Are we OK with adding such a method in pandas?

Sorry, I'm a bit confused. This would be a method on pandas.ExtensionDtype to create a pandas ExtensionArray from an arrow Array? Why wouldn't it be a method on pandas.ExtensionArray, i.e. pandas.ExtensionArray.__from_arrow__?

jorisvandenbossche commented 5 years ago

That's also possible, but that's just another step of indirection (ExtensionDtype.construct_array_type().__from_arrow__). From pyarrow's side, we will typically have the dtype object available, not the array class. But in the end, that's not fundamentally different.

TomAugspurger commented 5 years ago

OK, either works. I just typically think of from_... methods as classmethods returning an instance of the class, but this would just be a regular method.

jorisvandenbossche commented 4 years ago

Another reason to maybe do it on the dtype is if you have multiple dtypes mapping to the same array class, like we do with IntegerArray, or like fletcher's FletcherArray (although in those two cases there is a direct mapping of the arrow type to the pandas dtype, so that would also not be a problem).
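
To make that concrete with the existing nullable integer dtypes (using only current public APIs): several dtypes share one array class, so a hook on the array class alone could not know which bit width to reconstruct.

import pandas as pd

a8 = pd.array([1, 2, None], dtype="Int8")
a64 = pd.array([1, 2, None], dtype="Int64")

# both dtypes are backed by the same array class (IntegerArray), so only the
# dtype instance itself knows whether to rebuild Int8 or Int64 data
assert type(a8) is type(a64)
print(a8.dtype, a64.dtype)  # Int8 Int64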

jmg-duarte commented 2 years ago

Hello everyone, what is the status of this issue? I am currently trying to properly handle datetime.date and datetime.time serialization to JSON; I've built an extension type, but how can I now hook into the JSON (de)serialization?

jorisvandenbossche commented 2 years ago

Some of the formats listed in the top post are handled (pickle, parquet), but I don't think we have a mechanism in place yet for hooking into JSON (de)serialization, and I don't think anybody has looked into this yet.

The actual conversion from a DataFrame into the JSON string is implemented in C:

https://github.com/pandas-dev/pandas/blob/d7eaddedcf114362b28715729530df1f24584f10/pandas/_libs/src/ujson/python/objToJSON.c#L1957

Exactly which values are extracted from the DataFrame and handed to the C code is determined at:

https://github.com/pandas-dev/pandas/blob/d7eaddedcf114362b28715729530df1f24584f10/pandas/_libs/src/ujson/python/objToJSON.c#L216

My quick guess is that if your ExtensionArray converts to a numpy array of "known" objects (eg datetime.date), it may actually already work.

Long term, I don't know what the best solution would be if you want to do something more custom than having a numpy array with JSON serializable objects.

jmg-duarte commented 2 years ago

My extension divides the date into three numpy arrays: year, month, and day. As you would expect, these are plain numbers and thus JSON serializable; however, they're being serialized as strings.

I think something is happening between the internal storage and the serialization: it seems that, before conversion to JSON, the arrays are converted back into dates.

I am open to suggestions, especially about the names of the methods I need to implement to ensure that my dtype gets placed in the schema and is used to deserialize when reading the JSON back.

My PoC is available in https://gist.github.com/jmg-duarte/4a518f3c9ff484a575f336b71f62b0e1. The code is fairly simple, I based it on https://github.com/CODAIT/text-extensions-for-pandas/blob/master/text_extensions_for_pandas/array/span.py and reviewed the pandas code it points to.

Dr-Irv commented 2 years ago

Hello everyone, what is the status of this issue? I am currently trying to properly handle datetime.date and datetime.time serialization to JSON; I've built an extension type, but how can I now hook into the JSON (de)serialization?

There are a few issues here:

  1. For our built-in extension dtypes (e.g., String), one can use convert_dtypes() after doing a read to convert to the newer dtypes.
  2. For user-provided extension dtypes, we have to figure out a way of handling this
  3. For your case of handling datetime.date and datetime.time, I'm not sure why you are using an extension dtype, because they do (de)serialize with JSON using orient='table':
In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.3.4'

In [3]: df = pd.DataFrame({"adate" : pd.to_datetime(["11/29/2021", "11/30/2021"])})

In [4]: df
Out[4]:
       adate
0 2021-11-29
1 2021-11-30

In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   adate   2 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 144.0 bytes

In [6]: rb = pd.read_json(df.to_json(orient="table"), orient="table")

In [7]: rb
Out[7]:
       adate
0 2021-11-29
1 2021-11-30

In [8]: rb.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   adate   2 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 32.0 bytes

jorisvandenbossche commented 2 years ago

3. For your case of handling datetime.date and datetime.time, I'm not sure why you are using an extension dtype, because they do (de)serialize with JSON using orient='table':

@Dr-Irv your example is using a datetime64 column, not datetime.date or datetime.time, and those are not properly handled at the moment. Starting with an object column of dates:

In [59]: import datetime

In [60]: df = pd.DataFrame({"date": [datetime.date.today()]})

In [61]: df.to_json(orient='table')
Out[61]: '{"schema":{"fields":[{"name":"index","type":"integer"},{"name":"date","type":"string"}],"primaryKey":["index"],"pandas_version":"0.20.0"},"data":[{"index":0,"date":"2021-12-01T00:00:00.000Z"}]}'

So there are 2 problems with this: the date value is written as a full timestamp string ("2021-12-01T00:00:00.000Z") instead of a date, and the schema describes the field as type "string", so the original date type cannot be restored when reading back.

jmg-duarte commented 2 years ago

Addressing point 3: What you did is side-stepping the actual problem. Putting it simply, I have actual datetime.date and datetime.time objects inside the DataFrame.

The reason why I am using an ExtensionDtype should be irrelevant (serialization should ideally work for any ExtensionDtype that adheres to the right interface), but anyway: I started because I wanted to see if I could make my ExtensionDtype (de)serialize as one would expect and thus fix my problem. During development I found that my custom type's memory usage is up to 10x smaller, so even without serialization it already helps with another problem 🤷‍♂️; regardless, without proper serialization it doesn't help much.

I will be more explicit about what happens and what I expected to happen, and provide some more context: I work for a company which handles time-series data, so we have DataFrames with columns where the dtype is object and the underlying Python type is datetime.date or datetime.time.

Currently, neither datetime.date nor datetime.time makes the round trip when serializing to JSON; example below:

import datetime as dt
import pandas as pd

date_df = pd.DataFrame({"d": [dt.date.today()]})
date_df_json = date_df.to_json(orient='table')
date_df_back = pd.read_json(date_df_json, orient='table')
assert date_df.equals(date_df_back) # fails

time_df = pd.DataFrame({"t": [dt.time()]})
time_df_json = time_df.to_json(orient='table')
time_df_back = pd.read_json(time_df_json, orient='table')
assert time_df.equals(time_df_back) # fails

date_df_json:

{'data': [{'d': '2021-12-01T00:00:00.000Z', 'index': 0}],
 'schema': {'fields': [{'name': 'index', 'type': 'integer'},
                       {'name': 'd', 'type': 'string'}],
            'pandas_version': '0.20.0',
            'primaryKey': ['index']}}

time_df_json:

{'data': [{'index': 0, 't': '00:00:00'}],
 'schema': {'fields': [{'name': 'index', 'type': 'integer'},
                       {'name': 't', 'type': 'string'}],
            'pandas_version': '0.20.0',
            'primaryKey': ['index']}}

What I was expecting was that the "type" in the schema would be "date" and "time", respectively.

The comment from @jorisvandenbossche in https://github.com/pandas-dev/pandas/issues/32037#issuecomment-588101173 makes a lot of sense and is, in fact, backwards compatible (I tested it: pandas ignores "unknown" fields, so we can extend the metadata as we wish).

Furthermore, I agree with the comment from @WillAyd (https://github.com/pandas-dev/pandas/issues/32037#issuecomment-588248570): the extra metadata should solve the problem (at least for dates it could).
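
As a rough illustration of the idea (the key and dtype name below are only illustrative; pandas already uses an extDtype key in the table schema for registered extension dtypes), the extra metadata for the date column "d" from the example above could look something like this:

# hypothetical extended field entry in the table-schema JSON for column "d"
field = {
    "name": "d",
    "type": "string",       # the JSON value itself stays a string
    "extDtype": "my_date",  # hypothetical registered extension dtype name,
                            # used on read to restore the original dtype
}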

I think this precise topic (dates) should either be moved to a new issue or to #32037, whichever is best; I'm more than happy to move the discussion there (or not move it at all).