hgrecco / pint-pandas

Pandas support for pint

Interoperability with pd.to_sql? #123

Open MichaelTiemannOSC opened 2 years ago

MichaelTiemannOSC commented 2 years ago

What is the recommended way to write and read pint-pandas dataframes to and from SQL databases?

Is it immediately obvious that the answer should be to convert to/from JSON? In my case, I'm interested in writing to/from a TRINO database via SQLAlchemy, but I'm happy to see answers that work for PostgreSQL, SQLite, etc. Here's a test case:

https://github.com/os-climate/data-platform-demo/blob/master/notebooks/pint-demo.ipynb

Relevant snippet:

ureg.define("CO2e = CO2 = CO2eq = CO2_eq")
co2_df = pd.DataFrame({'co2': pd.Series([Q_(1.0, 'Mt CO2'), Q_(0.9, 'Mt CO2'), Q_(0.8, 'Mt CO2')],
                                        dtype='pint[Mt CO2]'),
                       'year': [2016, 2017, 2018]})
co2_df.to_sql(ingest_table, con=engine, schema=ingest_schema, if_exists='append', index=False,
              method=osc.TrinoBatchInsert(batch_size = 100, verbose = True, optimize = True))
MichaelTiemannOSC commented 2 years ago

Answering my own question: I need to change the YEAR series to pint[dimensionless] and then write co2_df.pint.dequantify() to get a SQL table that works.

MichaelTiemannOSC commented 2 years ago

I have updated the demo notebook to show the hoops one must jump through to unpack the results that come back from pd.read_sql. The relevant snippet is:

# Each column header is a stringified (name, unit) tuple; parse it and
# re-attach the unit as a pint dtype.
new_co2_df = pd.DataFrame({
    name: series.astype(f"pint[{unit.replace('co2', 'CO2')}]")
    for col, series in new_df.items()
    for name, unit in (ast.literal_eval(col),)
})

Re-opening in case there's an obviously better way to do this.

MichaelTiemannOSC commented 2 years ago

I have updated the notebook to do a full round-trip of a dataframe that includes non-quantified data (strings) as well as columns with homogeneous units and columns with heterogeneous units.