Open machow opened 11 months ago
IMO a nice approach would be to create a tiny class, SimpleFrame, which has .to_pandas() and .to_polars() methods. This means...
- to_polars() allows IDEs to do good Polars completions

This should just involve implementing concretes in great_tables._tbl_data.py for a SimpleFrame class.
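To make the SimpleFrame idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the class body, the dict-of-lists storage, and the `columns` property are hypothetical and not part of great_tables, and the dispatch work in `_tbl_data.py` is left out. The string return annotations are what would let an IDE complete Polars or pandas methods on the result.

```python
from typing import TYPE_CHECKING, Any, Dict, List

if TYPE_CHECKING:
    # Seen only by type checkers and IDEs; neither library is imported at runtime here.
    import pandas as pd
    import polars as pl


class SimpleFrame:
    """Hypothetical minimal column store (not an existing great_tables class)."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        # Copy the lists so later changes by the caller do not leak in.
        self._columns = {name: list(values) for name, values in columns.items()}

    @property
    def columns(self) -> List[str]:
        return list(self._columns)

    def to_pandas(self) -> "pd.DataFrame":
        import pandas as pd  # deferred so Polars-only users never import pandas

        return pd.DataFrame(self._columns)

    def to_polars(self) -> "pl.DataFrame":
        import polars as pl  # deferred so pandas-only users never import Polars

        return pl.DataFrame(self._columns)


# A few made-up rows standing in for a bundled dataset.
air = SimpleFrame({"Ozone": [41.0, 36.0, None], "Temp": [67, 72, 74]})
```

With this shape, `air.to_polars()` needs only Polars installed and `air.to_pandas()` needs only pandas, since each import happens inside the method that uses it.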
@machow, I've come up with two ideas inspired by your suggestion.
# data/__init__.py
class DataFrameProxy1:
    def __init__(self, fname):
        self._fname = fname
        self._pandas = None
        self._polars = None

    @property
    def pandas(self):
        if self._pandas is None:
            import pandas as pd
            self._pandas = pd.read_csv(self._fname)
        return self._pandas

    @property
    def polars(self):
        if self._polars is None:
            import polars as pl
            # or using `pl.read_csv` directly, but need to
            # be careful of setting `dtypes`
            self._polars = pl.from_pandas(self.pandas)
        return self._polars

air: DataFrameProxy1 = DataFrameProxy1(_airquality_fname)  # type: ignore
Approach 1 allows us to use air.pandas and air.polars to select the desired dataframe. Although this will break the current syntax, it's an approach I find promising.
# data/__init__.py
class DataFrameProxy2(DataFrameProxy1):
    def __getattr__(self, name):
        # Called only when normal attribute lookup fails; forward to pandas.
        return getattr(self.pandas, name)

air: DataFrameProxy2 = DataFrameProxy2(_airquality_fname)  # type: ignore
Approach 2 allows us to treat air in a more Pandas-ish manner, as shown in the code below (although we still need to use air.polars to get the Polars dataframe).
>>> from great_tables.data import air
>>> air.head()
   Ozone  Solar_R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5
>>> air.assign(NewDay=lambda df_: df_.Day.add(1))
     Ozone  Solar_R  Wind  Temp  Month  Day  NewDay
0     41.0    190.0   7.4    67      5    1       2
1     36.0    118.0   8.0    72      5    2       3
2     12.0    149.0  12.6    74      5    3       4
3     18.0    313.0  11.5    62      5    4       5
4      NaN      NaN  14.3    56      5    5       6
..     ...      ...   ...   ...    ...  ...     ...
148   30.0    193.0   6.9    70      9   26      27
149    NaN    145.0  13.2    77      9   27      28
150   14.0    191.0  14.3    75      9   28      29
151   18.0    131.0   8.0    76      9   29      30
152   20.0    223.0  11.5    68      9   30      31

[153 rows x 7 columns]
However, this will cause GT(air) to fail since it's no longer recognized as a Pandas dataframe. Therefore, we would need to modify the code to store the actual underlying _tbl_data from air.pandas.
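The failure mode can be reproduced without pandas. The sketch below is only an illustration: `LazyProxy` and `unwrap` are hypothetical names, and a plain dict stands in for the DataFrame. It shows that `__getattr__` forwarding covers explicit attribute access but not `isinstance` checks or implicit special-method lookups such as `len()`, which Python resolves on the type rather than the instance.

```python
class LazyProxy:
    """Stand-in for DataFrameProxy2: forwards unknown attributes to a lazily built object."""

    def __init__(self, loader):
        self._loader = loader
        self._target = None

    @property
    def target(self):
        # Build the wrapped object on first use, then reuse it.
        if self._target is None:
            self._target = self._loader()
        return self._target

    def __getattr__(self, name):
        # Runs only when normal lookup fails, so _loader, _target and target never reach here.
        return getattr(self.target, name)


def unwrap(obj):
    """Return the real object behind a proxy, or obj itself if it is not a proxy."""
    return obj.target if isinstance(obj, LazyProxy) else obj


proxy = LazyProxy(lambda: {"Ozone": [41.0, 36.0]})  # a dict stands in for a DataFrame

print(list(proxy.keys()))       # explicit attribute access is forwarded to the dict
print(isinstance(proxy, dict))  # False: the proxy is not an instance of the wrapped type
try:
    len(proxy)                  # implicit dunder lookup bypasses __getattr__
except TypeError as exc:
    print("len() failed:", exc)
print(isinstance(unwrap(proxy), dict))  # True once the proxy is unwrapped
```

An `unwrap`-style step at the entry point (for example, before GT stores its table data) is one possible way to handle this; explicitly defining the needed dunder methods on the proxy would be another.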
These two approaches appear to be related to issue #8.
Currently, Great Tables includes over a dozen datasets in its .data submodule:
However, these datasets are pandas DataFrames, so polars users need to convert them:
This isn't too bad. But maybe it could be better? This issue will discuss various ways we could approach loading data for both pandas and polars users.
This is mostly me thinking out loud about different options, without a strong opinion on an approach yet.
Possible approaches
- pl.from_pandas() to convert.
- set_options(data_frame = pl.DataFrame).
- set_options(data_frame="polars").
- exibble(pl.DataFrame), or exibble("polars"), or exibble() # uses set_options() to get DataFrame
- from great_tables.data.polars import airquality, OR from great_tables.data_pl import airquality, OR from some_data_package import airquality
Desirable outcomes
Easy to perform
For example, if data is simply imported and set_options() is used, then people will need to run some code in between imports. This feels a bit kludgy. Here's an example:
At the same time, calling functions gets kind of annoying:
Helpful DataFrame completions in IDE
I'm not sure how to implement something like set_options() together with a data fetcher like airquality(). Is there a way to type it, so tools like pyright know that when an option is set to a specific value, airquality() returns a specific type of DataFrame?
Last thoughts
I like the idea of us having a great_tables.data_pl submodule, or even a separate data package for datasets, but am curious what seems most useful to folks!