Enable polars users to easily access package datasets

Currently, great tables includes over a dozen datasets in its .data submodule:

# all datasets live in submodule
from great_tables.data import airquality, exibble, towny

# the exibble dataset can be fetched from the top-level
# which allows us to quickly churn out examples
from great_tables import exibble

However, these datasets are pandas DataFrames, so polars users need to convert them:

import polars as pl

pl.from_dataframe(exibble)

This isn't too bad. But maybe it could be better? This issue will discuss various ways we could approach loading data for both pandas and polars users.

This is mostly me thinking out loud about different options, without a strong opinion on an approach yet 😅.

Possible approaches

1. Leave as is. Polars folks use pl.from_pandas() to convert.
2. Use an options object to configure the DataFrame constructor.
   - E.g. set_options(data_frame=pl.DataFrame).
   - E.g. set_options(data_frame="polars").
3. Use functions to fetch each dataset. They could take a constructor argument.
   - E.g. exibble(pl.DataFrame), or
   - E.g. exibble("polars"), or
   - E.g. exibble() # uses set_options() to get DataFrame
4. pandas and polars each get their own data module.
   - E.g. from great_tables.data.polars import airquality, OR
   - E.g. from great_tables.data_pl import airquality, OR
   - E.g. from some_data_package import airquality

Desirable outcomes

Easy to perform
Helpful DataFrame completions in IDE

Easy to perform

For example, if data is simply imported and set_options() is used, then people will need to run some code in between imports. This feels a bit kludgy.

Here's an example:

from great_tables import set_options

set_options(data_frame = "polars")

from great_tables.data import airquality

At the same time, calling functions gets kind of annoying:

from great_tables import GT, set_options
from great_tables.data import airquality

set_options(data_frame = "polars")

# annoying to call, but imports can be up top
# you can change the option and call airquality again
# to get data based on the current data_frame option
df_airquality = airquality()

# maybe the workaround is calling inside GT
GT(airquality())
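To make the function-based option concrete, here is a minimal sketch of how set_options() and a fetcher like airquality() could fit together. Everything in it is hypothetical: the _OPTIONS store, the _resolve_constructor() helper, and the two-row toy columns are assumptions for illustration, not existing great_tables code.

```python
from typing import Any, Callable

# Hypothetical module-level options store (not part of great_tables today).
_OPTIONS: dict[str, Any] = {"data_frame": "pandas"}

# Toy column data standing in for the packaged CSV file.
_AIRQUALITY_COLUMNS = {"Ozone": [41.0, 36.0], "Temp": [67, 72]}


def set_options(**kwargs: Any) -> None:
    """Update global options, rejecting names that are not known options."""
    unknown = set(kwargs) - set(_OPTIONS)
    if unknown:
        raise KeyError(f"Unknown option(s): {sorted(unknown)}")
    _OPTIONS.update(kwargs)


def _resolve_constructor(data_frame: Any = None) -> Callable[[dict], Any]:
    """Map "pandas"/"polars" or a callable such as pl.DataFrame to a constructor."""
    choice = _OPTIONS["data_frame"] if data_frame is None else data_frame
    if callable(choice):
        return choice
    if choice == "pandas":
        import pandas as pd  # imported lazily so polars-only users don't need it

        return pd.DataFrame
    if choice == "polars":
        import polars as pl  # imported lazily so pandas-only users don't need it

        return pl.DataFrame
    raise ValueError(f"Unsupported data_frame option: {choice!r}")


def airquality(data_frame: Any = None) -> Any:
    """Build the dataset with the explicit argument, else the global option."""
    return _resolve_constructor(data_frame)(_AIRQUALITY_COLUMNS)
```

With this shape, airquality("polars"), airquality(pl.DataFrame), and set_options(data_frame="polars") followed by airquality() would all go through the same code path, and the imports can stay at the top of the file.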

Helpful DataFrame completions in IDE

I'm not sure how to implement something like set_options() together with a data fetcher like airquality(). Is there a way to type it so that tools like pyright know that, when an option is set to a specific value, airquality() returns a specific type of DataFrame?

from great_tables import set_options
from great_tables.data import airquality

set_options(data_frame = "polars")
airquality().<tab>                                         # shows polars methods
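As far as I know, static checkers cannot follow a runtime global such as set_options(), but they can narrow a return type from a literal argument. A rough sketch of that alternative using typing.overload follows; the data_frame parameter name and the toy columns are assumptions, not an existing API.

```python
from typing import TYPE_CHECKING, Literal, overload

if TYPE_CHECKING:
    # Only needed by the type checker; neither library is imported at runtime
    # until the matching branch below is actually taken.
    import pandas as pd
    import polars as pl

# Toy column data standing in for the packaged CSV file.
_AIRQUALITY_COLUMNS = {"Ozone": [41.0, 36.0], "Temp": [67, 72]}


@overload
def airquality(data_frame: Literal["pandas"] = ...) -> "pd.DataFrame": ...
@overload
def airquality(data_frame: Literal["polars"]) -> "pl.DataFrame": ...


def airquality(data_frame: str = "pandas"):
    """Return the dataset as the requested DataFrame type."""
    if data_frame == "pandas":
        import pandas as pd

        return pd.DataFrame(_AIRQUALITY_COLUMNS)
    if data_frame == "polars":
        import polars as pl

        return pl.DataFrame(_AIRQUALITY_COLUMNS)
    raise ValueError(f"Unsupported data_frame: {data_frame!r}")
```

With these overloads, a checker should infer a polars DataFrame for airquality("polars") and a pandas DataFrame for airquality(), so completions follow the argument rather than a global setting.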

Last thoughts

I like the idea of us having a great_tables.data_pl submodule, or even a separate data package for datasets, but am curious what seems most useful to folks!

IMO a nice approach would be to create a tiny class, SimpleFrame, which has .to_pandas() and .to_polars() methods. This means...

SimpleFrame could be input to GT (i.e. DataFrame lib agnostic rendering)
SimpleFrame methods like to_polars() allow IDEs to do good Polars completions

This should just involve implementing concrete implementations in great_tables/_tbl_data.py for a SimpleFrame class.
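For illustration, a SimpleFrame along these lines might look like the sketch below. The constructor signature, the columns property, and the dict-of-lists storage are all assumptions; only the to_pandas() and to_polars() method names come from the suggestion above.

```python
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Type-checker-only imports; the runtime imports happen inside the methods.
    import pandas as pd
    import polars as pl


class SimpleFrame:
    """Hypothetical minimal column store that converts to a real DataFrame on demand."""

    def __init__(self, columns: dict[str, list[Any]]) -> None:
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("All columns must have the same length")
        # Copy the lists so later changes by the caller don't leak in.
        self._columns = {name: list(values) for name, values in columns.items()}

    @property
    def columns(self) -> list[str]:
        """Column names in insertion order."""
        return list(self._columns)

    def __len__(self) -> int:
        """Number of rows (0 when there are no columns)."""
        return len(next(iter(self._columns.values()), []))

    def to_pandas(self) -> "pd.DataFrame":
        import pandas as pd

        return pd.DataFrame(self._columns)

    def to_polars(self) -> "pl.DataFrame":
        import polars as pl

        return pl.DataFrame(self._columns)
```

Because the return types are annotated, an IDE should offer Polars completions after .to_polars() and pandas completions after .to_pandas(), without either library being imported until it is needed.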

@machow , I've come up with two ideas inspired by your content.

Approach 1

# data/__init__.py

class DataFrameProxy1:
    def __init__(self, fname):
        self._fname = fname
        self._pandas = None
        self._polars = None

    @property
    def pandas(self):
        if self._pandas is None:
            import pandas as pd
            self._pandas = pd.read_csv(self._fname)
        return self._pandas

    @property
    def polars(self):
        if self._polars is None:
            import polars as pl
            # or using `pl.read_csv` directly, but need to
            # be careful of setting `dtypes`
            self._polars = pl.from_pandas(self.pandas)
        return self._polars

air: DataFrameProxy1 = DataFrameProxy1(_airquality_fname)  # type: ignore

Approach 1 allows us to use air.pandas and air.polars to select the desired dataframe. Although this would break the current import syntax, I find it promising.

Approach 2

# data/__init__.py

class DataFrameProxy2(DataFrameProxy1):
    def __getattr__(self, name):
        return getattr(self.pandas, name)

air: DataFrameProxy2 = DataFrameProxy2(_airquality_fname)  # type: ignore

Approach 2 allows us to treat air in a more Pandas-ish manner, as shown in the code below (although we still need to use air.polars to get the Polars dataframe).

>>> from great_tables.data import air
>>> air.head()
   Ozone  Solar_R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5
>>> air.assign(NewDay=lambda df_: df_.Day.add(1))
     Ozone  Solar_R  Wind  Temp  Month  Day  NewDay
0     41.0    190.0   7.4    67      5    1       2
1     36.0    118.0   8.0    72      5    2       3
2     12.0    149.0  12.6    74      5    3       4
3     18.0    313.0  11.5    62      5    4       5
4      NaN      NaN  14.3    56      5    5       6
..     ...      ...   ...   ...    ...  ...     ...
148   30.0    193.0   6.9    70      9   26      27
149    NaN    145.0  13.2    77      9   27      28
150   14.0    191.0  14.3    75      9   28      29
151   18.0    131.0   8.0    76      9   29      30
152   20.0    223.0  11.5    68      9   30      31

[153 rows x 7 columns]

However, this will cause GT(air) to fail, since air is no longer recognized as a Pandas dataframe. Therefore, we would need to modify the code so that GT stores the actual underlying _tbl_data from air.pandas.
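A small sketch of why the type check fails and what an unwrapping step could look like. LazyProxy is a simplified stand-in for DataFrameProxy2, unwrap_tbl_data() is a hypothetical hook rather than an existing great_tables function, and a plain list plays the role of the pandas DataFrame so the snippet runs without pandas.

```python
from typing import Any, Callable


class LazyProxy:
    """Stand-in for DataFrameProxy2: loads on first use and delegates attributes."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._loaded = None

    @property
    def pandas(self) -> Any:
        if self._loaded is None:
            self._loaded = self._loader()
        return self._loaded

    def __getattr__(self, name: str) -> Any:
        # Only reached when normal lookup fails. Refusing private names avoids
        # infinite recursion if _loader/_loaded are missing (e.g. during copying).
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(self.pandas, name)


def unwrap_tbl_data(data: Any) -> Any:
    """Hypothetical hook: swap a proxy for its real frame before type dispatch."""
    return data.pandas if isinstance(data, LazyProxy) else data


proxy = LazyProxy(lambda: [3, 1, 2])   # the list stands in for a DataFrame
delegated = proxy.count(1)             # attribute access is forwarded
is_real_type = isinstance(proxy, list) # False: delegation does not change the type
unwrapped = unwrap_tbl_data(proxy)     # the real underlying object
```

Delegating through __getattr__ makes the proxy behave like the wrapped object, but isinstance() and type-based dispatch still see the proxy class, which is why an explicit unwrapping step (or a registered implementation for the proxy type) would be needed.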

These two approaches appear to be related to issue #8.
