pegasystems / pega-datascientist-tools

Pega Data Scientist Tools
https://github.com/pegasystems/pega-datascientist-tools/wiki
Apache License 2.0

Maintain Pandas Support #284

Closed. yusufuyanik1 closed this issue 1 hour ago.

yusufuyanik1 commented 1 hour ago

pdstools version checks

Issue description

I can't pass a pandas dataframe to ADMDatamart. We should add unit tests to make sure everything works with pandas dataframes.

Reproducible example

from pdstools.pega_io import read_ds_export
from pdstools import ADMDatamart

model_df = read_ds_export(
    "/Users/uyany/Documents/GitHub/pega-datascientist-tools/data/Data-Decision-ADM-ModelSnapshot_pyModelSnapshots_20210101T010000_GMT.zip"
)
predictor_df = read_ds_export(
    "/Users/uyany/Documents/GitHub/pega-datascientist-tools/data/Data-Decision-ADM-PredictorBinningSnapshot_pyADMPredictorSnapshots_20210101T010000_GMT.zip"
)
model_pd = model_df.collect().to_pandas()
predictor_pd = predictor_df.collect().to_pandas()
datamart = ADMDatamart(model_df=model_pd, predictor_df=model_pd)

AttributeError                            Traceback (most recent call last)
/var/folders/bq/fz2s5g595dg1xkwmsvjyq_v80000gq/T/ipykernel_14409/699583266.py in ?()
      8     "/Users/uyany/Documents/GitHub/pega-datascientist-tools/data/Data-Decision-ADM-PredictorBinningSnapshot_pyADMPredictorSnapshots_20210101T010000_GMT.zip"
      9 )
     10 model_pd = model_df.collect().to_pandas()
     11 predictor_pd = predictor_df.collect().to_pandas()
---> 12 datamart = ADMDatamart(model_df=model_pd, predictor_df=model_pd)

~/Documents/GitHub/pega-datascientist-tools/.venv/lib/python3.11/site-packages/pdstools/adm/ADMDatamart.py in ?(self, model_df, predictor_df, query, extract_pyname_keys)
    100         self.agb = AGB(datamart=self)
    101         self.generate = Reports(datamart=self)
    102         self.cdh_guidelines = CDHGuidelines()
    103
--> 104         self.model_data = self._validate_model_data(
    105             model_df, query=query, extract_pyname_keys=extract_pyname_keys
    106         )
    107

~/Documents/GitHub/pega-datascientist-tools/.venv/lib/python3.11/site-packages/pdstools/adm/ADMDatamart.py in ?(self, df, query, extract_pyname_keys)
    177         if df is None:
    178             logger.info("No model data available.")
    179             return df
    180
--> 181         df = _polars_capitalize(df)
    182         schema = df.collect_schema()
    183         if extract_pyname_keys and "Name" in schema.names():
    184             df = cdh_utils._extract_keys(df)

~/Documents/GitHub/pega-datascientist-tools/.venv/lib/python3.11/site-packages/pdstools/utils/cdh_utils.py in ?(df)
    574 def _polars_capitalize(df: F) -> F:
--> 575     cols = df.collect_schema().names()
    576     renamed_cols = _capitalize(cols)
    577
    578     def deduplicate(columns: List[str]):

~/Documents/GitHub/pega-datascientist-tools/.venv/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, name)
   6295             and name not in self._accessors
   6296             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6297         ):
   6298             return self[name]
-> 6299         return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'collect_schema'
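
The failure comes down to one missing method: `_polars_capitalize` calls `collect_schema()`, which polars frames provide and pandas frames do not. A minimal check that shows this in isolation, assuming pandas is installed; `has_collect_schema` is a hypothetical helper name for illustration and is not part of pdstools:

```python
import pandas as pd


def has_collect_schema(df) -> bool:
    """Hypothetical helper: True if `df` exposes a callable `collect_schema`."""
    return callable(getattr(df, "collect_schema", None))


# A pandas DataFrame has no `collect_schema` method, so the lookup ends in
# pandas' own __getattr__, which raises AttributeError (getattr returns None here).
pandas_frame = pd.DataFrame({"Name": ["model_a"], "Positives": [10]})
print(has_collect_schema(pandas_frame))
```

Any object that fails this check would hit the same `AttributeError` inside `_validate_model_data`.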


Expected behavior

I should be able to create an ADMDatamart object from a pandas dataframe.

Installed versions

<details>

--- Version info ---
pdstools: 4.0.0a1
Platform: macOS-14.7.1-arm64-arm-64bit
Python: 3.11.9 (main, Apr 2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]

--- Dependencies ---
polars>=1.9: 1.13.0
typing_extensions: 4.12.2

--- Dependency group: adm ---
plotly>=5.5.0: 5.24.1

--- Dependency group: pega-io ---
polars_hash: 0.5.0
aioboto3: 13.2.0

--- Dependency group: api ---
httpx: 0.27.2
anyio: 4.6.2.post1
pydantic: 2.10.0b2

--- Dependency group: healthcheck ---
xlsxwriter>=3.0: 3.2.0
papermill: 2.6.0
great_tables>=0.13: 0.14.0
pydot: 3.0.2
quarto: 0.1.0
plotly>=5.5.0: 5.24.1

--- Dependency group: app ---
xlsxwriter>=3.0: 3.2.0
papermill: 2.6.0
great_tables>=0.13: 0.14.0
st-pages:
pydot: 3.0.2
streamlit>=1.23: 1.40.1
quarto: 0.1.0
plotly>=5.5.0: 5.24.1

--- Dependency group: onnx ---
anyio: 4.6.2.post1
onnxruntime==1.18.1: 1.18.1
pydantic: 2.10.0b2
onnx==1.16.1: 1.16.1
httpx: 0.27.2
scikit-learn==1.5.1:
skl2onnx==1.17.0: 1.17.0

--- Dependency group: all ---
xlsxwriter>=3.0: 3.2.0
great_tables>=0.13: 0.14.0
pydot: 3.0.2
onnxruntime==1.18.1: 1.18.1
onnx==1.16.1: 1.16.1
httpx: 0.27.2
quarto: 0.1.0
scikit-learn==1.5.1:
skl2onnx==1.17.0: 1.17.0
plotly>=5.5.0: 5.24.1
st-pages:
papermill: 2.6.0
anyio: 4.6.2.post1
streamlit>=1.23: 1.40.1
pydantic: 2.10.0b2

--- Dependency group: docs ---
nbsphinx:
myst-parser:
sphinx-autoapi:
furo:
sphinx-copybutton:
sphinx:

--- Dependency group: tests ---
great_tables>=0.13: 0.14.0
pytest-httpx:
coverage:
httpx: 0.27.2
quarto: 0.1.0
skl2onnx==1.17.0: 1.17.0
anyio: 4.6.2.post1
pydantic: 2.10.0b2
pytest:
testbook:
moto:
xlsxwriter>=3.0: 3.2.0
pytest-cov:
pydot: 3.0.2
onnxruntime==1.18.1: 1.18.1
onnx==1.16.1: 1.16.1
scikit-learn==1.5.1:
pytest-mock:
plotly>=5.5.0: 5.24.1
openpyxl:
st-pages:
papermill: 2.6.0
streamlit>=1.23: 1.40.1



</details>
StijnKas commented 1 hour ago

This was a design decision - do you really think we should add support for pandas at this stage?

IMO, since we're using polars internally, it's not unreasonable to expect people to call pl.DataFrame on their pandas dataframe if they really need to do the first steps in pandas.

yusufuyanik1 commented 1 hour ago

Users should convert the pandas dataframe to polars before passing it to ADMDatamart.

import polars as pl

model_pl = pl.from_pandas(model_pd)
predictor_pl = pl.from_pandas(predictor_pd)

datamart = ADMDatamart(model_df=model_pl, predictor_df=predictor_pl)
StijnKas commented 1 hour ago

To be precise, I would recommend using a lazy frame, since we assume lazy frames everywhere. So the recommended approach would be:

import polars as pl

datamart = ADMDatamart(model_df=pl.LazyFrame(model_pd), predictor_df=pl.LazyFrame(predictor_pd))