Closed yusufuyanik1 closed 1 hour ago
This was a design decision - do you really think we should add support for pandas at this stage?
IMO, since we're using polars internally, it's not a strange expectation for people to just call pl.DataFrame on their pandas dataframe themselves if they really need to do the first steps in pandas.
Users should convert the pandas dataframe to polars before passing it into ADMDatamart.
import polars as pl
model_pl = pl.from_pandas(model_pd)
predictor_pl = pl.from_pandas(predictor_pd)
datamart = ADMDatamart(model_df=model_pl, predictor_df=predictor_pl)
To be precise, I would recommend that they use a lazy frame, since we assume lazy frames everywhere. So the recommended approach would be:
import polars as pl
datamart = ADMDatamart(model_df=pl.LazyFrame(model_pd), predictor_df=pl.LazyFrame(predictor_pd))
pdstools version checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pdstools.
Issue description
I can't pass a pandas dataframe to ADMDatamart. We should add unit tests to make sure everything works with pandas dataframes.
Reproducible example
AttributeError                            Traceback (most recent call last)
/var/folders/bq/fz2s5g595dg1xkwmsvjyq_v80000gq/T/ipykernel_14409/699583266.py in ?()
      8     "/Users/uyany/Documents/GitHub/pega-datascientist-tools/data/Data-Decision-ADM-PredictorBinningSnapshot_pyADMPredictorSnapshots_20210101T010000_GMT.zip"
      9 )
     10 model_pd = model_df.collect().to_pandas()
     11 predictor_pd = predictor_df.collect().to_pandas()
---> 12 datamart = ADMDatamart(model_df=model_pd, predictor_df=model_pd)

~/Documents/GitHub/pega-datascientist-tools/.venv/lib/python3.11/site-packages/pdstools/adm/ADMDatamart.py in ?(self, model_df, predictor_df, query, extract_pyname_keys)
    100         self.agb = AGB(datamart=self)
    101         self.generate = Reports(datamart=self)
    102         self.cdh_guidelines = CDHGuidelines()
    103
--> 104         self.model_data = self._validate_model_data(
    105             model_df, query=query, extract_pyname_keys=extract_pyname_keys
    106         )
    107

~/Documents/GitHub/pega-datascientist-tools/.venv/lib/python3.11/site-packages/pdstools/adm/ADMDatamart.py in ?(self, df, query, extract_pyname_keys)
    177         if df is None:
    178             logger.info("No model data available.")
    179             return df
    180
--> 181         df = _polars_capitalize(df)
    182         schema = df.collect_schema()
    183         if extract_pyname_keys and "Name" in schema.names():
    184             df = cdh_utils._extract_keys(df)

~/Documents/GitHub/pega-datascientist-tools/.venv/lib/python3.11/site-packages/pdstools/utils/cdh_utils.py in ?(df)
    574 def _polars_capitalize(df: F) -> F:
--> 575     cols = df.collect_schema().names()
    576     renamed_cols = _capitalize(cols)
    577
    578     def deduplicate(columns: List[str]):

~/Documents/GitHub/pega-datascientist-tools/.venv/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, name)
   6295             and name not in self._accessors
   6296             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6297         ):
   6298             return self[name]
-> 6299         return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'collect_schema'
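The traceback comes down to one missing method: `_polars_capitalize` calls `df.collect_schema()`, which polars frames provide and pandas DataFrames do not. The dependency-free sketch below illustrates that failure mode and one possible way to surface it earlier with a clearer message. The two frame classes are hypothetical stand-ins, not the real libraries, and `column_names` is not a pdstools function.

```python
class PolarsLikeFrame:
    """Hypothetical stand-in for a polars frame: it offers collect_schema()."""

    def __init__(self, columns):
        self._columns = list(columns)

    def collect_schema(self):
        return self  # the stand-in doubles as its own schema object

    def names(self):
        return list(self._columns)


class PandasLikeFrame:
    """Hypothetical stand-in for a pandas DataFrame: no collect_schema()."""

    def __init__(self, columns):
        self.columns = list(columns)


def column_names(df):
    """Fail early with a readable message instead of a deep AttributeError."""
    if not hasattr(df, "collect_schema"):
        raise TypeError(
            f"Expected a polars DataFrame or LazyFrame, got {type(df).__name__}; "
            "convert pandas input with pl.from_pandas() first."
        )
    return df.collect_schema().names()
```

With a check like this, passing a pandas-style object would raise a `TypeError` that names the offending type and points to the conversion, rather than failing inside pandas' attribute lookup.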
--- Version info ---
pdstools: 4.0.0a1
Platform: macOS-14.7.1-arm64-arm-64bit
Python: 3.11.9 (main, Apr 2 2024, 08:25:04) [Clang 15.0.0 (clang-1500.3.9.4)]
--- Dependencies ---
polars>=1.9: 1.13.0
typing_extensions: 4.12.2
--- Dependency group: adm ---
plotly>=5.5.0: 5.24.1
--- Dependency group: pega-io ---
polars_hash: 0.5.0
aioboto3: 13.2.0
--- Dependency group: api ---
httpx: 0.27.2
anyio: 4.6.2.post1
pydantic: 2.10.0b2
--- Dependency group: healthcheck ---
xlsxwriter>=3.0: 3.2.0
papermill: 2.6.0
great_tables>=0.13: 0.14.0
pydot: 3.0.2
quarto: 0.1.0
plotly>=5.5.0: 5.24.1
--- Dependency group: app ---
xlsxwriter>=3.0: 3.2.0
papermill: 2.6.0
great_tables>=0.13: 0.14.0
st-pages:
pydot: 3.0.2
streamlit>=1.23: 1.40.1
quarto: 0.1.0
plotly>=5.5.0: 5.24.1
--- Dependency group: onnx ---
anyio: 4.6.2.post1
onnxruntime==1.18.1: 1.18.1
pydantic: 2.10.0b2
onnx==1.16.1: 1.16.1
httpx: 0.27.2
scikit-learn==1.5.1:
skl2onnx==1.17.0: 1.17.0
--- Dependency group: all ---
xlsxwriter>=3.0: 3.2.0
great_tables>=0.13: 0.14.0
pydot: 3.0.2
onnxruntime==1.18.1: 1.18.1
onnx==1.16.1: 1.16.1
httpx: 0.27.2
quarto: 0.1.0
scikit-learn==1.5.1:
skl2onnx==1.17.0: 1.17.0
plotly>=5.5.0: 5.24.1
st-pages:
papermill: 2.6.0
anyio: 4.6.2.post1
streamlit>=1.23: 1.40.1
pydantic: 2.10.0b2
--- Dependency group: docs ---
nbsphinx:
myst-parser:
sphinx-autoapi:
furo:
sphinx-copybutton:
sphinx:
--- Dependency group: tests ---
great_tables>=0.13: 0.14.0
pytest-httpx:
coverage:
httpx: 0.27.2
quarto: 0.1.0
skl2onnx==1.17.0: 1.17.0
anyio: 4.6.2.post1
pydantic: 2.10.0b2
pytest:
testbook:
moto:
xlsxwriter>=3.0: 3.2.0
pytest-cov:
pydot: 3.0.2
onnxruntime==1.18.1: 1.18.1
onnx==1.16.1: 1.16.1
scikit-learn==1.5.1:
pytest-mock:
plotly>=5.5.0: 5.24.1
openpyxl:
st-pages:
papermill: 2.6.0
streamlit>=1.23: 1.40.1