pegasystems / pega-datascientist-tools

Pega Data Scientist Tools
https://github.com/pegasystems/pega-datascientist-tools/wiki
Apache License 2.0

ADMDatamart init fails when data is empty and extracting keys #217

Closed operdeck closed 2 months ago

operdeck commented 3 months ago


Issue description

When passing an empty dataframe to ADMDatamart and setting extract_keys to True, the initialiser fails with an ugly polars exception.

Passing an empty dataframe can happen when filtering for certain date ranges like in some of the dashboarding applications. Extracting the treatments is standard and should probably default to True.

Reproducible example

# pdc and current_reporting_period are defined elsewhere in the calling code
df = pdc.pdc_data.filter(pl.col("ModelType") == "AdaptiveModel").filter(current_reporting_period)
adm = ADMDatamart(model_df=df, extract_keys=False)
adm.is_available

The dataframe df is empty but has a schema:

shape: (0, 20)
Column              dtype
ModelClass          str
ModelID             str
ModelName           str
ModelType           str
Name                str
Negatives           f32
Performance         f32
Positives           f32
ResponseCount       f32
SnapshotTime        date
Channel             str
Direction           str
Group               str
Issue               str
TotalPositives      str
ConfigurationName   str
TotalPredictors     i32
ActivePredictors    i32
CTR                 f32
ElapsedDays         i64

With extract_keys=False, adm.is_available ==> False

So far so good, but now

adm = ADMDatamart(model_df=df, extract_keys=True)

Errors out with

SchemaError: invalid series dtype: expected `Struct`, got `null`

Expected behavior

The initialiser should just continue without raising, and is_available should return False when asked.

StijnKas commented 2 months ago

Right, I can see how this would work. We call 'struct.field'; if the column is not a struct at all, do you think we should just leave it as null? It doesn't look like it has anything to do with is_available.

operdeck commented 2 months ago

I would expect the same return value (an empty dataframe with a schema) regardless of the value of "extract_keys". But if that costs some code complexity, we can also just return None. The caller can easily check.

StijnKas commented 2 months ago

Tiny bit of complexity added - I'm simply doing a little pre-check to pull the very first row of the name column, and checking if the length of the df is >0. If not, I just return the original dataframe. Since extracting keys is eager anyway, this seems reasonable.