pegasystems / pega-datascientist-tools

Pega Data Scientist Tools
https://github.com/pegasystems/pega-datascientist-tools/wiki
Apache License 2.0
33 stars 24 forks source link

fix safeName function #242

Closed yusufuyanik1 closed 2 weeks ago

yusufuyanik1 commented 2 months ago

There was a bug with the safeName function when we had both json and string in pyName column. When pyName column looks like this:

  1. '{"pyName":"CreditCard","pyTreatment":"Gold"}'
  2. 'Iphone'

df.select(safeName().str.json_decode()).unnest("tempName") This returns only pyName I assume because of this conflict where the pyTreatment column has both a string from first row ("Gold") and a None type from second row.

Now i am filling pyTreatment with an empty string if it does not exist and this fixes the issue.

@StijnKas please review this. @operdeck fyi this is why we didn't get the pyTreatment column.

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.

Project coverage is 59.73%. Comparing base (8c9874a) to head (669ae72). Report is 2 commits behind head on master.

Files Patch % Lines
python/pdstools/utils/cdh_utils.py 87.50% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #242 +/- ## ======================================= Coverage 59.73% 59.73% ======================================= Files 29 29 Lines 3807 3807 ======================================= Hits 2274 2274 Misses 1533 1533 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

operdeck commented 2 months ago

This assumes pyName is only encapsulating Name + Treatment, which isn't true. You can configure anything in the model context and whatever is outside of the std context fields, gets respresented as JSON in pyName. We have some customers including e.g. a category in there.

StijnKas commented 2 months ago

This assumes pyName is only encapsulating Name + Treatment, which isn't true. You can configure anything in the model context and whatever is outside of the std context fields, gets respresented as JSON in pyName. We have some customers including e.g. a category in there.

Agreed. Maybe it would be useful to have some minimum reproducible example here so we can see exactly why it was failing before, because it's not entirely clear to me and it may help us find a more general solution?

operdeck commented 2 months ago

The alt approach here is to take the unique pyNames, then only un-json the ones that start with a { into a new frame that will have all the new columns. Then merge back that new frame into the existing df. Might be faster as well. I did something along those lines in the R version. Merging back is a little tricky because columns may already exist and should be overwritten etc.

StijnKas commented 1 month ago

@yusufuyanik1 is this still relevant? If so, looks like it needs a rebase. But given by the above discussion, feels like there's still some work to be done?

yusufuyanik1 commented 4 weeks ago

json_decode was the problem, the default infer_schema_length is 100 and if there are no treatments (only pyName) in the first 100 rows, then it is not parsing other columns in pyName.

Below is a reproducible example, in the first 100 rows of pyName we have string and then comes the json. If the infer_schema_length of json_decode is not set, it won't parse pyTreatment

import polars as pl
df = pl.DataFrame(
    {
        "pyName": ["GoldCard"] * 100
        + ['{"pyName":"SilverCard","pyTreatment":"Silver_Web"}']
    }
)

col = "pyName"
def safe_name():
        return (
            pl.when(~pl.col(col).cast(pl.Utf8).str.starts_with("{"))
            .then(pl.concat_str([pl.lit('{"pyName":"'), pl.col(col), pl.lit('"}')]))
            .otherwise(pl.col(col))
        ).alias("tempName")

series = (
        df.select(
            safe_name().str.json_decode(),
        )
        .unnest("tempName")
        .lazy()
        .collect()
    )
series

Screenshot 2024-08-16 at 11 56 33 Should we set this to None? @StijnKas

operdeck commented 3 weeks ago

Sort first so any { is present in the first 100?

Group by configuration name and check for each of those separately then do an any?

Op vr 16 aug 2024 om 11:57 schreef yusufuyanik1 @.***>

json_decode was the problem, the default infer_schema_length is 100 and if there are no treatments (only pyName) in the first 100 rows, then it is not parsing other columns in pyName.

Below is a reproducible example, in the first 100 rows of pyName we have string and then comes the json. If the infer_schema_length of json_decode is not set, it won't parse pyTreatment

import polars as pl df = pl.DataFrame( { "pyName": ["GoldCard"] * 100

  • ['{"pyName":"SilverCard","pyTreatment":"Silver_Web"}'] } )

col = "pyName" def safe_name(): return ( pl.when(~pl.col(col).cast(pl.Utf8).str.starts_with("{")) .then(pl.concat_str([pl.lit('{"pyName":"'), pl.col(col), pl.lit('"}')])) .otherwise(pl.col(col)) ).alias("tempName")

series = ( df.select( safe_name().str.json_decode(), ) .unnest("tempName") .lazy() .collect() ) series

Screenshot.2024-08-16.at.11.56.33.png (view on web) https://github.com/user-attachments/assets/34601003-9237-4fee-b538-f864e4730127 Should we set this to None? @StijnKas https://github.com/StijnKas

— Reply to this email directly, view it on GitHub https://github.com/pegasystems/pega-datascientist-tools/pull/242#issuecomment-2293206762, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABEVTKEWIHW2C2PKZLXXKU3ZRXEI3AVCNFSM6AAAAABJ5MBOJ6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOJTGIYDMNZWGI . You are receiving this because you were mentioned.Message ID: @.***>

yusufuyanik1 commented 3 weeks ago

"Sort first so any { is present in the first 100?" If there is a third property it may still be unseen, for example a dataframe like below the row with pyCategory may not be up in the first 100 rows:

df = pl.DataFrame(
    {
        "pyName": ["GoldCard"] 
        + ['{"pyName":"SilverCard","pyTreatment":"Silver_Web"}'] * 200
        + ['{"pyName":"SilverCard","pyTreatment":"Silver_Web","pyCategory":"Some Category"}']

    }
)

Group by configuration name and check for each of those separately then do an any?

I don't think we should include another column here, we are currently only collecting 1 column. I don't think it would be more efficient to include configuration and sort.

operdeck commented 3 weeks ago

Maybe take just head(1) per configuration - I think either all or none of the pyName fields are json , per model rule / configuration. Unless someone changes that over time :( So perhaps by config + timestamp?

Op vr 16 aug 2024 om 14:33 schreef yusufuyanik1 @.***>

"Sort first so any { is present in the first 100?" If there is a third property it may still be unseen, for example a dataframe like below the row with pyCategory may not be up in the first 100 rows:

df = pl.DataFrame( { "pyName": ["GoldCard"]

  • ['{"pyName":"SilverCard","pyTreatment":"Silver_Web"}'] * 200
  • ['{"pyName":"SilverCard","pyTreatment":"Silver_Web","pyCategory":"Some Category"}']

    } )

Group by configuration name and check for each of those separately then do an any?

I don't think we should include another column here, we are currently only collecting 1 column. I don't think it would be more efficient to include configuration and sort.

— Reply to this email directly, view it on GitHub https://github.com/pegasystems/pega-datascientist-tools/pull/242#issuecomment-2293429126, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABEVTKAOYQ5XDMGK2XAIMEDZRXWR7AVCNFSM6AAAAABJ5MBOJ6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOJTGQZDSMJSGY . You are receiving this because you were mentioned.Message ID: @.***>

yusufuyanik1 commented 2 weeks ago

infer_schema_lenght was set to None in this commit

This operation is being applied to modelData, which is a moderately sized(in terms of row count). Also, we are already collecting this column later 2 rows below, so i don't think it is a big deal.