Closed yusufuyanik1 closed 2 weeks ago
Attention: Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.
Project coverage is 59.73%. Comparing base (8c9874a) to head (669ae72). Report is 2 commits behind head on master.
| Files | Patch % | Lines |
|---|---|---|
| python/pdstools/utils/cdh_utils.py | 87.50% | 1 Missing :warning: |
:umbrella: View full report in Codecov by Sentry.
This assumes pyName only encapsulates Name + Treatment, which isn't true. You can configure anything in the model context, and whatever is outside of the standard context fields gets represented as JSON in pyName. We have some customers including e.g. a category in there.
Agreed. Maybe it would be useful to have a minimal reproducible example here so we can see exactly why it was failing before; it's not entirely clear to me, and it may help us find a more general solution.
The alternative approach here is to take the unique pyNames, then un-JSON only the ones that start with a { into a new frame that will have all the new columns, and merge that new frame back into the existing df. It might be faster as well; I did something along those lines in the R version. Merging back is a little tricky because columns may already exist and should be overwritten, etc.
@yusufuyanik1 is this still relevant? If so, looks like it needs a rebase. But given by the above discussion, feels like there's still some work to be done?
json_decode was the problem: its default infer_schema_length is 100, and if there are no treatments (only pyName) in the first 100 rows, the other columns embedded in pyName are not parsed.
Below is a reproducible example: the first 100 rows of pyName hold a plain string, and only then comes the JSON. If infer_schema_length of json_decode is not set, it won't parse pyTreatment:
```python
import polars as pl

df = pl.DataFrame(
    {
        "pyName": ["GoldCard"] * 100
        + ['{"pyName":"SilverCard","pyTreatment":"Silver_Web"}']
    }
)

col = "pyName"

def safe_name():
    return (
        pl.when(~pl.col(col).cast(pl.Utf8).str.starts_with("{"))
        .then(pl.concat_str([pl.lit('{"pyName":"'), pl.col(col), pl.lit('"}')]))
        .otherwise(pl.col(col))
    ).alias("tempName")

series = (
    df.select(
        safe_name().str.json_decode(),
    )
    .unnest("tempName")
    .lazy()
    .collect()
)
series
```
Should we set this to None? @StijnKas
Sort first so any { is present in the first 100?
Group by configuration name and check for each of those separately then do an any?
"Sort first so any { is present in the first 100?" If there is a third property it may still be unseen, for example a dataframe like below the row with pyCategory may not be up in the first 100 rows:
```python
df = pl.DataFrame(
    {
        "pyName": ["GoldCard"]
        + ['{"pyName":"SilverCard","pyTreatment":"Silver_Web"}'] * 200
        + ['{"pyName":"SilverCard","pyTreatment":"Silver_Web","pyCategory":"Some Category"}']
    }
)
```
"Group by configuration name and check for each of those separately then do an any?" I don't think we should include another column here; we are currently only collecting one column, and I don't think it would be more efficient to include the configuration and sort.
Maybe take just head(1) per configuration. I think either all or none of the pyName fields are JSON, per model rule / configuration. Unless someone changes that over time :( So perhaps by config + timestamp?
infer_schema_length was set to None in this commit.
This operation is being applied to modelData, which is only moderately sized in terms of row count. Also, we are already collecting this column two lines below, so I don't think it is a big deal.
There was a bug with the safeName function when we had both JSON and plain strings in the pyName column. In that case,

```python
df.select(safeName().str.json_decode()).unnest("tempName")
```

returns only pyName, I assume because of a conflict where the pyTreatment column has both a string from the first row ("Gold") and a None from the second row. Now I am filling pyTreatment with an empty string if it does not exist, and this fixes the issue.
@StijnKas please review this. @operdeck fyi this is why we didn't get the pyTreatment column.