pegasystems / pega-datascientist-tools

Pega Data Scientist Tools
https://github.com/pegasystems/pega-datascientist-tools/wiki
Apache License 2.0

AUC util funcs div by zero errors #264

Open operdeck opened 3 weeks ago

operdeck commented 3 weeks ago

pdstools version checks

Issue description

With real-life datamart data, the AUC calculation methods sometimes hit divide-by-zero errors, which surface as NumPy RuntimeWarnings:

/Users/perdo/Library/Python/3.12/lib/python/site-packages/pdstools/utils/cdh_utils.py:362: RuntimeWarning: invalid value encountered in divide
  probs = pos / (pos + neg)
/Users/perdo/Library/Python/3.12/lib/python/site-packages/pdstools/utils/cdh_utils.py:365: RuntimeWarning: invalid value encountered in divide
  FPR = np.cumsum(neg[binorder]) / np.sum(neg)
/Users/perdo/Library/Python/3.12/lib/python/site-packages/pdstools/utils/cdh_utils.py:366: RuntimeWarning: invalid value encountered in divide
  TPR = np.cumsum(pos[binorder]) / np.sum(pos)

Reproducible example

    pos = np.asarray(pos)
    neg = np.asarray(neg)
    if probs is None:
        probs = pos / (pos + neg)

    binorder = np.argsort(probs)[::-1]
    FPR = np.cumsum(neg[binorder]) / np.sum(neg)
    TPR = np.cumsum(pos[binorder]) / np.sum(pos)

    Area = (np.diff(FPR, prepend=0)) * (TPR + np.insert(np.roll(TPR, 1)[1:], 0, 0)) / 2
    return safe_range_auc(np.sum(Area))
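
To trigger the warnings without datamart data, the excerpt above can be wrapped in a standalone function and fed bin counts that contain zeros. This is only a sketch: the names `auc_unguarded` and `run_and_collect` are hypothetical, the example counts are made up, and the final `safe_range_auc` call is left out because its definition is not part of this issue.

```python
import warnings

import numpy as np


def auc_unguarded(pos, neg, probs=None):
    """Same arithmetic as the excerpt above, minus safe_range_auc (hypothetical wrapper)."""
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    if probs is None:
        probs = pos / (pos + neg)  # 0/0 for a bin with no responses

    binorder = np.argsort(probs)[::-1]
    FPR = np.cumsum(neg[binorder]) / np.sum(neg)  # 0/0 if there are no negatives
    TPR = np.cumsum(pos[binorder]) / np.sum(pos)  # 0/0 if there are no positives

    area = np.diff(FPR, prepend=0) * (TPR + np.insert(np.roll(TPR, 1)[1:], 0, 0)) / 2
    return np.sum(area)


def run_and_collect(pos, neg):
    """Run auc_unguarded and return (result, list of warning messages)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = auc_unguarded(pos, neg)
    return result, [str(w.message) for w in caught]


# Made-up counts: the first bin is empty (pos + neg == 0).
print(run_and_collect([0, 3, 5], [0, 7, 2]))

# Made-up counts: no positives at all, so np.sum(pos) == 0.
print(run_and_collect([0, 0], [5, 3]))
```

In the first case only the `probs` division warns and the result stays finite. In the second case the `TPR` division produces NaN and the summed area becomes NaN as well.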

Expected behavior

No errors or warnings. The method should guard against division by zero, not by adding 1 or 0.5, but probably by detecting the zero-denominator cases and skipping them; the algorithm needs to be double-checked first. Both auc_from_bincounts and aucpr_from_bincounts, which are very similar, are probably affected.
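
One possible shape of the "detect and skip" idea is sketched below. It is not the fix adopted in pdstools. The function name `auc_from_bincounts_guarded` and the `undefined_value` parameter are hypothetical, and it is an assumption that a fallback value is the right outcome when only one class is present. Empty bins are dropped before any division. The `safe_range_auc` step is again left out.

```python
import numpy as np


def auc_from_bincounts_guarded(pos, neg, probs=None, undefined_value=0.5):
    """Sketch: skip empty bins and bail out when the AUC is undefined (hypothetical)."""
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)

    # Detect and skip bins without any responses; they add no area to the curve.
    keep = (pos + neg) > 0
    pos, neg = pos[keep], neg[keep]
    if probs is not None:
        probs = np.asarray(probs, dtype=float)[keep]

    total_pos, total_neg = pos.sum(), neg.sum()
    if total_pos == 0 or total_neg == 0:
        # Assumption: with only one class present the AUC is undefined,
        # so return the caller-supplied fallback instead of dividing by zero.
        return undefined_value

    if probs is None:
        probs = pos / (pos + neg)  # safe: every remaining bin has pos + neg > 0

    binorder = np.argsort(probs)[::-1]
    FPR = np.cumsum(neg[binorder]) / total_neg
    TPR = np.cumsum(pos[binorder]) / total_pos

    # Trapezoid rule, same expression as in the original excerpt.
    area = np.diff(FPR, prepend=0) * (TPR + np.insert(np.roll(TPR, 1)[1:], 0, 0)) / 2
    return float(np.sum(area))


# Made-up counts: the empty first bin no longer changes the result or warns.
print(auc_from_bincounts_guarded([0, 3, 5], [0, 7, 2]))
print(auc_from_bincounts_guarded([3, 5], [7, 2]))
```

Passing `undefined_value=float("nan")` would make the undefined case explicit to the caller instead of reporting 0.5. Which of the two is preferable depends on how downstream code treats missing AUC values.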

Installed versions

```
---Version info---
pdstools: 3.4.4
Platform: macOS-14.6.1-arm64-arm-64bit
Python: 3.12.3 (v3.12.3:f6650f9ad7, Apr 9 2024, 08:18:47) [Clang 13.0.0 (clang-1300.0.29.30)]
---Dependencies---
plotly: 5.22.0
requests: 2.32.3
pydot: 2.0.0
polars: 0.20.31
pyarrow: 16.1.0
tqdm: 4.66.4
pyyaml:
aioboto3: 13.0.1
---Streamlit app dependencies---
streamlit: 1.38.0
quarto:
papermill: 2.6.0
itables:
pandas: 2.2.2
jinja2: 3.1.4
xlsxwriter: 3.2.0
```

StijnKas commented 1 week ago

Think you could squeeze this in for #260 @operdeck ?