scikit-learn-contrib / qolmat

A scikit-learn-compatible module for comparing imputation methods.
https://qolmat.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

KL divergence usage #123

Closed SalvatoreRa closed 7 months ago

SalvatoreRa commented 8 months ago

Hello,

I wanted to test different metrics, but something is unclear in the documentation and I have not found a working example. For example, with the Kullback-Leibler divergence:

import pandas as pd
from qolmat.benchmark.metrics import kl_divergence
# Create two example Pandas Series
series1 = pd.Series([0.1, 0.2, 0.3, 0.4], index=['A', 'B', 'C', 'D'])
series2 = pd.Series([0.15, 0.25, 0.35, 0.25], index=['A', 'B', 'C', 'D'])

# Assuming df_mask is a DataFrame with the same index as the Series
df_mask = pd.Series([True, True, True, True])

# Compute KL divergence between the two Series
kl_div = kl_divergence(series1, series2, df_mask)

print("KL Divergence:", kl_div)

This returns an error, and I am not sure what the mask means in this case.

Thank you for your help

JulienRoussel77 commented 8 months ago

Hello @SalvatoreRa, the function kl_divergence expects DataFrames, not Series. We will consider adding a type check to make the error more understandable. Does it work for you with the following code?

import pandas as pd
from qolmat.benchmark.metrics import kl_divergence

# Create two example pandas DataFrames
df1 = pd.DataFrame([0.1, 0.2, 0.3, 0.4], index=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame([0.15, 0.25, 0.35, 0.25], index=['A', 'B', 'C', 'D'])

# Assuming df_mask is a DataFrame with the same index as the two DataFrames
df_mask = pd.DataFrame([True, True, True, True], index=['A', 'B', 'C', 'D'])

# Compute KL divergence between the two DataFrames
kl_div = kl_divergence(df1, df2, df_mask)

print("KL Divergence:", kl_div)

SalvatoreRa commented 8 months ago

Hi,

it did work. I just have a few additional questions.

Thank you very much

JulienRoussel77 commented 7 months ago

Hello, great! The mask marks the entries on which the KL divergence is computed. In our case, these are the additional NaNs added during the validation process. If you want to use this function independently of our setting, you can adapt the source code, which is rather simple and relies on scipy.stats.entropy.
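
For reference, here is a minimal standalone sketch of such an adaptation. The helper name kl_divergence_masked and the histogram binning are illustrative assumptions, not qolmat's actual implementation:

import numpy as np
import pandas as pd
from scipy.stats import entropy

def kl_divergence_masked(s1: pd.Series, s2: pd.Series, mask: pd.Series, bins: int = 20) -> float:
    # Keep only the entries selected by the mask
    x1 = s1[mask].to_numpy()
    x2 = s2[mask].to_numpy()
    # Bin both samples on a common grid so the histograms are comparable
    lo = min(x1.min(), x2.min())
    hi = max(x1.max(), x2.max())
    p, edges = np.histogram(x1, bins=bins, range=(lo, hi))
    q, _ = np.histogram(x2, bins=edges)
    # entropy(p, q) normalizes both vectors and returns sum(p * log(p / q));
    # the small epsilon avoids division by zero in empty bins
    return entropy(p + 1e-12, q + 1e-12)

s1 = pd.Series([0.1, 0.2, 0.3, 0.4], index=['A', 'B', 'C', 'D'])
s2 = pd.Series([0.15, 0.25, 0.35, 0.25], index=['A', 'B', 'C', 'D'])
mask = pd.Series([True, True, True, True], index=['A', 'B', 'C', 'D'])
print(kl_divergence_masked(s1, s2, mask))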