octopize / saiph

A projection package
https://saiph.readthedocs.io
Apache License 2.0

fix dummified or undummified values not being probabilities #86

Open olivierabz opened 2 years ago

olivierabz commented 2 years ago

The values of dummy_df passed as an argument to undummify in inverse_transform are not probabilities: they do not necessarily sum to 1 and they contain negative values. This is especially visible when using MCA but can also be seen with FAMD.

Doing fit_transform followed by inverse_transform on the following df with use_max_modality=False (see the call sketch after the dataframe)

import pandas as pd

df = pd.DataFrame(
    {
        "var1": ["0", "1", "0", "0", "0", "1", "0", "1", "0", "0", "0", "1"],
        "score": ["truc", "truc2", "truc3", "truc3", "truc", "truc4", "truc", "truc2", "truc3", "truc3", "truc", "truc4"],
    }
)
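For reference, the reproduction would look roughly like the sketch below. This assumes fit_transform returns the coordinates and the fitted model and that inverse_transform accepts the use_max_modality flag mentioned above; the exact signatures may differ.

import saiph

# assumed call pattern; the dummy_df shown below is the intermediate matrix
# handed to undummify inside inverse_transform, not the frame returned here
coord, model = saiph.fit_transform(df)
reconstructed = saiph.inverse_transform(coord, model, use_max_modality=False)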

produces something that looks like this:

df:      score___truc  score___truc2  score___truc3  score___truc4
0   1.333333e+00   9.217959e-17  -1.350330e-16  -1.004477e-16
1   5.190171e-17   6.666667e-01  -2.460553e-16   1.057464e-17
2  -1.350330e-16  -1.410457e-16   1.333333e+00  -6.060742e-19
3  -1.350330e-16  -1.410457e-16   1.333333e+00  -6.060742e-19
4   1.333333e+00   9.217959e-17  -1.350330e-16  -1.004477e-16
5  -1.037181e-16   4.745672e-17  -1.056156e-16   6.666667e-01
6   1.333333e+00   9.217959e-17  -1.350330e-16  -1.004477e-16
7   5.190171e-17   6.666667e-01  -2.460553e-16   1.057464e-17
8  -1.350330e-16  -1.410457e-16   1.333333e+00  -6.060742e-19
9  -1.350330e-16  -1.410457e-16   1.333333e+00  -6.060742e-19
10  1.333333e+00   9.217959e-17  -1.350330e-16  -1.004477e-16
11 -1.037181e-16   4.745672e-17  -1.056156e-16   6.666667e-01

and the resulting cum_probability looks like:

cum_probability:      score___truc  score___truc2  score___truc3  score___truc4
0   1.333333e+00   1.333333e+00   1.333333e+00       1.333333
1   5.190171e-17   6.666667e-01   6.666667e-01       0.666667
2  -1.350330e-16  -2.760787e-16   1.333333e+00       1.333333
3  -1.350330e-16  -2.760787e-16   1.333333e+00       1.333333
4   1.333333e+00   1.333333e+00   1.333333e+00       1.333333
5  -1.037181e-16  -5.626138e-17  -1.618770e-16       0.666667
6   1.333333e+00   1.333333e+00   1.333333e+00       1.333333
7   5.190171e-17   6.666667e-01   6.666667e-01       0.666667
8  -1.350330e-16  -2.760787e-16   1.333333e+00       1.333333
9  -1.350330e-16  -2.760787e-16   1.333333e+00       1.333333
10  1.333333e+00   1.333333e+00   1.333333e+00       1.333333
11 -1.037181e-16  -5.626138e-17  -1.618770e-16       0.666667

This breaks the logic of get_random_weighted_columns, which assumes that each row of cum_probability ends with 1.0.
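
For context, a minimal sketch of how selection from a row-wise cumulative probability matrix typically works (an illustration only, not saiph's actual get_random_weighted_columns):

import numpy as np

def pick_columns(cum_probability):
    # draw one uniform value in [0, 1) per row, then take the first column
    # whose cumulative probability reaches that draw
    draws = np.random.default_rng().uniform(size=len(cum_probability))
    picked = (cum_probability.to_numpy() >= draws[:, None]).argmax(axis=1)
    return cum_probability.columns[picked]

With rows ending at 1.33 or 0.67 as in the dump above, a draw can exceed the last cumulative value and match no column, and the negative entries distort the thresholds, so the selection no longer samples from a valid distribution.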

mguillaudeux commented 2 years ago

To recover the scaled matrix (A) from the coordinates (AB) and V.T (B), we need to compute AB * B^-1. But when nf is specified, B is not a square (nf_max by nf_max) matrix but an (nf_max by nf) one, so at this stage we cannot avoid losing some information.

As a result, since the scaled values we recover differ from the original ones, there is necessarily an offset when we descale.
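
A small numpy illustration of that point (not saiph's code, just the underlying property): projecting onto a truncated set of right-singular vectors and mapping back cannot recover the original matrix once nf is below its rank.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 6))          # toy scaled data, rank 4
_, _, Vt = np.linalg.svd(A, full_matrices=False)

nf = 3
B = Vt[:nf]                          # truncated V.T: not square, so not invertible
AB = A @ B.T                         # coordinates kept by the projection
A_back = AB @ B                      # best-effort reconstruction
print(np.allclose(A, A_back))        # False as soon as nf is below the rank of A

Only when nf reaches the rank of the data does the reconstruction become exact, which is why the inverse of a restricted projection drifts from the original 0/1 indicators.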

Example:

Ori

    var1___0    var1___1    score___truc    score___truc2   score___truc3   score___truc4
0   1   0   1   0   0   0
1   0   1   0   1   0   0
2   1   0   0   0   1   0
3   1   0   0   0   1   0

Transform then inverse with nf=3

    var1___0    var1___1    score___truc    score___truc2   score___truc3   score___truc4
0   0.981898    0.018102    1.130183    -0.130183   -0.148284   0.148284
1   -0.004731   1.004731    0.034022    0.965978    -0.038753   0.038753
2   1.022178    -0.022178   -0.159499   0.159499    1.181677    -0.181677
3   0.981244    0.018756    0.134888    -0.134888   0.846356    0.153644

NB: the closer nf is to nf_max, the closer transform followed by inverse will be to the original.

mguillaudeux commented 2 years ago

Rescaling single_category in undummify with the following function may help:

import numpy as np

def NormalizeData(data):
    # column-wise min-max scale (removes negatives), then renormalize rows to sum to 1
    data = (data - np.min(data)) / (np.max(data) - np.min(data))
    data = data.div(data.sum(axis=1), axis=0)
    return data
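
An illustrative check (dummy_df stands in for the matrix shown in the issue description):

normalized = NormalizeData(dummy_df.copy())
print(normalized.min().min() >= 0)   # the column-wise min-max removes the negative values
print(normalized.sum(axis=1))        # each row now sums to 1 (unless a row collapses to all zeros)
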
mguillaudeux commented 2 years ago

Benchmark run on WBCD data with all columns treated as categorical, calling mca:

[image: mca benchmark results]

mguillaudeux commented 2 years ago

Benchmark run on WBCD mixed data, with categorical columns ['Clump_Thickness', 'Uniformity_of_Cell_Size', 'Uniformity_of_Cell_Shape', 'Marginal_Adhesion', 'Single_Epithelial_Cell_Size', 'Bare_Nuclei', 'Bland_Chromatin', 'Normal_Nucleoli'] and the remaining columns continuous, calling famd:

[image: famd benchmark results]

mguillaudeux commented 2 years ago

Same thing, but treating negative modalities as having zero chance of being chosen:

def NormalizeData(data):
    # zero out negative values so those modalities cannot be drawn
    for col in data.columns:
        data.loc[data[col] < 0, col] = 0
    # then min-max scale and renormalize each row to sum to 1
    data = (data - np.min(data)) / (np.max(data) - np.min(data))
    data = data.div(data.sum(axis=1), axis=0)
    return data

mca results

[image: mca results]

famd results

[image: famd results]

mguillaudeux commented 2 years ago

The loss of precision is inherent in the choice of a restricted nf before inverse transform. This is not an error but a linear algebra property.

The issue will be solved by adding a warning for users performing the inverse transform with nf < nf_max.
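
A possible shape for that warning, as a sketch only: the attribute names model.nf and model.nf_max are assumed here and may not match the actual Model fields.

import warnings

def inverse_transform(coord, model, *, use_max_modality=True):
    # hypothetical guard; attribute names are assumed
    if model.nf < model.nf_max:
        warnings.warn(
            "inverse_transform with nf < nf_max is an approximation: "
            "reconstructed modality values may not be valid probabilities.",
            stacklevel=2,
        )
    ...  # existing inverse transform logic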