olivierabz opened this issue 2 years ago
To recover the scaled matrix (A) from the coordinates (AB) and V.T (B), we must compute AB * B^-1. But when we specify nf, B is not a square (nfmax by nfmax) matrix, only (nfmax by nf), so at this stage we cannot avoid losing some information.
As a result, since the recovered scaled values differ from the original ones, there is necessarily an offset when we descale.
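This information loss can be sketched with plain NumPy (illustrative only, not saiph's actual code): projecting onto a truncated set of right singular vectors and back is lossy whenever nf < nfmax.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 6))          # "scaled" matrix; rank 4, so nfmax = 4

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep all components: projection then back-projection recovers A exactly,
# because the rows of Vt form an orthonormal basis of the row space.
coords = A @ Vt.T                     # coordinates AB, with B = V
back_full = coords @ Vt
print(np.allclose(back_full, A))      # True

# Keep only nf=3 components: the back-projection is now an orthogonal
# projection onto a 3-dimensional subspace, and information is lost.
nf = 3
coords_nf = A @ Vt[:nf].T
back_nf = coords_nf @ Vt[:nf]
print(np.allclose(back_nf, A))        # False
print(np.linalg.norm(A - back_nf))    # equals the discarded singular value s[3]
```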
Example:
Ori:

```
   var1___0  var1___1  score___truc  score___truc2  score___truc3  score___truc4
0         1         0             1              0              0              0
1         0         1             0              1              0              0
2         1         0             0              0              1              0
3         1         0             0              0              1              0
```
Transform then inverse with `nf=3`:

```
   var1___0  var1___1  score___truc  score___truc2  score___truc3  score___truc4
0  0.981898  0.018102      1.130183      -0.130183      -0.148284       0.148284
1 -0.004731  1.004731      0.034022       0.965978      -0.038753       0.038753
2  1.022178 -0.022178     -0.159499       0.159499       1.181677      -0.181677
3  0.981244  0.018756      0.134888      -0.134888       0.846356       0.153644
```
NB: the closer nf is to nfmax, the closer the transform-then-inverse result is to the original.
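That convergence can be checked directly with a truncated SVD (a standalone sketch, not the library's code): the reconstruction error shrinks as nf grows and vanishes at nf = nfmax.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The error is sqrt(sum of the squared dropped singular values), so it
# decreases monotonically and reaches ~0 at nf = nfmax.
for nf in range(1, len(s) + 1):
    back = (A @ Vt[:nf].T) @ Vt[:nf]
    print(nf, np.linalg.norm(A - back))
```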
Rescaling `single_category` in `undummify` with the following function may help:

```python
import numpy as np

def NormalizeData(data):
    # Min-max scale each column to [0, 1], then make each row sum to 1.
    data = (data - np.min(data)) / (np.max(data) - np.min(data))
    data = data.div(data.sum(axis=1), axis=0)
    return data
```
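For example, applied to raw inverse-transform output that contains negatives (toy values below, not the real WBCD output), every row becomes a valid probability distribution:

```python
import numpy as np
import pandas as pd

def NormalizeData(data):
    # Min-max scale each column to [0, 1], then make each row sum to 1.
    data = (data - np.min(data)) / (np.max(data) - np.min(data))
    return data.div(data.sum(axis=1), axis=0)

raw = pd.DataFrame({"truc": [1.13, 0.03], "truc2": [-0.13, 0.97]})
probs = NormalizeData(raw)
print(probs)
print(probs.sum(axis=1))   # each row sums to 1.0, no negative entries remain
```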
Benchmark run on WBCD, using the full categorical data to call `mca`:

Benchmark run on WBCD mixed data, using the categorical columns

[['Clump_Thickness', 'Uniformity_of_Cell_Size', 'Uniformity_of_Cell_Shape', 'Marginal_Adhesion', 'Single_Epithelial_Cell_Size', 'Bare_Nuclei', 'Bland_Chromatin', 'Normal_Nucleoli']]

and the continuous ones to call `famd`:
Same thing, but considering that negative modalities have zero chance of being chosen:

```python
import numpy as np

def NormalizeData(data):
    # Zero out negative modalities before rescaling.
    for col in data.columns:
        data.loc[data[col] < 0, col] = 0
    data = (data - np.min(data)) / (np.max(data) - np.min(data))
    data = data.div(data.sum(axis=1), axis=0)
    return data
```
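With this variant, a negative modality ends up with exactly zero probability instead of a small rescaled one (again toy values, assuming the clipping-then-rescaling function above):

```python
import numpy as np
import pandas as pd

def NormalizeData(data):
    # Zero out negatives, min-max scale columns, normalize rows to sum to 1.
    for col in data.columns:
        data.loc[data[col] < 0, col] = 0
    data = (data - np.min(data)) / (np.max(data) - np.min(data))
    return data.div(data.sum(axis=1), axis=0)

raw = pd.DataFrame({"truc": [1.13, 0.03], "truc2": [-0.13, 0.97]})
probs = NormalizeData(raw)
print(probs)   # the cell that was -0.13 now has probability exactly 0
```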
`mca` results:

`famd` results:
The loss of precision is inherent in choosing a restricted nf before the inverse transform. This is not an error but a linear-algebra property.
The issue will be resolved by adding a warning for users performing the inverse transform with nf < nfmax.
The values of `dummy_df` passed as an argument to `undummify` in `inverse_transform` are not probabilities, i.e. they do not necessarily sum to 1 and they contain negative values. This is especially visible when using MCA but can also be seen with FAMD.

Doing `fit_transform` followed by `inverse_transform` on the following df with `use_max_modality=False` produces something that looks like this:

and the resulting `cum_probability` looks like:

This breaks the logic of `get_random_weighted_columns`, which assumes that each row of `cum_probability` ends with 1.0.
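To see why that assumption matters, here is a minimal inverse-CDF sampler in the spirit of `get_random_weighted_columns` (a hypothetical reimplementation of the described logic, not saiph's actual code): each row of `cum_probability` is treated as a CDF, and a uniform draw selects the first column whose cumulative value exceeds it.

```python
import numpy as np

def get_random_weighted_columns(cum_probability, rng):
    # Draw one u ~ U[0, 1) per row and pick the first column whose
    # cumulative probability exceeds u. This is only correct when every
    # row is a valid CDF, i.e. non-decreasing and ending exactly at 1.0.
    u = rng.uniform(size=(cum_probability.shape[0], 1))
    return (u < cum_probability).argmax(axis=1)

rng = np.random.default_rng(0)
good = np.array([[0.2, 0.7, 1.0]])   # proper CDF row
bad = np.array([[-0.1, 0.3, 0.6]])   # negatives, row ends below 1.0
print(get_random_weighted_columns(good, rng))
# With `bad`, any draw u >= 0.6 matches no column, so argmax silently
# falls back to column 0 -- the sampling is wrong without any error.
```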