Closed AlexandreWadoux closed 4 years ago
The scale of the results is now different from that of the previous version. This comes from the correction of a known bug in the scaling of the final results (as reported in the NEWS file).
The distance ratios (between samples) were correctly calculated, but the final scaling of the results was not properly done. The distance between Xi and Xj was scaled by taking the square root of the mean of the squared differences and dividing it by the number of variables, i.e. sqrt(mean((Xi-Xj)^2))/ncol(Xi). The correct calculation takes the mean of the squared differences, divides it by the number of variables and then computes the square root, i.e. sqrt(mean((Xi-Xj)^2)/ncol(Xi)). This bug had no effect on the computation of the nearest neighbors, since both scalings preserve the ordering of the distances.
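As a rough illustration of the two scalings described above, here is a minimal base R sketch. It is not the internal code of resemble: the toy matrix `X` and the names `d_old` and `d_new` are made up for this example, and the two formulas are simply taken literally from the description.

```r
# Toy data (an assumption for illustration): 5 samples in rows, 4 variables in columns
set.seed(1)
X <- matrix(rnorm(20), nrow = 5, ncol = 4)
p <- ncol(X)

# Mean of the squared differences between every sample and the first sample
sq_diff <- sweep(X, 2, X[1, ])^2
m <- rowMeans(sq_diff)

# Scaling as described for the previous version: sqrt(mean(...)) / p
d_old <- sqrt(m) / p
# Scaling as described for the current version: sqrt(mean(...) / p)
d_new <- sqrt(m / p)

# The two results differ only by the constant factor sqrt(p)
all.equal(d_new, d_old * sqrt(p))

# The ranking of the samples (and hence the nearest neighbors) is the same
identical(order(d_old), order(d_new))
```

Because the two versions differ only by a positive constant, the new distances are larger by a factor of sqrt(p) while the neighbor ordering stays the same.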
The following code might help to understand how the scaling is now done:
library(prospectr) # provides the NIRsoil data
library(resemble)  # provides f_diss()
data(NIRsoil)
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
# Mahalanobis distance computed on the first 20 spectral variables
n_variables <- 20
# resemble: dissimilarity between every sample and the first sample
md <- f_diss(
  Xr[, 1:n_variables],
  Xr[1, 1:n_variables, drop = FALSE],
  "mahalanobis",
  center = FALSE
)
# base R (stats::mahalanobis): returns the squared Mahalanobis distance
md_r <- mahalanobis(
  Xr[, 1:n_variables],
  center = Xr[1, 1:n_variables, drop = FALSE],
  cov = cov(Xr[, 1:n_variables])
)
md_r <- sqrt(md_r / n_variables) # scaling using the number of variables
# both results should now be on the same scale
plot(md, md_r)
With the new change in the f_diss function, I obtain different results when running my code.
and the plot:
now it gives me a much larger distance: