Hi David
We have indeed made some modifications in the current version of the package. All measures now take bounded values and, for all of them, the higher the value, the more complex the problem. The revised paper is attached to this e-mail. We will try to update it on arXiv.
Best
Ana
On Thu, Mar 7, 2019 at 09:01, David Charte notifications@github.com wrote:
Hi, I've come across two packages for data complexity measures: yours and RomeroBarata/dcme (https://github.com/RomeroBarata/dcme), which I have somewhat extended (https://github.com/fdavidcl/dcme). I am seeing some discrepancies between the results of the two packages, and between your results and the definitions given in the paper:
- Lorena, A. C., Garcia, L. P. F., Lehmann, J., de Souto, M. C. P., and Ho, T. K. (2018). How Complex is your classification problem? A survey on measuring classification complexity. arXiv:1808.03591
I was hoping you could help me understand if/why your package may be altering some calculations and whether this is an error or an intended effect.
Thanks in advance!
David
Here's a list of the discrepancies I found, and a minimal example using the following variables:
    x <- iris[, 1:4]
    y <- iris$Species == "setosa"
Overlapping
- F1: the last line of the implementation (https://github.com/lpfgarcia/ECoL/blob/master/R/overlapping.R#L145) computes 1/(aux + 1). This is not indicated in the original formulation (since F1 is the maximum of the Fisher's discriminant ratios of each feature), and gives strange results:
    ECoL::overlapping(x, y, measures = "F1")
    0.148504
    dcme::F1(x, y)
    16.66501
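As a rough cross-check of the binary case, F1 can be sketched as the largest per-feature Fisher ratio: squared difference of the class means over the sum of the class variances. The helper name `fisher_f1` is hypothetical and belongs to neither package; this is only a sketch of the binary definition quoted above, not the code of dcme or ECoL.

```r
x <- iris[, 1:4]
y <- iris$Species == "setosa"

# Hypothetical helper (not part of ECoL or dcme): binary F1 as the
# maximum per-feature Fisher discriminant ratio
fisher_f1 <- function(x, y) {
  ratios <- sapply(x, function(col) {
    (mean(col[y]) - mean(col[!y]))^2 / (var(col[y]) + var(col[!y]))
  })
  max(ratios)
}

f1 <- fisher_f1(x, y)
f1
```

On this example the sketch should land close to the 16.66501 reported by dcme, with Petal.Length as the most discriminative feature.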
- F2: I have to check both implementations of this to see the differences with respect to the definition.
    ECoL::overlapping(x, y, measures = "F2")
    0
    dcme::F2(x, y)
    0.004855226
- F3: again, I still have to check implementations and the definition. Since F3 is higher when complexity is lower, I would assume for this example it should be 1 or close to 1.
    ECoL::overlapping(x, y, measures = "F3")
    0
    dcme::F3(x, y)
    1
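The expectation that F3 should be 1 here can be checked with a small sketch of the maximum individual feature efficiency for two classes: for each feature, take the interval where the class ranges overlap and measure the share of examples falling outside it. The helper name `feature_efficiency` is hypothetical, and reading the ECoL value as the complement of this quantity is an assumption, not something confirmed from the package source.

```r
x <- iris[, 1:4]
y <- iris$Species == "setosa"

# Hypothetical helper (binary case only): share of examples outside the
# class-overlap interval, maximised over features
feature_efficiency <- function(x, y) {
  eff <- sapply(x, function(col) {
    lo <- max(min(col[y]), min(col[!y]))  # lower end of the overlap interval
    hi <- min(max(col[y]), max(col[!y]))  # upper end of the overlap interval
    mean(col < lo | col > hi)             # fraction of examples outside it
  })
  max(eff)
}

f3 <- feature_efficiency(x, y)
f3      # 1 when at least one feature separates the classes completely
1 - f3  # complement (assumed convention where higher means more complex)
```

Petal.Length separates setosa from the other two species with no overlap, so the sketch gives 1, and its complement gives 0.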
Dimensionality
- T2: instead of the ratio of the number of examples per dimension, it seems to be the ratio of dimensions per example:
    ECoL::dimensionality(x, y, measures = "T2")
    0.02666667
    dcme::T2(x)
    37.5
    1/ECoL::dimensionality(x, y, measures = "T2")
    37.5
- T3: instead of the ratio of the number of examples per PCA dimension, it seems to be the ratio of PCA dimensions per example:
    ECoL::dimensionality(x, y, measures = "T3")
    0.01333333
    dcme::T3(x)
    75
    1/ECoL::dimensionality(x, y, measures = "T3")
    75
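Both ratios can be reproduced in base R. The sketch below assumes that the "PCA dimension" is the number of principal components needed to explain 95% of the variance; that threshold and the variable names are assumptions for illustration, not taken from either package.

```r
x <- iris[, 1:4]

n <- nrow(x)  # number of examples
d <- ncol(x)  # number of dimensions

# Number of principal components needed to reach 95% of the variance
# (the 95% threshold is an assumption)
pca <- prcomp(x)
explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
d_pca <- which(explained >= 0.95)[1]

c(examples_per_dim = n / d, dims_per_example = d / n)
c(examples_per_pca_dim = n / d_pca, pca_dims_per_example = d_pca / n)
```

The two values in each pair are reciprocals of one another, which is the relationship observed between the ECoL and dcme outputs.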
Balance
- C2: the last line of the implemented version (https://github.com/lpfgarcia/ECoL/blob/master/R/balance.R#L107) returns 1 - 1/aux, where aux already holds the value of C2 according to the definition:
    ECoL::balance(x, y, measures = "C2")
    0.2
    dcme::C2(y)
    1.25
ECoL implementation without the last line
    (function(y) {
      ii <- summary(y)
      nc <- length(ii)
      aux <- ((nc - 1)/nc) * sum(ii/(length(y) - ii))
      return(aux)
    })(as.factor(y))
    1.25
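The two C2 values are tied together by the final 1 - 1/aux step. A minimal sketch, using hypothetical variable names and `table()` in place of `summary()`:

```r
y <- iris$Species == "setosa"

counts <- as.numeric(table(y))  # class sizes: 100 (FALSE) and 50 (TRUE)
nc <- length(counts)            # number of classes

# C2 as in the definition, then the bounded form 1 - 1/C2
c2_raw <- ((nc - 1) / nc) * sum(counts / (length(y) - counts))
c2_bounded <- 1 - 1 / c2_raw

c2_raw      # 1.25
c2_bounded  # 0.2
```

For a perfectly balanced problem the raw value is 1, so the bounded form is 0, and it grows towards 1 as the imbalance increases.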
--
Prof. Ana Carolina Lorena
Divisão de Ciência da Computação (IEC)
Instituto Tecnológico de Aeronáutica (ITA)
Hi Ana,
Thank you very much for the quick response. It all makes sense now, I understand the alterations. Sadly, it seems that GitHub does not send email attachments, but I'll be looking forward to reading the updated version when it's published (on arxiv or elsewhere).
Best regards, David
Hi @fdavidcl
Just two additional points: the F1 implemented in the ECoL package is formulated for multiclass problems, while the one in RomeroBarata/dcme is only for binary classification. For F2, we use an equation with a correction that was made in Souto et al. (2010) and Cummins (2013).
The other cases were explained by Ana.
Kind regards, Luis
Thanks for the clarifications!
Best regards, David