Discrepancies between implemented measures and original definitions

fdavidcl commented 5 years ago

Hi, I've come across two packages for data complexity measures: yours and RomeroBarata/dcme (https://github.com/RomeroBarata/dcme), which I have somewhat extended (https://github.com/fdavidcl/dcme). I am seeing some discrepancies between the results of the two packages, and between your results and the definitions in the paper:

Lorena, A. C., Garcia, L. P. F., Lehmann, J., de Souto, M. C. P., and Ho, T. K. (2018). How Complex is your classification problem? A survey on measuring classification complexity. arXiv:1808.03591

I was hoping you could help me understand if/why your package may be altering some calculations and whether this is an error or an intended effect.

Thanks in advance!

David

Here's a list of the discrepancies I found, and a minimal example using the following variables:

x <- iris[, 1:4]
y <- iris$Species == "setosa"

Overlapping

F1: the last line of the implementation (https://github.com/lpfgarcia/ECoL/blob/master/R/overlapping.R#L145) computes 1/(aux + 1). This is not indicated in the original formulation (where F1 is the maximum of the Fisher's discriminant ratios of the features), and gives strange results:

> ECoL::overlapping(x, y, measures = "F1")
0.148504
> dcme::F1(x, y)
16.66501
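
For reference, a minimal sketch of the binary Fisher's discriminant ratio, assuming the form "squared difference of class means over the sum of class variances" with sample variances. `fisher_ratio` is a hypothetical helper written for this illustration, not a function from either package. On this example its maximum over the features comes out at about 16.665, in line with the dcme output, while applying 1/(aux + 1) to that value gives about 0.057 rather than 0.148504, which suggests ECoL starts from a different underlying ratio.

```r
x <- iris[, 1:4]
y <- iris$Species == "setosa"

# Hypothetical helper: binary Fisher's discriminant ratio for one feature,
# assumed here to be (mean_a - mean_b)^2 / (var_a + var_b).
fisher_ratio <- function(feature, y) {
  a <- feature[y]
  b <- feature[!y]
  (mean(a) - mean(b))^2 / (var(a) + var(b))
}

ratios <- sapply(x, fisher_ratio, y = y)  # one ratio per feature
f1_raw <- max(ratios)                     # maximum over the features
f1_bounded <- 1 / (f1_raw + 1)            # the 1/(aux + 1) mapping applied to it

print(ratios)
print(f1_raw)
print(f1_bounded)
```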

F2: I have to check both implementations of this to see the differences with respect to the definition.

> ECoL::overlapping(x, y, measures = "F2")
0
> dcme::F2(x, y)
0.004855226
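
A rough sketch of how both numbers can arise, assuming F2 is the product over features of the class-range overlap divided by the total range. `overlap_ratio` and its `clip` argument are hypothetical names for this illustration. Without clipping, two features here have a negative "overlap" and their signs cancel in the product, giving about 0.004855; clipping negative overlaps to zero (assumed here to be the kind of correction a bounded F2 would need) gives exactly 0.

```r
x <- iris[, 1:4]
y <- iris$Species == "setosa"

# Hypothetical helper: overlap of the two class ranges on one feature,
# normalised by the full range of that feature.
overlap_ratio <- function(feature, y, clip = FALSE) {
  a <- feature[y]
  b <- feature[!y]
  overlap <- min(max(a), max(b)) - max(min(a), min(b))
  if (clip) overlap <- max(0, overlap)  # assumption: negative overlap counts as none
  overlap / (max(feature) - min(feature))
}

f2_raw <- prod(sapply(x, overlap_ratio, y = y))
f2_clipped <- prod(sapply(x, overlap_ratio, y = y, clip = TRUE))

print(f2_raw)
print(f2_clipped)
```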

F3: again, I still have to check implementations and the definition. Since F3 is higher when complexity is lower, I would assume for this example it should be 1 or close to 1.

> ECoL::overlapping(x, y, measures = "F3")
0 
> dcme::F3(x, y)
1
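
As a sanity check, a small sketch of F3 under the assumption that it is the largest fraction of examples lying outside the class-overlap interval of a single feature. `feature_efficiency` is a hypothetical helper, not taken from either package. The petal features separate setosa completely, so the maximum is 1, and its complement is 0, the same number ECoL prints; whether ECoL computes exactly this complement is an assumption.

```r
x <- iris[, 1:4]
y <- iris$Species == "setosa"

# Hypothetical helper: fraction of examples outside the interval where
# the value ranges of the two classes overlap on one feature.
feature_efficiency <- function(feature, y) {
  a <- feature[y]
  b <- feature[!y]
  lower <- max(min(a), min(b))
  upper <- min(max(a), max(b))
  inside <- feature >= lower & feature <= upper  # empty when lower > upper
  mean(!inside)
}

efficiency <- sapply(x, feature_efficiency, y = y)
f3 <- max(efficiency)      # higher means easier to separate
f3_complement <- 1 - f3    # higher means more complex

print(efficiency)
print(f3)
print(f3_complement)
```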

Dimensionality

T2: instead of the ratio of the number of examples per dimension, it seems to be the ratio of dimensions per example:

> ECoL::dimensionality(x, y, measures = "T2")
0.02666667 
> dcme::T2(x)
37.5
> 1/ECoL::dimensionality(x, y, measures = "T2")
37.5
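
The two readings of T2 can be checked directly from the shape of the data. This is only a sketch of the arithmetic, and the variable names are made up for the illustration.

```r
x <- iris[, 1:4]

n <- nrow(x)  # 150 examples
d <- ncol(x)  # 4 dimensions

t2_examples_per_dim <- n / d  # 37.5, the value dcme reports
t2_dims_per_example <- d / n  # 0.02666667, the value ECoL reports

print(t2_examples_per_dim)
print(t2_dims_per_example)
```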

T3: instead of the ratio of the number of examples per PCA dimension, it seems to be the ratio of PCA dimensions per example:

> ECoL::dimensionality(x, y, measures = "T3")
0.01333333 
> dcme::T3(x)
75
> 1/ECoL::dimensionality(x, y, measures = "T3")
75
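
A sketch of the same check for T3, assuming the number of PCA dimensions is the smallest number of components that explains at least 95% of the variance. That threshold is an assumption for this illustration and may not be what either package uses. On this data two components suffice, which gives 150/2 = 75 in one direction and 2/150 in the other.

```r
x <- iris[, 1:4]

pca <- prcomp(x)
explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)  # cumulative variance explained
m <- which(explained >= 0.95)[1]                   # assumed 95% threshold
n <- nrow(x)

t3_examples_per_pc <- n / m  # 75
t3_pcs_per_example <- m / n  # 0.01333333

print(m)
print(t3_examples_per_pc)
print(t3_pcs_per_example)
```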

Balance

C2: the last line of the implemented version (https://github.com/lpfgarcia/ECoL/blob/master/R/balance.R#L107) returns 1 - 1/aux, where aux already holds the value of C2 according to the definition:

> ECoL::balance(x, y, measures = "C2")
0.2 
> dcme::C2(y)
1.25
> # ECoL implementation without the last line
> (function(y) {
    ii <- summary(y)
    nc <- length(ii)
    aux <- ((nc - 1)/nc) * sum(ii/(length(y) - ii))
    return(aux)
  })(as.factor(y))
1.25
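
Putting the two steps side by side shows where 0.2 comes from. This sketch reuses the expression above and then applies the 1 - 1/aux step; `c2_bounded` is just a name chosen for the illustration.

```r
y <- as.factor(iris$Species == "setosa")

ii <- summary(y)   # class counts: FALSE = 100, TRUE = 50
nc <- length(ii)   # number of classes
aux <- ((nc - 1) / nc) * sum(ii / (length(y) - ii))  # 0.5 * (100/50 + 50/100)

c2_bounded <- 1 - 1 / aux  # the extra last step: 1 - 1/1.25

print(aux)
print(c2_bounded)
```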

aclorena commented 5 years ago

Hi David

We have indeed made some modifications in the current version of the package. Now all measures take bounded values and, for all of them, the higher the value, the more complex the problem. The revised paper is attached to this e-mail. We will try to update it on arXiv.

Best

Ana

On Thu, 7 Mar 2019 at 09:01, David Charte notifications@github.com wrote:

--
Prof Ana Carolina Lorena
Divisão de Ciência da Computação (IEC)
Instituto Tecnológico de Aeronáutica (ITA)

fdavidcl commented 5 years ago

Hi Ana,

Thank you very much for the quick response. It all makes sense now, I understand the alterations. Sadly, it seems that GitHub does not send email attachments, but I'll be looking forward to reading the updated version when it's published (on arxiv or elsewhere).

Best regards, David

lpfgarcia commented 5 years ago

Hi @fdavidcl

Just two additional pieces of information: the F1 implemented in the ECoL package is formulated for multiclass problems, while the one in RomeroBarata/dcme is only for binary classification. For F2, we use an equation with a correction that was made in Souto et al. (2010) and Cummins (2013).

The other cases were explained by Ana.

Kind regards, Luis

fdavidcl commented 5 years ago

Thanks for the clarifications!

Best regards, David

lpfgarcia / ECoL