lpfgarcia / ECoL

Extended Complexity Library in R

Discrepancies between implemented measures and original definitions #45

Closed fdavidcl closed 5 years ago

fdavidcl commented 5 years ago

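Hi, I've come across two packages for data complexity measures: yours and RomeroBarata/dcme, which I have somewhat extended. I am seeing some discrepancies among the results from both packages, and from yours and the definitions included in the paper:

  • Lorena, A. C., Garcia, L. P. F., Lehmann, J., de Souto, M. C. P., and Ho, T. K. (2018). How Complex is your classification problem? A survey on measuring classification complexity. arXiv:1808.03591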

I was hoping you could help me understand if/why your package may be altering some calculations and whether this is an error or an intended effect.

Thanks in advance!

David


Here's a list of the discrepancies I found, and a minimal example using the following variables:

x <- iris[, 1:4]
y <- iris$Species == "setosa"

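Overlapping

> ECoL::overlapping(x, y, measures = "F1")
0.148504
> dcme::F1(x, y)
16.66501

  • F2: I have to check both implementations of this to see the differences with respect to the definition.

> ECoL::overlapping(x, y, measures = "F2")
0
> dcme::F2(x, y)
0.004855226

  • F3: again, I still have to check implementations and the definition. Since F3 is higher when complexity is lower, I would assume for this example it should be 1 or close to 1.

> ECoL::overlapping(x, y, measures = "F3")
0
> dcme::F3(x, y)
1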

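Dimensionality

  • T2: instead of the ratio of number of examples per dimension, seems to be the ratio of dimensions per example:

> ECoL::dimensionality(x, y, measures = "T2")
0.02666667
> dcme::T2(x)
37.5
> 1/ECoL::dimensionality(x, y, measures = "T2")
37.5

  • T3: instead of the ratio of number of examples per PCA dimension, seems to be the ratio of PCA dimensions per example:

> ECoL::dimensionality(x, y, measures = "T3")
0.01333333
> dcme::T3(x)
75
> 1/ECoL::dimensionality(x, y, measures = "T3")
75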

Balance

> ECoL::balance(x, y, measures = "C2")
0.2 
> dcme::C2(y)
1.25
> # ECoL implementation without the last line
> (function(y) {
    ii <- summary(y)
    nc <- length(ii)
    aux <- ((nc - 1)/nc) * sum(ii/(length(y) - ii))
    return(aux)
  })(as.factor(y))
1.25

aclorena commented 5 years ago

Hi David

We have indeed made some modifications in the current version of the package. All measures now take bounded values and, for all of them, a higher value indicates a more complex problem. The revised paper is attached to this e-mail. We will try to update it on arXiv.

Best

Ana

On Thu, Mar 7, 2019 at 09:01, David Charte notifications@github.com wrote:

Hi, I've come across two packages for data complexity measures: yours and RomeroBarata/dcme https://github.com/RomeroBarata/dcme, which I have somewhat extended https://github.com/fdavidcl/dcme. I am seeing some discrepancies among the results from both packages, and from yours and the definitions included in the paper:

  • Lorena, A. C., Garcia, L. P. F., Lehmann, J., de Souto, M. C. P., and Ho, T. K. (2018). How Complex is your classification problem? A survey on measuring classification complexity. arXiv:1808.03591

I was hoping you could help me understand if/why your package may be altering some calculations and whether this is an error or an intended effect.

Thanks in advance!

David

Here's a list of the discrepancies I found, and a minimal example using the following variables:

x <- iris[, 1:4]
y <- iris$Species == "setosa"

Overlapping

> ECoL::overlapping(x, y, measures = "F1")
0.148504
> dcme::F1(x, y)
16.66501

  • F2: I have to check both implementations of this to see the differences with respect to the definition.

> ECoL::overlapping(x, y, measures = "F2")
0
> dcme::F2(x, y)
0.004855226

  • F3: again, I still have to check implementations and the definition. Since F3 is higher when complexity is lower, I would assume for this example it should be 1 or close to 1.

> ECoL::overlapping(x, y, measures = "F3")
0
> dcme::F3(x, y)
1

Dimensionality

  • T2: instead of the ratio of number of examples per dimension, seems to be the ratio of dimensions per example:

> ECoL::dimensionality(x, y, measures = "T2")
0.02666667
> dcme::T2(x)
37.5
> 1/ECoL::dimensionality(x, y, measures = "T2")
37.5

  • T3: instead of the ratio of number of examples per PCA dimension, seems to be the ratio of PCA dimensions per example:

> ECoL::dimensionality(x, y, measures = "T3")
0.01333333
> dcme::T3(x)
75
> 1/ECoL::dimensionality(x, y, measures = "T3")
75

Balance

> ECoL::balance(x, y, measures = "C2")
0.2
> dcme::C2(y)
1.25
> # ECoL implementation without the last line
> (function(y) {
    ii <- summary(y)
    nc <- length(ii)
    aux <- ((nc - 1)/nc) * sum(ii/(length(y) - ii))
    return(aux)
  })(as.factor(y))
1.25

-- Prof Ana Carolina Lorena Divisão de Ciência da Computação (IEC) Instituto Tecnológico de Aeronáutica (ITA)
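
As a rough check of the rescaling described in the reply above (bounded values, higher meaning more complex), the following sketch reproduces the T2, T3 and C2 figures from the report using base R only. It is not taken from the ECoL source: the 95% cumulative-variance threshold for T3 and the `1 - 1/IR` form for C2 are assumptions inferred from the reported numbers, and all variable names are illustrative.

```r
x <- iris[, 1:4]
y <- iris$Species == "setosa"

n <- nrow(x)  # number of examples (150)
m <- ncol(x)  # number of features (4)

# T2: examples per dimension (unbounded form) vs. dimensions per example
t2_ratio   <- n / m   # 37.5
t2_bounded <- m / n   # 0.02666667

# T3: same idea, but with the number of PCA components needed to reach
# 95% of the variance (ASSUMPTION: the threshold is not stated in the thread)
pca     <- prcomp(x)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
m_pca   <- which(cum_var >= 0.95)[1]  # 2 components for iris
t3_ratio   <- n / m_pca   # 75
t3_bounded <- m_pca / n   # 0.01333333

# C2: imbalance ratio as in the snippet from the report, then mapped
# into [0, 1) (ASSUMPTION: the final step is 1 - 1/IR)
counts <- as.numeric(table(y))
nc     <- length(counts)
c2_ratio   <- ((nc - 1) / nc) * sum(counts / (n - counts))  # 1.25
c2_bounded <- 1 - 1 / c2_ratio                              # 0.2
```

Under these assumptions the bounded T2 and T3 values are the reciprocals of the ratio forms, which matches the `1/ECoL::dimensionality(...)` checks in the report, and the bounded C2 value matches the 0.2 returned by `ECoL::balance`.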

fdavidcl commented 5 years ago

Hi Ana,

Thank you very much for the quick response. It all makes sense now, I understand the alterations. Sadly, it seems that GitHub does not send email attachments, but I'll be looking forward to reading the updated version when it's published (on arxiv or elsewhere).

Best regards, David

lpfgarcia commented 5 years ago

Hi @fdavidcl

Just two additional remarks: the F1 implemented in the ECoL package is formulated for multiclass problems, while the one in RomeroBarata/dcme is only for binary classification. For F2, we use an equation with a correction that was made in Souto et al. (2010) and Cummins (2013).

The other cases were explained by Ana.

Kind regards, Luis
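
To illustrate how a correction to F2 can account for the two values in the report, here is a minimal base R sketch for the binary case. It assumes the correction amounts to clamping each per-feature overlap at zero before taking the product, as in the definition given in the survey paper; this is an assumption, not a reading of the ECoL or dcme source, and `f2_sketch` and its `clamp` argument are hypothetical names.

```r
x <- iris[, 1:4]
y <- iris$Species == "setosa"

# Volume of the per-feature overlap region between two classes.
# clamp = TRUE sets negative overlaps (disjoint intervals) to zero.
f2_sketch <- function(x, y, clamp = TRUE) {
  a <- x[y, , drop = FALSE]
  b <- x[!y, , drop = FALSE]

  min_a <- sapply(a, min); max_a <- sapply(a, max)
  min_b <- sapply(b, min); max_b <- sapply(b, max)

  # Length of the interval where both classes have values; negative
  # when the two class intervals do not intersect on that feature
  overlap <- pmin(max_a, max_b) - pmax(min_a, min_b)
  if (clamp) overlap <- pmax(overlap, 0)

  # Total range covered by both classes on each feature
  total <- pmax(max_a, max_b) - pmin(min_a, min_b)

  prod(overlap / total)
}

f2_sketch(x, y, clamp = FALSE)  # 0.004855226
f2_sketch(x, y, clamp = TRUE)   # 0
```

Without the clamp, the two petal features contribute negative ratios whose product is positive, giving the small non-zero value reported for `dcme::F2`. With the clamp, any feature that fully separates the classes drives the product to zero, which agrees with the `ECoL::overlapping` result.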

fdavidcl commented 5 years ago

Thanks for the clarifications!

Best regards, David