masspastore / overlapping

Estimation of Overlapping in Empirical Distributions.
GNU General Public License v3.0

Empirical overlap != Density Overlap ? #10

Open russellpierce opened 6 years ago

russellpierce commented 6 years ago

It might be that I am misunderstanding your package. My intuition was that a uniform distribution [0,1] should overlap with a uniform distribution [0.5, 1.5] by 50%. However...

result <- replicate(1000, {
  x <- list(X1 = runif(100000), X2 = runif(10000, min = .5, max=1.5))
  out <- overlap( x, plot = FALSE)
  out$OV # estimated overlapped areas
})
mean(result)
median(result)

.... shows a bias towards indicating a higher proportion of overlap than 50%. Looking at the graphical output and skimming the code, I gather that this is because you're slicing up the results of a call to density to find overlap. However, density applies a Gaussian kernel that can't adequately represent the sharp edges of the uniform distributions. After applying a density function with a kernel like that, to what extent can we still call the overlap 'empirical'?
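To put a rough number on that edge effect, here is a quick illustration (my own sketch, just for this discussion) that integrates the estimated density mass falling outside the true support [0, 1]:

# Rough sketch: how much KDE mass leaks outside the true support [0, 1]
set.seed(1)
kde <- density(runif(100000))
step <- diff(kde$x)[1]                    # spacing of the KDE evaluation grid
sum(kde$y[kde$x < 0 | kde$x > 1]) * step  # approximate probability mass outside [0, 1]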

masspastore commented 6 years ago

Thanks @russellpierce, I agree with you.

The problem stems from the density function; you can verify this with the following code. You will observe that the x values returned fall outside the natural limits 0-1:

set.seed(1)
density(runif(1000))

Consequently, this has an influence on the overlap estimation.

In order to fix this problem, I changed the overlap function to accept all additional arguments of the density function (for example, the choice of kernel) and, in particular, to allow defining the empirical bounds of the input variables. I think the estimate is much improved now, even though it is still an approximation. In my experience (working with psychological and social data), this should not be particularly problematic.
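For example, a call with the revised interface could look roughly like this (a sketch only; it assumes overlap() now accepts a boundaries list and forwards extra arguments such as kernel to density()):

library(overlapping)

set.seed(1)
x <- list(X1 = runif(10000), X2 = runif(10000, min = .5, max = 1.5))

# boundaries restrict the support used by density(); kernel is forwarded to
# density() as an additional argument (both assume the revised interface)
out <- overlap(x, boundaries = list(from = 0, to = 1.5), kernel = "rectangular")
out$OV  # hopefully much closer to the theoretical 50% overlap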

russellpierce commented 6 years ago

I agree that the problem comes from the density function and I commend your improvements.

I acknowledge that you've almost certainly thought about this more than I have. So, what I'm trying to do here is get to a place where I can feel 👍 about saying 'works as expected' and/or we both learn something. Thank you for your patience.

I'm still a bit stuck on whether this procedure reflects empirical overlap when the data have to go through a density function of any kind. From your description, what I'd expect would be a bit more like...

library(dplyr)
d <- list(X1 = runif(1000000), X2 = runif(100000, min = .5, max=1.5))
# as with your function, we need a set of bins to evaluate the result under
nbins <- 1000

# These are function generators, making an ECDF for each dataset
emperical_1 <- ecdf(d$X1)
emperical_2 <- ecdf(d$X2)

# Establishing the range of values to build bins over
min_val <- min(c(d$X1, d$X2))
max_val <- max(c(d$X1, d$X2))

# ecdf yields a cumulative mass
cumulative_mass_1 <- lapply(seq(min_val, max_val, length.out = nbins), emperical_1)
cumulative_mass_2 <- lapply(seq(min_val, max_val, length.out = nbins), emperical_2)

# The mass in any given bin is the difference between the cumulative at the end of this bin and the cumulative at the end of the previous bin.
bin_mass_1 <- unlist(cumulative_mass_1)-dplyr::lag(unlist(cumulative_mass_1))
bin_mass_2 <- unlist(cumulative_mass_2)-dplyr::lag(unlist(cumulative_mass_2))

# The overlap is the sum of the lower mass across all bin pairs
sum(pmin(bin_mass_1, bin_mass_2), na.rm = TRUE)

... but admittedly that skews low in overlap estimation. Can you describe a bit why the function you propose is better in practice / more intuitive than something like what I've sketched above?

masspastore commented 6 years ago

@russellpierce,
I agree with you that there are many ways to compute the overlap.

My algorithm could obviously be further improved or changed (I changed it after your previous observation). But my aim was to provide an easy-to-use function, primarily for scholars in Psychological Science (the field in which I work), who are usually not so confident with R or statistical algorithms. Moreover, in Psychological Science variables are usually discrete and bounded, and samples are generally not huge. Consequently, I think that my approach is a good compromise between the quality of the approximation and the simplicity of the algorithm.

In particular, I prefer to use the density() function because it is in the base stats package, rather than requiring additional packages to be loaded (such as, in your example, dplyr).

Example 1

Let's consider this example, a typical case in Psychological research: we have discrete scores of a test ranging from 0 to 30, obtained on a sample of 22 males and 59 females. In the following code we define and plot data:

exams <- data.frame(
  scores = c(6, 11, 12, 20, 5, 12, 28, 20, 25, 14, 19, 25, 21, 25, 18, 17, 21, 16, 19, 23, 15, 29, 14, 23, 14, 29, 5, 11, 13, 15, 27, 18, 28, 23, 14, 10, 24, 15, 26, 11, 20, 18, 14, 20, 29, 20, 0, 22, 22, 12, 15, 20, 19, 22, 21, 20, 11, 17, 10, 17, 15, 12, 26, 22, 27, 14, 29, 20, 23, 22, 16, 19, 16, 16, 21, 19, 26, 20, 7, 9, 20),
  gender = c('females', 'females', 'males', 'males', 'males', 'females', 'females', 'males', 'males', 'females', 'females', 'females', 'females', 'females', 'females', 'females', 'females', 'females', 'females', 'females', 'males', 'females', 'females', 'females', 'females', 'females', 'females', 'females', 'females', 'females', 'females', 'males', 'females', 'females', 'females', 'females', 'females', 'females', 'males', 'females', 'females', 'females', 'females', 'males', 'females', 'males', 'males', 'females', 'females', 'females', 'males', 'females', 'females', 'males', 'males', 'females', 'males', 'females', 'males', 'males', 'females', 'females', 'females', 'females', 'females', 'females', 'females', 'females', 'females', 'males', 'males', 'females', 'females', 'females', 'males', 'males', 'females', 'males', 'females', 'females', 'females')
)

library(ggplot2)
library(cowplot)

plot_grid(
  ggplot(exams, aes(scores,fill=gender))+geom_bar()+facet_wrap(~gender)+guides(fill=FALSE),
  ggplot(exams, aes(scores,fill=gender))+geom_density(alpha=.3)+guides(fill=FALSE)
)

[Figure 1: bar charts of the score distributions by gender (left) and the corresponding density curves (right)]

By using your code we obtain:

d <- list( X1 = exams$scores[exams$gender=="males"], 
           X2 = exams$scores[exams$gender=="females"])

# as with your function, we need a set of bins to evaluate the result under
nbins <- 1000

# These are function generators, making an ECDF for each dataset
emperical_1 <- ecdf(d$X1)
emperical_2 <- ecdf(d$X2)

# Establishing the range of values to build bins over
min_val <- min(c(d$X1, d$X2))
max_val <- max(c(d$X1, d$X2))

# ecdf yields a cumulative mass
cumulative_mass_1 <- lapply(seq(min_val, max_val, length.out = nbins), emperical_1)
cumulative_mass_2 <- lapply(seq(min_val, max_val, length.out = nbins), emperical_2)

# The mass in any given bin is the difference between the cumulative at the end of this bin and the cumulative at the end of the previous bin.
bin_mass_1 <- unlist(cumulative_mass_1)-dplyr::lag(unlist(cumulative_mass_1))
bin_mass_2 <- unlist(cumulative_mass_2)-dplyr::lag(unlist(cumulative_mass_2))

# The overlap is the sum of the lower mass across all bin pairs
sum(pmin(bin_mass_1, bin_mass_2), na.rm = TRUE)

[1] 0.5716487

Using the revised overlap() function we obtain:

overlap(d)$OV
    X1-X2 
0.6597147 

If we manually set the range of test scores (using boundaries, as implemented in the revised overlapping package), the result is:

boundList <- list( from = 0, to = 30 )
overlap(d, boundaries = boundList )$OV
    X1-X2 
0.6883329

Example 2

Let's consider another simple example, again with a small sample size.

set.seed( 1 )
x <- rnorm( 50 )
y <- rnorm( 50, 3 )
d <- data.frame( X1 = x, X2 = y )

dplot <- stack(d)
ggplot(dplot, aes(values,fill=ind))+geom_density(alpha=.3)

[Figure 2: overlaid density curves of the two samples X1 and X2]

We can see that the overlapped area is small, but not zero.
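(As a reference point, the population overlap of these two normals has a closed form: the densities cross at the midpoint 1.5, so the exact overlap is 2 * pnorm(-1.5), roughly 0.13; the sample-based estimates below will of course differ.)

# Closed-form population overlap for N(0,1) vs N(3,1):
# the densities cross at x = 1.5, so the overlap is 2 * P(Z > 1.5)
2 * pnorm(-1.5)   # ~0.134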

By using your code we obtain:

# These are function generators, making an ECDF for each dataset
emperical_1 <- ecdf(d$X1)
emperical_2 <- ecdf(d$X2)

# Establishing the range of values to build bins over
min_val <- min(c(d$X1, d$X2))
max_val <- max(c(d$X1, d$X2))

# ecdf yields a cumulative mass
cumulative_mass_1 <- lapply(seq(min_val, max_val, length.out = nbins), emperical_1)
cumulative_mass_2 <- lapply(seq(min_val, max_val, length.out = nbins), emperical_2)

# The mass in any given bin is the difference between the cumulative at the end of this bin and the cumulative at the end of the previous bin.
bin_mass_1 <- unlist(cumulative_mass_1)-dplyr::lag(unlist(cumulative_mass_1))
bin_mass_2 <- unlist(cumulative_mass_2)-dplyr::lag(unlist(cumulative_mass_2))

# The overlap is the sum of the lower mass across all bin pairs
sum(pmin(bin_mass_1, bin_mass_2), na.rm = TRUE)

[1] 0

Note that the result remains the same even when increasing the number of bins.

With the overlap() function:

overlap(d)$OV
     X1-X2 
0.06580136

That seems like a more reliable result.

In sum, I agree with you that the algorithm does not produce perfect estimates, and future improvements will certainly be directed at improving them, but your example represents a very specific case that does not necessarily apply to everyday data in social science and psychology. After these recent improvements, made thanks to your comments during this review process, I think that the overlap function represents a sufficiently accurate way of estimating overlap, in particular for the data that I (and other scholars in the psychological field) deal with (e.g., samples that are not particularly large, and discrete, bounded variables).

russellpierce commented 6 years ago

I think there are some problems with your responses which I address in "asides" below. That being said, there is nothing down there alone that should stand in the way of your submission. I think my remaining concerns can be mostly resolved by clarifying the language around your package. Specifically, when you say "empirical distributions" I think you really mean "kernel density estimations from empirical data". If you consistently make that distinction then I should be in a better place to sign off on the function doing what it is advertised to do.

Some examples of things that need modification:

  1. This package is not "a R package for estimating the overlapping area of two or more empirical distributions". Instead it is "an R package for estimating the overlapping area of two or more kernel density estimations from empirical data". [Small English grammar note, use 'an' before R because R is pronounced with an initial vowel sound - what a weird language]
  2. Your paper and function descriptions must be updated throughout to reflect that distinction. For example, replace "In practice we estimate the degree of similarity by exploring the overlapping between group scores" with "In practice we estimate the degree of overlap between groups as the overlap between their kernel density estimates".
  3. Drop or adjust Figure 1. The discrete nature of the histogram in relation to the provided data implies a technique that you aren't using.
  4. Note that your nbins parameter doesn't do what it claims, because it is passed forward to density, which then (for values greater than 512) rounds it to a power of 2. For this, I'd recommend bringing forward some of the density help text to adequately explain what your function is doing.
  5. In the github readme "a data frame with informations used for" -> "a data frame with information used for". In English information doesn't need the s.
  6. Update README and function documentation on github to reflect the Kernel Density Estimation vs Empirical distinction.

Asides

I hope you recognize that your Example 2 embeds the answer in the question. Viz., you argue for non-zero overlap because of a visualization that is itself the result of a density function. The results you see there are a consequence of the binning methodology. Increasing the number of bins can only decrease overlap in both measures. Decreasing the number of bins can only increase overlap in both measures. When increasing bins, your method does not decrease to zero because the bandwidth selected by the density function is another form of binning.
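To make the bin-count effect concrete, here is a self-contained sketch (reusing the ECDF-binning idea from my earlier code, with the data from your Example 2):

# Self-contained sketch: the ECDF-binning overlap shrinks as bins get finer
bin_overlap <- function(x1, x2, nbins) {
  grid <- seq(min(c(x1, x2)), max(c(x1, x2)), length.out = nbins)
  mass1 <- diff(ecdf(x1)(grid))   # empirical mass in each bin, sample 1
  mass2 <- diff(ecdf(x2)(grid))   # empirical mass in each bin, sample 2
  sum(pmin(mass1, mass2))
}

set.seed(1)
x <- rnorm(50)
y <- rnorm(50, 3)
sapply(c(5, 20, 100, 1000), function(nb) bin_overlap(x, y, nb))
# the estimate falls from a visibly non-zero value toward 0 as nbins grows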

Your argument against the algorithm I put forth, on the basis of its use of dplyr, is rather beside the point, because dplyr::lag() is far less complex than density(). However, I'll grant that my raising of an alternative algorithm was beside the point too. So, I'll skip discussing that approach further.

My PhD is in Cognitive Psychology and my dissertation work used experimental methods with reasonably small sample sizes. It feels to me like you're doing psychological science a disservice with your characterization of the discipline as being stats/tech unsavvy. Some very advanced R packages and other statistical tools have come from our community.

soodoku commented 6 years ago

@russellpierce: appreciate your thoughtfulness and care in reviewing. Nicely done! cc @arfon

masspastore commented 6 years ago

Thanks @russellpierce,

I have updated everything following your suggestions. I hope it's fine now.

soodoku commented 5 years ago

hey @masspastore: I have signed off for JOSS. But I think it may be useful to highlight in the README that the results being produced are approximate, and also to note any systematic issues. I would recommend just using the examples here to illustrate the issues. The problem is that a bunch of people will just use the package without being aware of the issues if you don't put it in a prominent place.

masspastore commented 5 years ago

@soodoku: I updated the readme file by adding a note about this issue. Thanks.