New feature: Voronoi algorithm for labeling groups of data points

slowkow / ggrepel

:round_pushpin: Repel overlapping text labels away from each other in your ggplot2 figures.

https://ggrepel.slowkow.com

GNU General Public License v3.0

New feature: Voronoi algorithm for labeling groups of data points #127

Closed eliocamp closed 5 years ago

eliocamp commented 5 years ago

Following up from this issue, would you be open to a PR that implements this algorithm for automatically placing labels?

slowkow commented 5 years ago

Elio, yes! I would be delighted to review your pull request. And thank you for asking for input from others. I really appreciate that.

I also thought about Voronoi tessellations and ggrepel in a previous issue: https://github.com/slowkow/ggrepel/issues/48#issuecomment-405772908

As you have already shown, the deldir package implements exactly the functionality we need. I think it is safe to add deldir as a dependency for ggrepel, because it seems to be a well-maintained package since 2002-02-24.

For your pull request, you might consider how to best take advantage of the Voronoi tessellation to enhance the experience for all ggrepel users. Some thoughts:

ggrepel currently iterates through a physical simulation to position the labels, and the initial positions of text labels are the positions of the data points (with a tiny bit of random jitter). The number of iterations in the physical simulation might be reduced if we start with positions from the Voronoi tessellation instead. Sometimes, we might be able to skip the physical simulation entirely. I don't know how this will impact performance, but I am excited to find out if you want to try!
Perhaps ggrepel can try to be smarter when a user tries to label too many points. Right now, we get "hedgehogs" or "hairballs" when the user tries to label hundreds of points. I think users might prefer ggrepel to only create labels for the low-density points by default, instead of labeling everything. This is what we discussed in #48. If the low-density feature is implemented, then we might adjust some parameter to return a hairball only if desired -- something like max.labels = 10 by default and max.labels = NA to get a hairball?
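The low-density idea can be tried today as a label filter, without any change to ggrepel. The sketch below is only an illustration. label_isolated() is a hypothetical helper, and its max.labels argument simply mirrors the parameter name proposed above; it is not an existing ggrepel argument. It uses the distance to the nearest neighbour as a crude stand-in for local density (no Voronoi tessellation involved) and assumes that both x and y vary.

```r
# Hypothetical helper: keep labels only for the most isolated points.
# Isolation is measured by nearest-neighbour distance (an assumption made
# for this sketch, not how ggrepel works internally).
label_isolated <- function(x, y, labels, max.labels = 10) {
  labels <- as.character(labels)
  # max.labels = NA means "label everything" (the hairball)
  if (is.na(max.labels) || max.labels >= length(labels)) return(labels)

  # Put both axes on a common [0, 1] scale so neither dominates the distance
  rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
  xy <- cbind(rescale01(x), rescale01(y))

  d <- as.matrix(dist(xy))
  diag(d) <- Inf                       # ignore the distance of a point to itself
  nn <- unname(apply(d, 1, min))       # nearest-neighbour distance per point

  # Keep the max.labels points with the largest nearest-neighbour distance
  keep <- rank(-nn, ties.method = "first") <= max.labels
  ifelse(keep, labels, "")
}

# Possible usage with ggrepel (points with label "" still repel the others):
# ggplot(mtcars, aes(wt, mpg)) +
#   geom_point() +
#   geom_text_repel(aes(label = label_isolated(wt, mpg, rownames(mtcars), 10)))
```

Returning "" rather than dropping rows keeps the unlabeled points in the data, so they can still push labels away.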

I'm very excited to see what you come up with! Good luck.

eliocamp commented 5 years ago

Yey!

I was thinking that in this case there's no need for the physical model. Just moving away from the datapoint and towards the voronoi centroid seems smart enough.

The original algo only labels one point per group. It seems that you would like to extend it to labeling multiple points.

I don't know where to implement it, though. Passing it as position is elegant for filtering which points to label, but then doesn't lend itself well to the rest of the algorithm. Maybe it makes sense to create a whole different geom? Using geom_text_repel() seems a bit weird because it's a totally different algorithm that doesn't rely on the "repel" part.

slowkow commented 5 years ago

Sorry, I mentioned too many things at once in my comment.

Let me take a step back and focus on the one feature you mentioned. Is this what you're after?

Problem: The user has lots of points colored by group, and they want to label each group directly on the plotting area without overlapping any data points.

Solution: Something similar to the Voronoi code that you posted in https://github.com/tidyverse/ggplot2/issues/3093.

Does directlabels already solve this exact problem? Do we need another solution in ggrepel? I haven't tried directlabels, so I don't know.

I don't know if we need a new geom or not. Feel free to try anything you want.

My understanding is that your position_voronoi() is taking a dataframe with n rows and returning a new dataframe with g rows, one row for each group. The problem with this approach is that the n data points do not repel the text labels.

Instead, what if we return a new dataframe with n + g rows?

The new dataframe would have n rows corresponding to real data points. These points would be unlabeled, but they would repel the group text labels.
The additional g rows would have the names of the groups as labels. They would not repel the text labels (they should have size 0).

I tried to do this in the code below, but I don't think it is working as expected. I want the g points to have size 0 and the rest of the points to have size > 0, and I can't do that without additional modifications inside ggrepel.

I'm also currently working on a new branch called point.size that allows the user to specify the size of each data point. (Right now all points are assumed to be equal size.) Maybe this new feature will help with our goal in this issue?

Here's my unsuccessful attempt:

library(ggrepel)
#> Loading required package: ggplot2
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

set.seed(1231)
d <- data.frame(
  x = rnorm(30), y = rnorm(30), group = letters[1:3], stringsAsFactors = FALSE
)
d$group <- factor(d$group, levels = c("a", "b", "c", ""))

d_mean <- d %>%
  dplyr::group_by(group) %>%
  dplyr::summarise(
    x = mean(x),
    y = mean(y)
  )

d_text <- rbind(
  d %>% mutate(group = ""),
  d_mean
)
d_text$group <- factor(d_text$group, levels = c("a", "b", "c", ""))

ggplot() +
  geom_point(
    data = d,
    mapping = aes(x, y, color = group)
  ) +
  geom_label_repel(
    data = d_text,
    mapping = aes(x, y, color = group, label = group),
    size = 6,
    max.iter = 1e5,
    min.segment.length = 0,
    # box.padding = unit(1, "lines"),
    # point.padding = unit(1, "lines"),
    seed = 1231
  ) +
  theme_gray(base_size = 20)
#> Warning: Removed 1 rows containing missing values (geom_label_repel).

Created on 2019-01-21 by the reprex package (v0.2.1)

eliocamp commented 5 years ago

The original algorithm was envisioned for the first situation you mention. directlabels is another candidate for where to put it (it would be another positioning method). The problem there is that it doesn't check for overlap (as far as I know).

I guess the issue is that ggrepel is meant for labeling datapoints but the algo is for labeling groups, which is closer to what directlabels tries to do. If you think directlabels is a better fit I'll go knocking on the door there :rofl: .

The point.size idea seems unrelated, but really cool. It would be neat to have "phantom" points that repel labels but are not labeled themselves.

slowkow commented 5 years ago

The "phantom points" feature already works, just set the label to "". Here is an example in the vignette:

https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html#hide-some-of-the-labels

Here's what I get with directlabels:

library(ggplot2)
library(directlabels)

ggplot(iris, aes(Petal.Length, Sepal.Length)) +
  geom_point(aes(color = Species)) +
  geom_dl(aes(color = Species, label = Species), method = "smart.grid")


set.seed(1231)
d <- data.frame(
  x = rnorm(30), y = rnorm(30), group = letters[1:3], stringsAsFactors = FALSE
)

ggplot(d, aes(x, y, color = group)) +
  geom_point() +
  geom_dl(aes(label = group), method = "smart.grid")

Created on 2019-01-21 by the reprex package (v0.2.1)

> The problem there is that it doesn't check for overlap (as far as I know).

This is not true. Here's what I get when I resize:

[Image: directlabels2]

slowkow commented 5 years ago

In summary...

I still think that taking advantage of Voronoi tessellation can potentially improve how ggrepel works. However, my ideas on this topic seem to be unrelated to your original goal.
directlabels seems to do exactly what you want. Check out the extensive list of examples. If you want to reimplement similar behavior in ggrepel, please feel free -- it's always nice to have options.

I'm actually impressed with directlabels now that I have tried it and looked at some examples. The interface and goals are a bit different than ggrepel, but it has a great range of features that ggrepel does not cover.

eliocamp commented 5 years ago

Ah, cool. It seems that overlap checking is baked into smart.grid. I actually tested using position_voronoi() as a positioning method and it overlapped with the points. smart.grid doesn't use the same algorithm, so it may still be useful to add it. But now I see that directlabels is a better fit, so I'll look into that.

The phantom points example gave me the idea of applying the voronoi algorithm as a filter to remove labels. Something like this.

library(ggplot2)
library(ggrepel)

set.seed(42)

is.voronoi_max <- function(x, y, group) {
  x <- scales::rescale(x, c(0, 1))
  y <- scales::rescale(y, c(0, 1))
  rw <-  c(range(x), range(y))
  del <- deldir::deldir(x, y, z = group, rw = rw,
                        suppressMsge = TRUE)
  del_summ <- del$summary
  del_summ$index <- seq_len(nrow(del_summ))

  index <- vapply(split(del_summ, del_summ$z), 
                  function(x) x$index[which.max(x$dir.area)], 1)
  seq_len(length(x)) %in% index
}

ggplot(mtcars, aes(wt, mpg, color = factor(gear))) +
  geom_point() +
  geom_text_repel(aes(label = ifelse(is.voronoi_max(wt, mpg, gear), gear, "")))

Created on 2019-01-21 by the reprex package (v0.2.1.9000)
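For anyone who wants to experiment with the same "largest tile per group" filter without installing deldir, the tile areas can be approximated on a regular grid. This is only a rough sketch under stated assumptions: approx_voronoi_max() and its n_grid argument are hypothetical names, the window is the unit square after rescaling (as in the rw argument above), and each tile's area is estimated by counting the grid nodes closest to that point. It is not the exact tessellation that deldir computes.

```r
# Hypothetical base R approximation of is.voronoi_max(): for each group,
# flag the point whose (approximate) Voronoi tile is largest.
approx_voronoi_max <- function(x, y, group, n_grid = 100) {
  # Rescale both axes to [0, 1], as the deldir version does
  rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
  x <- rescale01(x)
  y <- rescale01(y)

  # Regular grid of nodes covering the unit square
  nodes <- seq(0, 1, length.out = n_grid)
  grid <- expand.grid(gx = nodes, gy = nodes)

  # Squared distance from every grid node (rows) to every data point (columns)
  d2 <- outer(grid$gx, x, "-")^2 + outer(grid$gy, y, "-")^2

  # Each node belongs to its nearest data point; node counts approximate areas
  owner <- max.col(-d2, ties.method = "first")
  area <- tabulate(owner, nbins = length(x))

  # Within each group, pick the index of the point with the largest area
  idx <- vapply(
    split(seq_along(x), group, drop = TRUE),
    function(i) i[which.max(area[i])],
    integer(1)
  )
  seq_along(x) %in% idx
}

# Possible usage, mirroring the example above:
# ggplot(mtcars, aes(wt, mpg, color = factor(gear))) +
#   geom_point() +
#   geom_text_repel(aes(label = ifelse(approx_voronoi_max(wt, mpg, gear), gear, "")))
```

The distance matrix has n_grid^2 rows and one column per data point, so this is only practical for modest data sizes; a coarser grid trades accuracy for memory.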

slowkow commented 5 years ago

I have changed the title of this issue to:

New feature: Voronoi algorithm for labeling groups of data points

You never said it explicitly, but I guess this is your goal. Is that right? If not, then please change the title to clearly explain the purpose of this discussion.

Elio, thank you for sharing is.voronoi_max()! It works surprisingly well and I like it.

However, I noticed I can get a similar result without deldir.

I'd like to hear your thoughts on these questions:

Is there a case where deldir works and this code does not?
Do you think it is worthwhile to add a dependency on deldir?
Do users need an additional function for labeling groups of data points?

ggrepel without deldir

library(ggrepel)
#> Loading required package: ggplot2

seen <- function(x) {
  o <- order(x)
  duplicated(x[o])[order(o)]
}

ggplot(mtcars, aes(wt, mpg, color = factor(gear))) +
  geom_point() +
  geom_text_repel(aes(label = ifelse(!seen(gear), gear, "")))


ggplot(iris, aes(Petal.Length, Sepal.Length, color = Species)) +
  geom_point() +
  geom_text_repel(aes(
    label = ifelse(!seen(Species), as.character(Species), "")
  ))

Created on 2019-01-24 by the reprex package (v0.2.1)
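To make the behaviour of seen() concrete, here is a small base R check on a toy vector (the vector x is made up for illustration). seen() returns FALSE for the first occurrence of each value and TRUE for every later one, so !seen(group) selects the first row of each group in data order. Because order() leaves ties in their original order, the result should match base R's duplicated().

```r
seen <- function(x) {
  o <- order(x)                 # positions that sort x
  duplicated(x[o])[order(o)]    # flag repeats in sorted order, then undo the sort
}

x <- c("b", "a", "b", "a", "c")

seen(x)
#> [1] FALSE FALSE  TRUE  TRUE FALSE

# Same answer as duplicated() on the unsorted vector
identical(seen(x), duplicated(x))
#> [1] TRUE
```

So the label lands on whichever point happens to come first in the data frame for each group, regardless of where that point sits in the plot.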

eliocamp commented 5 years ago

That's a much better title.

I don't understand exactly what seen() does. Does it label the first point from each group? It will probably work as well as selecting a random point, but it seems suboptimal when the first points are very close or identical.

library(ggplot2)
library(ggrepel)

seen <- function(x) {
  o <- order(x)
  duplicated(x[o])[order(o)]
}

is.voronoi_max <- function(x, y, group) {
  x <- scales::rescale(x, c(0, 1))
  y <- scales::rescale(y, c(0, 1))
  rw <-  c(range(x), range(y))
  del <- deldir::deldir(x, y, z = group, rw =  rw, 
                        suppressMsge = TRUE)
  del_summ <- del$summary
  del_summ$index <- seq_len(nrow(del_summ))

  index <- vapply(split(del_summ, del_summ$z), 
                  function(x) x$index[which.max(x$dir.area)], 1)
  seq_len(length(x)) %in% index
}

set.seed(42)
N <- 50
t <- 1:N
df <- data.frame(t = t, 
                 y = c(cumsum(rnorm(N))), 
                 x = c(cumsum(rnorm(N))))
df <- reshape2::melt(df, "t")

ggplot(df, aes(t, value)) +
  geom_point(aes(color = variable)) +
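  # as.character() is needed because ifelse() would otherwise return the
  # integer codes of the factor instead of its labels
  geom_text_repel(aes(color = variable,
                      label = ifelse(!seen(variable), as.character(variable), ""))) +
  geom_label_repel(aes(color = variable, 
                      label = ifelse(is.voronoi_max(t, value, variable),
                                     as.character(variable), "")))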

Created on 2019-01-25 by the reprex package (v0.2.1.9000)

In light of the discussion of directlabels, I'm convinced that labeling groups of data points is not necessarily the role of ggrepel. I believe a better framing would be how to filter which datapoints to label, like the low density points issue you linked to.

jianshu93 commented 11 months ago

Hello all, Thanks for the useful examples. How do I add an indicator line that links the center (or densest region) of a group to the group label, as ggrepel does? In many real-world cases there are more than 20 groups, and without such a line it may not be clear which label belongs to which group.

Thanks,

Jianshu

slowkow commented 11 months ago

@jianshu93

Hi Jianshu,

Consider using two data frames, the first one for the points and the second one for the centroids.

You can compute the centroids however you want (mean, median, etc.).

Here is an example:


library(ggplot2)
library(ggrepel)  # provides geom_text_repel()
library(magrittr)
library(dplyr)

df1 <- mtcars
df2 <- mtcars %>% group_by(gear) %>% summarize(mpg = mean(mpg), wt = mean(wt))

ggplot(df1) +
  aes(wt, mpg, color = factor(gear)) +
  geom_point(size = 2) +
  geom_text_repel(data = df2, aes(label = gear), size = 8, min.segment.length = 0, nudge_x = 1) +
  theme_gray(base_size = 20)

jianshu93 commented 11 months ago

Hello @slowkow ,

Thank you for the quick response. I have some example to share and perhaps we can determine which is the best way to label complicated data:

library(ggplot2)
library(RColorBrewer)
library(dplyr)         # group_by(), summarize(), %>%
library(ggrepel)       # geom_text_repel()
library(directlabels)  # geom_dl()

is.voronoi_max <- function(x, y, group) {
  x <- scales::rescale(x, c(0, 1))
  y <- scales::rescale(y, c(0, 1))
  rw <- c(range(x), range(y))
  del <- deldir::deldir(x, y, z = group, rw = rw,
                        suppressMsge = TRUE)
  del_summ <- del$summary
  del_summ$index <- seq_len(nrow(del_summ))

  index <- vapply(split(del_summ, del_summ$z),
                  function(x) x$index[which.max(x$dir.area)], 1)
  seq_len(length(x)) %in% index
}

# C elegan annembed

embed_elegan <- read.table("C_elegan_embedded.csv", sep = ",", header = TRUE)
head(embed_elegan)

good.shapes = c(1:25, 33:127)
embed_elegan$embryo.time.new = as.factor(embed_elegan$embryo.time.new)

# One row per cell type, located at the per-group median
df2 <- embed_elegan %>%
  group_by(cell.type) %>%
  summarize(annembed_1 = median(annembed_1), annembed_2 = median(annembed_2))

a = ggplot(data = embed_elegan, aes(x = annembed_1, y = annembed_2, shape = cell.type)) +
  geom_point(aes(color = embryo.time.new, shape = cell.type), size = 0.005, alpha = 0.5) +
  scale_color_brewer(palette = "Paired") +
  scale_shape_manual(values = good.shapes[1:37])

# geom_dl() example

a + theme_bw() + xlab("annembed_1") + ylab("annembed_2") +
  theme(legend.position = "none") +
  geom_dl(aes(label = cell.type), method = "smart.grid")

# use your example

a + geom_text_repel(data = df2, aes(label = cell.type), size = 3,
                    min.segment.length = 0, nudge_x = 1) +
  theme_bw() + xlab("annembed_1") + ylab("annembed_2") +
  theme(legend.position = "none")

# use the voronoi

a + geom_text_repel(aes(label = ifelse(is.voronoi_max(annembed_1, annembed_2, cell.type),
                                       cell.type, "")),
                    max.overlaps = Inf) +
  theme_bw() + xlab("annembed_1") + ylab("annembed_2") +
  theme(legend.position = "none")

The geom_dl() example looks good but has no indicator line. The Voronoi approach has problems when one category of data (in my case, the shape) is not well clustered and appears in several places. Your example seems to be perfect (I use the median).

Thanks,

Jianshu

voronoi_example.pdf your_example.pdf

geom_dl_example.pdf

C_elegan_embedded.csv.zip

jianshu93 commented 11 months ago

I am also thinking about using the centroid of a 75% confidence ellipse (considering both x and y) as the label position, which could also be useful.

Thanks,

Jianshu
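One observation that may help with the ellipse idea: for an ellipse derived from a bivariate normal fit, the confidence level changes the size of the ellipse but not its center, so the centroid of a 75% ellipse is simply the per-group mean of x and y. The sketch below assumes that normal-theory case; ellipse_center() is a hypothetical helper name, and a robust fit (for example a multivariate t) would generally give a different center.

```r
# Hypothetical helper: center of a normal-theory data ellipse.
# The center does not depend on the confidence level (75%, 95%, ...).
ellipse_center <- function(x, y) {
  fit <- stats::cov.wt(cbind(x = x, y = y))  # mean vector and covariance matrix
  fit$center
}

# One center per group, here for mtcars grouped by gear
centers <- do.call(
  rbind,
  lapply(split(mtcars, mtcars$gear), function(g) ellipse_center(g$wt, g$mpg))
)
centers  # matrix with one row per gear and columns x (wt) and y (mpg)
```

These centers could then be passed to geom_text_repel() as a second data frame, in the same way as the mean-based df2 example above.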

aphalo commented 11 months ago

The precomputation can be avoided by using stat_centroid() from the package 'ggpp'. Here I use the default, mean(). It works similarly to stat_summary(), but applies the function to both x and y in parallel, by group.

library(ggplot2)
library(ggpp)
#> Registered S3 methods overwritten by 'ggpp':
#>   method                  from   
#>   heightDetails.titleGrob ggplot2
#>   widthDetails.titleGrob  ggplot2
#> 
#> Attaching package: 'ggpp'
#> The following object is masked from 'package:ggplot2':
#> 
#>     annotate
library(ggrepel)

df1 <- mtcars

# with repulsion
ggplot(df1) +
  aes(wt, mpg, color = factor(gear), label = gear) +
  geom_point(size = 2) +
  stat_centroid(geom = "text_repel", position = position_nudge_keep(x = 1), 
                size = 8, min.segment.length = 0) +
  theme_gray(base_size = 20)


# no repulsion
ggplot(df1) +
  aes(wt, mpg, color = factor(gear), label = gear) +
  geom_point(size = 2) +
  stat_centroid(geom = "text_s", position = position_nudge_keep(x = 1), 
                size = 8, min.segment.length = 0) +
  theme_gray(base_size = 20)

Created on 2023-11-09 with reprex v2.0.2