Closed eliocamp closed 5 years ago
Elio, yes! I would be delighted to review your pull request. And thank you for asking for input from others. I really appreciate that.
I also thought about Voronoi tessellations and ggrepel in a previous issue: https://github.com/slowkow/ggrepel/issues/48#issuecomment-405772908
As you have already shown, the deldir package implements exactly the functionality we need. I think it is safe to add deldir as a dependency for ggrepel, because it appears to have been well maintained since 2002-02-24.
For your pull request, you might consider how to best take advantage of the Voronoi tessellation to enhance the experience for all ggrepel users. Some thoughts:
ggrepel currently iterates through a physical simulation to position the labels, and the initial positions of text labels are the positions of the data points (with a tiny bit of random jitter).
The number of iterations in the physical simulation might be reduced if we start with positions from the Voronoi tessellation instead. Sometimes, we might be able to skip the physical simulation entirely. I don't know how this will impact performance, but I am excited to find out if you want to try!
Perhaps ggrepel can try to be smarter when a user tries to label too many points. Right now, we get "hedgehogs" or "hairballs" when the user tries to label hundreds of points.
I think users might prefer ggrepel to only create labels for the low-density points by default, instead of labeling everything. This is what we discussed in #48.
If the low-density feature is implemented, then we might adjust some parameter to return a hairball only if desired -- something like max.labels = 10 by default and max.labels = NA to get a hairball?
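To make the low-density idea concrete, here is a minimal sketch in base R. It is not ggrepel code: keep_sparse_labels() is a hypothetical helper, its max.labels argument only mimics the parameter proposed above, and "low density" is approximated by the distance to the nearest neighbour. It also assumes that x and y each take more than one distinct value.

```r
# Hypothetical helper (not part of ggrepel): keep labels only for the
# `max.labels` most isolated points, blank out all the others.
keep_sparse_labels <- function(x, y, label, max.labels = 10) {
  label <- as.character(label)
  # max.labels = NA means "label everything" (the hairball)
  if (is.na(max.labels) || max.labels >= length(label)) {
    return(label)
  }
  # Rescale both axes to [0, 1] so they contribute equally to distances
  sx <- (x - min(x)) / diff(range(x))
  sy <- (y - min(y)) / diff(range(y))
  d <- as.matrix(dist(cbind(sx, sy)))
  diag(d) <- Inf
  # Distance from each point to its nearest neighbour
  nearest <- apply(d, 1, min)
  keep <- order(nearest, decreasing = TRUE)[seq_len(max.labels)]
  label[!(seq_along(label) %in% keep)] <- ""
  label
}

labs <- keep_sparse_labels(mtcars$wt, mtcars$mpg, rownames(mtcars), max.labels = 5)
sum(labs != "")
```

The returned vector could be mapped to the label aesthetic of geom_text_repel(), so that the blanked points stay in the data while only the isolated ones get text.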
I'm very excited to see what you come up with! Good luck.
Yey!
I was thinking that in this case there's no need for the physical model. Just moving away from the data point and towards the Voronoi centroid seems smart enough.
The original algo only labels one point per group. It seems that you would like to extend it to labeling multiple points.
I don't know where to implement it, though. Passing it as position is elegant for filtering which points to label, but then doesn't lend itself well to the rest of the algorithm. Maybe it makes sense to create a whole different geom? Using geom_text_repel() seems a bit weird because it's a totally different algorithm that doesn't rely on the "repel" part.
Sorry, I mentioned too many things at once in my comment.
Let me take a step back and focus on the one feature you mentioned. Is this what you're after?
Problem: The user has lots of points colored by group, and they want to label each group directly on the plotting area without overlapping any data points.
Solution: Something similar to the Voronoi code that you posted in https://github.com/tidyverse/ggplot2/issues/3093.
Does directlabels already solve this exact problem? Do we need another solution in ggrepel? I haven't tried directlabels, so I don't know.
I don't know if we need a new geom or not. Feel free to try anything you want.
My understanding is that your position_voronoi() is taking a dataframe with n rows and returning a new dataframe with g rows, one row for each group. The problem with this approach is that the n data points do not repel the text labels.
Instead, what if we return a new dataframe with n + g rows?
The new dataframe would have n rows corresponding to real data points. These points would be unlabeled, but they would repel the group text labels.
The additional g rows would have the names of the groups as labels. They would not repel the text labels (they should have size 0).
I tried to do this in the code below, but I don't think it is working as expected. I want the g points to have size 0 and the rest of the points to have size > 0, and I can't do that without additional modifications inside ggrepel.
I'm also currently working on a new branch called point.size that allows the user to specify the size of each data point. (Right now all points are assumed to be equal size.) Maybe this new feature will help with our goal in this issue?
Here's my unsuccessful attempt:
library(ggrepel)
#> Loading required package: ggplot2
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
set.seed(1231)
d <- data.frame(
  x = rnorm(30), y = rnorm(30), group = letters[1:3], stringsAsFactors = FALSE
)
d$group <- factor(d$group, levels = c("a", "b", "c", ""))
d_mean <- d %>%
  dplyr::group_by(group) %>%
  dplyr::summarise(
    x = mean(x),
    y = mean(y)
  )
d_text <- rbind(
  d %>% mutate(group = ""),
  d_mean
)
d_text$group <- factor(d_text$group, levels = c("a", "b", "c", ""))
ggplot() +
  geom_point(
    data = d,
    mapping = aes(x, y, color = group)
  ) +
  geom_label_repel(
    data = d_text,
    mapping = aes(x, y, color = group, label = group),
    size = 6,
    max.iter = 1e5,
    min.segment.length = 0,
    # box.padding = unit(1, "lines"),
    # point.padding = unit(1, "lines"),
    seed = 1231
  ) +
  theme_gray(base_size = 20)
#> Warning: Removed 1 rows containing missing values (geom_label_repel).
Created on 2019-01-21 by the reprex package (v0.2.1)
The original algorithm was envisioned for the first situation you mention. directlabels is another candidate for where to put it (it would be another positioning method). The problem there is that it doesn't check for overlap (as far as I know).
I guess the issue is that ggrepel is meant for labeling data points but the algo is for labeling groups, which is closer to what directlabels tries to do. If you think directlabels is a better fit, I'll go knocking on the door there :rofl:.
The point.size idea seems unrelated, but really cool. It would be neat to have "phantom" points that repel labels but are not labeled themselves.
The "phantom points" feature already works: just set the label to "". Here is an example in the vignette:
https://cran.r-project.org/web/packages/ggrepel/vignettes/ggrepel.html#hide-some-of-the-labels
Here's what I get with directlabels:
library(ggplot2)
library(directlabels)
ggplot(iris, aes(Petal.Length, Sepal.Length)) +
  geom_point(aes(color = Species)) +
  geom_dl(aes(color = Species, label = Species), method = "smart.grid")
set.seed(1231)
d <- data.frame(
  x = rnorm(30), y = rnorm(30), group = letters[1:3], stringsAsFactors = FALSE
)
ggplot(d, aes(x, y, color = group)) +
  geom_point() +
  geom_dl(aes(label = group), method = "smart.grid")
Created on 2019-01-21 by the reprex package (v0.2.1)
> The problem there is that it doesn't check for overlap (as far as I know).
This is not true. Here's what I get when I resize:
In summary...
I still think that taking advantage of Voronoi tessellation can potentially improve how ggrepel works. However, my ideas on this topic seem to be unrelated to your original goal.
directlabels seems to do exactly what you want. Check out the extensive list of examples. If you want to reimplement similar behavior in ggrepel, please feel free -- it's always nice to have options.
I'm actually impressed with directlabels now that I have tried it and looked at some examples. The interface and goals are a bit different than ggrepel, but it has a great range of features that ggrepel does not cover.
Ah, cool. It seems that overlap checking is baked into smart.grid. I actually tested using position_voronoi() as a positioning method and it overlapped with the points. smart.grid doesn't use the same algorithm, so it may still be useful to add it. But now I see that directlabels is a better fit, so I'll look into that.
The phantom points example gave me the idea of applying the Voronoi algorithm as a filter to remove labels. Something like this:
library(ggplot2)
library(ggrepel)
set.seed(42)
is.voronoi_max <- function(x, y, group) {
  x <- scales::rescale(x, c(0, 1))
  y <- scales::rescale(y, c(0, 1))
  rw <- c(range(x), range(y))
  del <- deldir::deldir(x, y, z = group, rw = rw,
                        suppressMsge = TRUE)
  del_summ <- del$summary
  del_summ$index <- seq_len(nrow(del_summ))
  index <- vapply(split(del_summ, del_summ$z),
                  function(x) x$index[which.max(x$dir.area)], 1)
  seq_len(length(x)) %in% index
}
ggplot(mtcars, aes(wt, mpg, color = factor(gear))) +
  geom_point() +
  geom_text_repel(aes(label = ifelse(is.voronoi_max(wt, mpg, gear), gear, "")))
Created on 2019-01-21 by the reprex package (v0.2.1.9000)
I have changed the title of this issue to:
New feature: Voronoi algorithm for labeling groups of data points
You never said it explicitly, but I guess this is your goal. Is that right? If not, then please change the title to clearly explain the purpose of this discussion.
Elio, thank you for sharing is.voronoi_max()! It works surprisingly well and I like it.
However, I noticed I can get a similar result without deldir.
I'd like to hear your thoughts on these questions:
library(ggrepel)
#> Loading required package: ggplot2
seen <- function(x) {
  o <- order(x)
  duplicated(x[o])[order(o)]
}
ggplot(mtcars, aes(wt, mpg, color = factor(gear))) +
  geom_point() +
  geom_text_repel(aes(label = ifelse(!seen(gear), gear, "")))
ggplot(iris, aes(Petal.Length, Sepal.Length, color = Species)) +
  geom_point() +
  geom_text_repel(aes(
    label = ifelse(!seen(Species), as.character(Species), "")
  ))
Created on 2019-01-24 by the reprex package (v0.2.1)
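To make the behavior of seen() concrete, here is a small check on a toy vector (the vector g is made up for illustration). As far as I can tell, seen() flags every element whose value has already occurred earlier, so !seen() is TRUE only at the first occurrence of each group. On this input it agrees with base R's duplicated().

```r
seen <- function(x) {
  o <- order(x)
  # Flag repeats in sorted order, then map the flags back to input order
  duplicated(x[o])[order(o)]
}

g <- c("b", "a", "b", "c", "a")

# TRUE marks the first occurrence of each group, i.e. the point that gets a label
first_of_group <- !seen(g)
first_of_group

# On this input, seen() gives the same answer as duplicated()
same_as_duplicated <- identical(seen(g), duplicated(g))
same_as_duplicated
```

In other words, the labeled point depends on row order, not on where the point sits in the plot.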
That's a much better title.
I don't understand exactly what seen() does. Does it label the first point from each group?
It will probably work as well as selecting a random point. It seems suboptimal when the first points are very close or identical.
library(ggplot2)
library(ggrepel)
seen <- function(x) {
  o <- order(x)
  duplicated(x[o])[order(o)]
}
is.voronoi_max <- function(x, y, group) {
  x <- scales::rescale(x, c(0, 1))
  y <- scales::rescale(y, c(0, 1))
  rw <- c(range(x), range(y))
  del <- deldir::deldir(x, y, z = group, rw = rw,
                        suppressMsge = TRUE)
  del_summ <- del$summary
  del_summ$index <- seq_len(nrow(del_summ))
  index <- vapply(split(del_summ, del_summ$z),
                  function(x) x$index[which.max(x$dir.area)], 1)
  seq_len(length(x)) %in% index
}
set.seed(42)
N <- 50
t <- 1:N
df <- data.frame(t = t,
                 y = c(cumsum(rnorm(N))),
                 x = c(cumsum(rnorm(N))))
df <- reshape2::melt(df, "t")
# `variable` is a factor after melt(); as.character() keeps ifelse() from
# returning its integer codes as labels.
ggplot(df, aes(t, value)) +
  geom_point(aes(color = variable)) +
  geom_text_repel(aes(color = variable,
                      label = ifelse(!seen(variable), as.character(variable), ""))) +
  geom_label_repel(aes(color = variable,
                       label = ifelse(is.voronoi_max(t, value, variable),
                                      as.character(variable), "")))
Created on 2019-01-25 by the reprex package (v0.2.1.9000)
In light of the discussion of directlabels, I'm convinced that labeling groups of data points is not necessarily the role of ggrepel. I believe a better framing would be how to filter which data points to label, like the low-density points issue you linked to.
Hello all, thanks for the useful examples. How do I add an indicator line linking the center (or densest region) of a group to its label, as ggrepel does? In many real-world cases there are more than 20 groups, and without such a line it may not be clear which label belongs to which group.
Thanks,
Jianshu
@jianshu93
Hi Jianshu,
Consider using two data frames, the first one for the points and the second one for the centroids.
You can compute the centroids however you want (mean, median, etc.).
Here is an example:
library(ggplot2)
library(ggrepel)
library(magrittr)
library(dplyr)
df1 <- mtcars
# Group centroids: mean mpg and wt per gear
df2 <- mtcars %>% group_by(gear) %>% summarize(mpg = mean(mpg), wt = mean(wt))
ggplot(df1) +
  aes(wt, mpg, color = factor(gear)) +
  geom_point(size = 2) +
  geom_text_repel(data = df2, aes(label = gear), size = 8, min.segment.length = 0, nudge_x = 1) +
  theme_gray(base_size = 20)
Hello @slowkow ,
Thank you for the quick response. I have an example to share, and perhaps we can determine the best way to label complicated data:
library(ggplot2)
library(ggrepel)
library(directlabels)
library(dplyr)
library(RColorBrewer)

is.voronoi_max <- function(x, y, group) {
  x <- scales::rescale(x, c(0, 1))
  y <- scales::rescale(y, c(0, 1))
  rw <- c(range(x), range(y))
  del <- deldir::deldir(x, y, z = group, rw = rw, suppressMsge = TRUE)
  del_summ <- del$summary
  del_summ$index <- seq_len(nrow(del_summ))
  index <- vapply(split(del_summ, del_summ$z),
                  function(x) x$index[which.max(x$dir.area)], 1)
  seq_len(length(x)) %in% index
}

embed_elegan <- read.table("C_elegan_embedded.csv", sep = ",", header = TRUE)
head(embed_elegan)

good.shapes <- c(1:25, 33:127)
embed_elegan$embryo.time.new <- as.factor(embed_elegan$embryo.time.new)

# Per-group medians, used as label anchors
df2 <- embed_elegan %>%
  group_by(cell.type) %>%
  summarize(annembed_1 = median(annembed_1), annembed_2 = median(annembed_2))

a <- ggplot(embed_elegan, aes(x = annembed_1, y = annembed_2, shape = cell.type)) +
  geom_point(aes(color = embryo.time.new, shape = cell.type), size = 0.005, alpha = 0.5) +
  scale_color_brewer(palette = "Paired") +
  scale_shape_manual(values = good.shapes[1:37])

# 1. directlabels with smart.grid
a + theme_bw() + xlab("annembed_1") + ylab("annembed_2") +
  theme(legend.position = "none") +
  geom_dl(aes(label = cell.type), method = "smart.grid")

# 2. ggrepel labels at the group medians
a + geom_text_repel(data = df2, aes(label = cell.type), size = 3,
                    min.segment.length = 0, nudge_x = 1) +
  theme_bw() + xlab("annembed_1") + ylab("annembed_2") +
  theme(legend.position = "none")

# 3. ggrepel labels filtered by the Voronoi criterion
a + geom_text_repel(aes(label = ifelse(is.voronoi_max(annembed_1, annembed_2, cell.type),
                                       cell.type, "")),
                    max.overlaps = Inf) +
  theme_bw() + xlab("annembed_1") + ylab("annembed_2") +
  theme(legend.position = "none")
The directlabels example looks good but has no indicator line. The Voronoi approach has problems when one category (in my case, the shape) is not well clustered and appears in several places. Your example seems to work perfectly (I used the median).
Thanks,
Jianshu
I am also thinking about labeling at the centroid of a 75% confidence ellipse (considering both x and y), which could also be useful.
Thanks,
Jianshu
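A possible starting point for the ellipse idea, under one stated assumption: if the ellipse is a normal-theory data ellipse, its centre is the per-group mean of x and y, and the level (for example 0.75) only scales the ellipse around that centre. The sketch below uses mtcars as a stand-in data set; the name centers is just for illustration.

```r
# Per-group centre of a normal-theory data ellipse: the mean of x and y.
# The confidence level changes the ellipse size, not its centre.
centers <- aggregate(cbind(wt, mpg) ~ gear, data = mtcars, FUN = mean)
centers
```

A robust ellipse (for example one based on a multivariate t fit) would generally have a different centre, so the estimator matters more than the level.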
The precomputation can be avoided using stat_centroid() from the package 'ggpp'. Here I use the default, mean(). It works similarly to stat_summary(), but applies the function to both x and y in parallel, by group.
library(ggplot2)
library(ggpp)
#> Registered S3 methods overwritten by 'ggpp':
#> method from
#> heightDetails.titleGrob ggplot2
#> widthDetails.titleGrob ggplot2
#>
#> Attaching package: 'ggpp'
#> The following object is masked from 'package:ggplot2':
#>
#> annotate
library(ggrepel)
df1 <- mtcars
# with repulsion
ggplot(df1) +
  aes(wt, mpg, color = factor(gear), label = gear) +
  geom_point(size = 2) +
  stat_centroid(geom = "text_repel", position = position_nudge_keep(x = 1),
                size = 8, min.segment.length = 0) +
  theme_gray(base_size = 20)
# no repulsion
ggplot(df1) +
  aes(wt, mpg, color = factor(gear), label = gear) +
  geom_point(size = 2) +
  stat_centroid(geom = "text_s", position = position_nudge_keep(x = 1),
                size = 8, min.segment.length = 0) +
  theme_gray(base_size = 20)
Created on 2023-11-09 with reprex v2.0.2
Following up on this issue, would you be open to a PR that implements this algorithm for automatically placing labels?