teunbrand / ggh4x

ggplot extension: options for tailored facets, multiple colourscales and miscellaneous
https://teunbrand.github.io/ggh4x/
Other
564 stars 33 forks source link

Centroid stat #16

Closed teunbrand closed 4 years ago

teunbrand commented 4 years ago

How is this not a thing that already exists?

  1. Simply make a stat that takes a fun.x and fun.y argument,
  2. Per group, apply fun.x to the x-positions, apply fun.y to the y-positions.
  3. Ensure the results are of the same length or of length 1
  4. Add convenience function for centroids where fun.x = mean and fun.y = mean.
teunbrand commented 4 years ago

Essentially do this, but as a stat:

library(ggplot2)

centroid <- aggregate(iris[, 1:4], list(Species = iris$Species), mean)

ggplot(iris, aes(x=Sepal.Width, y=Sepal.Length)) +  
  geom_point(aes(color=Species)) +
  geom_text(data = centroid, aes(label = Species))

tjebo commented 4 years ago

Hi! That sounds cool. Use case for example: https://stackoverflow.com/questions/9441436/ggplot-centered-names-on-a-map

R U working on that stat already? Shall I give it a try?

Btw, may I be nosy and ask why the package is called ggh4x?

teunbrand commented 4 years ago

Hey Tjebo,

Yeah I have something that kind of works already, but it still needs some more testing + documentation before I'm adding (and I've been busy with other things).

If you're interested to try this out here is my working script:

library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.0.2

stat_funxy <-
  function(mapping = NULL, data = NULL, geom = "point",
           position = "identity", ..., funx = mean, funy = funx,
           argx = list(), argy = list(), na.rm = FALSE, 
           show.legend = NA,
           inherit.aes = TRUE) {
    if (!is.function(funx)) {
      stop("The `funx` argument must be a function.", call. = FALSE)
    }
    funx <- force(funx)
    if (!is.function(funy)) {
      stop("The `funy` argument must be a function.", call. = FALSE)
    }
    funy <- force(funy)

    layer(
      data = data, mapping = mapping, stat = StatFunxy, geom = geom,
      position = position, show.legend = show.legend, inherit.aes = inherit.aes,
      params = list(
        funx = funx, funy = funy, 
        argx = argx, argy = argy,
        ...
      )
    )
  }

StatFunxy <- ggproto(
  "StatFunxy", Stat,
  required_aes = c("x", "y"),
  compute_group = function(data, scales, funx, funy, argx, argy) {
    data <- as.list(data)
    x <- do.call(funx, c(unname(data["x"]), argx))
    y <- do.call(funy, c(unname(data["y"]), argy))
    other <- setdiff(names(data), c("x", "y"))
    other <- lapply(data[other], head, max(length(x), length(y)))

    data <- do.call(cbind.data.frame, c(other, x = x, y = y))
  }
)

ggplot(iris, aes(Sepal.Width, Sepal.Length, group = Species)) +
  geom_point(aes(colour = Species)) +
  stat_funxy(aes(label = Species), geom = "text")

Created on 2020-07-03 by the reprex package (v0.3.0)

The package is called ggh4x because it is 1337 5P34k (leet speak) for 'grammar of graphics hacks'. Mostly because of the multiple colour scales, panel size forcing and panel-specific position axes; these are not strictly in line with core principles of grammar of graphics (but can be useful nonetheless). The term originated in the 80's and reminded me of this youtube video, so that is where the synthwavy hex logo comes from. Also this is were I dump other stuff that I sometimes find useful while plotting (but this has nothing to do with the name).

tjebo commented 4 years ago

Love the video! I did not know about leet speak :) Will check the stat, maybe tomorrow or so

tjebo commented 4 years ago

do.call(c(unname(...), list(args)) is very clever

I'd simplify this though, because data frames are lists already. something along the lines of

compute_group = function(data, scales, funx, funy, argx, argy) {

    x <- do.call(funx, c(unname(data["x"]), argx))
    y <- do.call(funy, c(unname(data["y"]), argy))
    if(length(x)!=length(y)){
      stop("funx and funy must result in the same length", .call = FALSE)
    } 
    other <- head(data[setdiff(names(data), c("x", "y"))], length(x)) 

    data <- cbind(x,y, other)
  }

I cannot really think of a case where x and y coordinates should have different lengths... ? Do you see this case? So that's how I simplified good deal just taking the length of x from the others.

tjebo commented 4 years ago

for the previously mentioned thread this would result in

library(ggplot2)
library(maps)

stat_funxy <- etc...

StatFunxy <- ggproto(
  "StatFunxy", Stat,
  required_aes = c("x", "y"),
  compute_group = function(data, scales, funx, funy, argx, argy) {

    x <- do.call(funx, c(unname(data["x"]), argx))
    y <- do.call(funy, c(unname(data["y"]), argy))
    if(length(x)!=length(y)){
      stop("funx and funy must result in the same length", .call = FALSE)
    }
    other <- head(data[setdiff(names(data), c("x", "y"))], length(x))

    data <- cbind(x,y, other)
  }
)

county_df <- map_data('county')  #mappings of counties by state
ny <- subset(county_df, region=="new york")   #subset just for NYS
ny$county <- ny$subregion

ggplot(ny, aes(x = long, y = lat, group = group)) +  
  geom_polygon(colour='black', fill=NA) +
  stat_funxy(aes(label = subregion), geom = "text",size=2)+
  coord_map()


ggplot(ny, aes(x = long, y = lat, group = group)) +  
  geom_polygon(colour='black', fill=NA) +
  stat_funxy(aes(label = subregion), 
             funx = function(x) mean(range(x)), geom = "text",size=2) +
  coord_map()

Created on 2020-07-04 by the reprex package (v0.3.0)

tjebo commented 4 years ago

Because I like the idea of the centroid use case, and I think it would be much used, one could consider either mean or mean(range(x)) as main methods, maybe depending on if the center looked for are like points or polygons. Just some thoughts :)

teunbrand commented 4 years ago

Hey Tjebo,

Yeah I'll consider putting up a new answer once I finally finish up this stat. If you want to answer that question instead feel free to do so! Here are my thoughts on some of the comments.

I cannot really think of a case where x and y coordinates should have different lengths... ? Do you see this case? So that's how I simplified good deal just taking the length of x from the others.

There is the rare case that one of the coordinates has length 1 while the other has more. Imagine you would like to draw a line over some range, you can do fun.x = range and fun.y = mean. This results in length 2 x but length 1 y, which should automatically expand the length 1 coordinate to fit the longer coordinate.

Because I like the idea of the centroid use case, and I think it would be much used, one could consider either mean or mean(range(x)) as main methods, maybe depending on if the center looked for are like points or polygons. Just some thoughts :)

Yeah you make a good point that both these cases would be convenient. I guess I could make two extra constructors that both use the ggproto but give different fun.x/fun.y that are called stat_midpoint() and stat_centroid() for these common cases.

tjebo commented 4 years ago

There is the rare case that one of the coordinates has length 1 while the other has more.

I see - in this case you still need to test for the length of y to x for when the functions result in unequal lengths not including 1, I guess :) , otherwise it will throw a non-specific error that the data frame cannot be created

as for the recycling, I think this would also work for the “others” data frame / list somehow rather than using “head”, but not sure.

teunbrand commented 4 years ago

Should now be possible to do the following, e.g. make flower-like plots about flowers (@robinweide, go nuts!)

library(ggh4x)
#> Loading required package: ggplot2
#> Warning: package 'ggplot2' was built under R version 4.0.2

ggplot(iris, aes(Sepal.Width, Sepal.Length, 
                 xend = Sepal.Width, yend = Sepal.Length,
                 colour = Species, group = Species)) +
  geom_point() +
  stat_centroid(crop_other = FALSE, geom = "curve") +
  stat_centroid(crop_other = FALSE, geom = "curve", curvature = -0.5) +
  stat_centroid(geom = "point", colour = "gold", size = 10)

Created on 2020-07-06 by the reprex package (v0.3.0)