r-tmap / tmap

R package for thematic maps
https://r-tmap.github.io/tmap
GNU General Public License v3.0

Improving variable to aesthetic mapping (input asked) #406

Closed mtennekes closed 4 years ago

mtennekes commented 4 years ago

tmap 3.0 will be released in a few days. For this version, I want to improve the variable mapping, so any feedback/tips is welcome.

There is a need for two features:

1. Integer variables

Treat a numeric variable as integer. This is needed because currently the legend labels will be 0 to 10, 10 to 20, 20 to 30, where the presumed intervals are [0, 10), [10, 20) and [20, 30] (so open on the right-hand side, except the last). When the variable is an integer, the legend labels should be 0 to 9, 10 to 19, 20 to 29 (or 30).
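
A minimal sketch of the two labelling schemes in base R. The variable names are illustrative only and are not part of tmap:

```r
# Breaks for the example above
breaks <- c(0, 10, 20, 30)
lower <- head(breaks, -1)   # 0, 10, 20
upper <- tail(breaks, -1)   # 10, 20, 30

# Continuous labels: each upper bound reappears as the next lower bound
labels_cont <- paste(lower, "to", upper)

# Integer labels: move every upper bound down by one, except the last,
# which is closed and therefore keeps its value
upper_int <- upper - 1
upper_int[length(upper_int)] <- upper[length(upper)]
labels_int <- paste(lower, "to", upper_int)

print(labels_cont)
print(labels_int)
```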

I'm thinking about style = "integer" or an additional argument as.integer. The latter probably makes more sense since many break styles (current options are c("cat", "fixed", "sd", "equal", "pretty", "quantile", "kmeans", "hclust", "bclust", "fisher", "jenks", and "log10_pretty")) should handle integers slightly differently. For instance, "log10_pretty" will return 0 to 1, 1 to 10, 10 to 100 when the variable is continuous and should return 0, 1 to 9, 10 to 99 when it is an integer.
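
The "log10_pretty" case could be sketched as follows in base R. This only illustrates the desired labels and assumes the breaks 0, 1, 10, 100; it is not how tmap computes them:

```r
# Assumed log10-style breaks
breaks <- c(0, 1, 10, 100)
lower <- head(breaks, -1)       # 0, 1, 10
upper_int <- tail(breaks, -1) - 1   # 0, 9, 99

# A class that contains a single integer is labelled with that integer only
labels_int <- ifelse(lower == upper_int,
                     as.character(lower),
                     paste(lower, "to", upper_int))
print(labels_int)
```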

What do you think? If we go for the second option, what would be a good name for the argument? as.integer, as.continuous, as.discrete, ....?

Next question: should tmap set the default value of this argument to continuous, or should the default value be determined by whether all variable values are integers?
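
One way the second option might be checked, as a sketch. The helper `is_whole()` is hypothetical and not a tmap function; it compares values exactly, so a real implementation may want a tolerance for floating point data:

```r
# Hypothetical helper: TRUE when x is numeric and every non-missing value
# is a whole number (regardless of whether it is stored as integer or double)
is_whole <- function(x) {
  is.numeric(x) && all(x == round(x), na.rm = TRUE)
}

is_whole(c(1, 2, 3))      # whole numbers stored as double
is_whole(1:10)            # integer storage
is_whole(c(0.5, 1, 2))    # contains a fractional value
```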

(see also https://github.com/mtennekes/tmap/issues/258 and https://github.com/mtennekes/tmap/issues/399)

2. Specific value to color mapping

Sometimes all a user (including myself) wants is to map specific data values to specific colors. How should this be done? Keep in mind that it should work for integer and categorical data.

For categorical data, we could let the user assign a named color vector to the argument palette, where the names correspond to the levels.
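
A sketch of how such a lookup could work in base R, with made-up data. The names of the color vector are matched to the factor levels, so the order in which the colors are given does not matter:

```r
# Made-up categorical variable
income <- factor(c("low", "high", "mid", "low"),
                 levels = c("low", "mid", "high"))

# Named color vector, deliberately in a different order than the levels
pal <- c(high = "blue", low = "purple", mid = "green")

# Colors in level order (e.g. for the legend)
cols_by_level <- pal[levels(income)]

# Color per observation (e.g. for the polygons)
fill <- unname(pal[as.character(income)])

print(cols_by_level)
print(fill)
```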

How do we do this for numeric data? A color table? If so, it makes sense to add the labels in this color table as well, rather than via the labels argument. Any ideas?
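
To make the color table idea concrete, here is a sketch in base R. The column names `value`, `color` and `label` and the fallback color are assumptions for illustration, not a proposed tmap interface:

```r
# Hypothetical color table: one row per data value
color_table <- data.frame(
  value = c(1L, 2L, 3L),
  color = c("blue", "red", "green"),
  label = c("Water", "Urban", "Forest"),
  stringsAsFactors = FALSE
)

# Made-up integer data, including a missing value
x <- c(3L, 1L, 2L, 2L, NA)

# Look up the row of each data value; unmatched or missing values give NA
idx <- match(x, color_table$value)
fill <- color_table$color[idx]
fill[is.na(fill)] <- "grey"   # assumed fallback color for missing values

print(fill)
print(color_table$label[idx])
```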

(see also https://github.com/r-spatial/mapview/issues/208)

@Nowosad @Robinlovelace @sjewo @jannes-m @tim-salabim @edzer @rsbivand @mcSamuelDataSci @zross

tim-salabim commented 4 years ago

Hi @mtennekes, I have been struggling with this same issue recently as well. For mapview, I think I have it under acceptable control now. Acceptable meaning that in the scope of mapview I don't care too much about whether the legend maps [0, 1) or [0, 1]. Currently, and mostly for convenience, mapview treats all integers as numeric values and all character values as factors.

rsbivand commented 4 years ago

Tangentially, there is infrastructure in classInt to handle interval closure (intervalClosure=). On occasion, I've found it helpful to run classInt twice, first with dataPrecision=NULL (the default), then with style="fixed" and a non-default dataPrecision=, or to just use dataPrecision= directly. tmap::tm_fill() has the equivalent interval.closure= argument, but I don't see dataPrecision=.
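
For readers unfamiliar with interval closure, the effect can be illustrated with base R's `cut()`. This sketch uses `cut()` rather than classInt, and the breaks are made up:

```r
breaks <- c(0, 10, 20, 30)

# Left-closed intervals: a value on a break falls into the class above it
left <- cut(10, breaks, right = FALSE, include.lowest = TRUE)

# Right-closed intervals: the same value falls into the class below it
right <- cut(10, breaks, right = TRUE, include.lowest = TRUE)

print(levels(left))
print(as.character(left))
print(levels(right))
print(as.character(right))
```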

In addition, @dieghernan has contributed a new style: "headtails" with a vignette. I'm looking to submit to CRAN soon, to make this available.

mtennekes commented 4 years ago

Thanks @tim-salabim and @rsbivand.

Currently, tmap also treats integers as numeric and character as factors, but since there were a few use cases in which the data values are clearly integers, it would be good to adjust the breaks (or at least the labels) accordingly.

The interval closure is not my main concern. It is under control: the argument legend.format contains a parameter called digits, which is similar to dataPrecision in classInt. It probably would have been easier for me to use dataPrecision in the implementation. Looking forward to testing this new style headtails in tmap.

sjewo commented 4 years ago

Hi @mtennekes, those are nice improvements for tmap!

1) For my use cases the new legend labels for integers are really helpful. I would prefer an additional option "as.integer" with a default value determined by the class of the variable (integer or numeric).

2) I think a named color vector would be fine for factors and numeric (or integer) variables as well. A unified approach to define a palette would be more user friendly, but I don't know if this would be too complicated for floating point numbers.

edzer commented 4 years ago

Hi @mtennekes about the integer legend: 10 years ago I would have thought "great!", now I think it is over-engineering. Does ggplot2 have this feature?

For the color ramps: stars now adopts a vector of colors mapping one-to-one with an integer variable, starting at 1 (like levels of a factor); https://github.com/r-spatial/stars/issues/128
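
In base R terms, this one-to-one mapping amounts to plain positional indexing. A sketch with made-up values (not code from stars):

```r
# Color i belongs to integer value i, like the i-th level of a factor
cols <- c("blue", "red", "green")
x <- c(3L, 1L, 2L, 1L)

# Indexing the color vector by the data gives the color per cell
fill <- cols[x]
print(fill)

# The same lookup expressed through a factor with levels 1, 2, 3
f <- factor(x, levels = seq_along(cols))
fill_from_factor <- cols[as.integer(f)]
```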

mtennekes commented 4 years ago

Color assignment is working now. Also the colors from stars are used (I check whether there are duplicated levels and if so, apply droplevels).

library(tmap)
library(stars)
#> Loading required package: abind
#> Loading required package: sf
#> Linking to GEOS 3.8.0, GDAL 2.4.2, PROJ 5.2.0

data(World)

# palette of named colors for a character/factor variable
tm_shape(World) + tm_polygons("income_grp", 
    palette = c("2. High income: nonOECD" = "red",
        "3. Upper middle income" = "green", 
        "4. Lower middle income" = "pink", 
        "1. High income: OECD" = "blue",
        "5. Low income" = "purple"))


# palette of named colors for a numeric variable
World$income_grp_int <- as.integer(World$income_grp)
tm_shape(World) + tm_polygons("income_grp_int", style = "cat", 
    palette = c("2" = "red", 
        "3" = "green", 
        "4" = "pink", 
        "1" = "blue",
        "5" = "purple"))


# use the colors of a stars object
#getwd()
r = read_stars("pr_landcover_wimperv_10-28-08_se5.img", 
    RAT = "Land Cover Class", proxy = TRUE)
# downloaded from https://s3-us-west-2.amazonaws.com/mrlc/PR_landcover_wimperv_10-28-08_se5.zip

qtm(r) + tm_legend(outside = TRUE)

[image: output of the qtm(r) call above]

Nowosad commented 4 years ago

@mtennekes, thank you for opening this discussion.

1. Integer variables

I think it would be a nice addition to tmap, but it is not crucial. It depends on the effort you would make to add this feature. An as.integer argument sounds fine.

2. Specific value to color mapping

This is, in my opinion, a way more interesting and important feature. I already started this discussion at https://github.com/mtennekes/tmap/issues/276 and at https://github.com/mtennekes/tmap/issues/388.

It would also be great to make it possible to extend the color mapping to external symbologies (see https://github.com/mtennekes/tmap/issues/65 and https://github.com/r-spatial/discuss/issues/36).

Update: The above examples look great! I have some questions about the last example: does it drop empty levels by default? Is it possible to not drop them? How can someone edit the legend there (one category does not have a name)?

mtennekes commented 4 years ago

Good point @Nowosad !

Hmm, why isn't there an argument to specify whether unused levels are dropped (@mtennekes?)

That specific file is crappy: I think it doesn't contain unused levels, but duplicated levels. Also the black-colored category has level "". It is not easy to change the legend afterwards. Much easier is to replace all the "" values with NA, and set colorNA = "black".
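
The replacement step could look like this in base R, shown on a made-up factor rather than on the stars object. The resulting missing values would then be drawn with the color passed to colorNA:

```r
# Made-up land cover classes, where one class has the empty name ""
land_cover <- factor(c("Forest", "", "Water", ""))

# Replace the empty name by NA and rebuild the factor,
# so that "" no longer appears as a level
values <- as.character(land_cover)
values[values == ""] <- NA
land_cover_clean <- factor(values)

print(levels(land_cover))
print(levels(land_cover_clean))
```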

Nowosad commented 4 years ago

You can find some examples with unused levels at https://github.com/r-spatial/stars/issues/245#issuecomment-601609490.

edzer commented 4 years ago

droplevels drops unused factor levels. I wouldn't do that automatically: if you plot time series of factor maps, at some times certain levels may not be present but you'd still want them in the legend.
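
A small base R illustration of the difference, with made-up classes for a single time step:

```r
classes <- c("forest", "urban", "water")

# At this time step no cell is "urban", but the level is still defined
t1 <- factor(c("forest", "water", "forest"), levels = classes)

# Keeping the levels preserves the full legend across time steps
levels_kept <- levels(t1)

# droplevels() removes "urban", so the legend would change between maps
levels_dropped <- levels(droplevels(t1))

print(levels_kept)
print(levels_dropped)
```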

Nowosad commented 4 years ago

I agree @edzer, but I think there should be an argument in tmap invoking droplevels. It could be FALSE by default.

mtennekes commented 4 years ago

Exactly what I'm working on: an argument drop.levels which is by default FALSE.

And I'll add an argument as.integer which formats the labels as integers (so 0 to 9, 10 to 19 etc). For now, I'll only do this for style = "pretty" and "log10_pretty", which should be sufficient.

Thanks for your input!

zross commented 4 years ago

This is totally great! I provided a bit of code for reference

  1. I'm going to disagree with Edzer about the over-engineering. I actually think the legend-integer issue is very important. As it stands, the tmap legend for integers literally does not make sense, since you can't tell whether a given integer on the margins falls into one category or another. Really important -- and I like your solution.

  2. I don't have an opinion on the 2nd issue beyond what has already been supplied.

library(sf)
library(tmap)
library(dplyr)

counties <- read_sf("https://cdn.jsdelivr.net/npm/us-atlas@3/counties-10m.json") %>% 
  filter(stringr::str_sub(id,1,2) == "36")

n <- nrow(counties)
set.seed(100)
counties <- counties %>% 
  mutate(
    vals_int = sample(1:10, n, replace = TRUE),
    vals_cont = rnorm(n)
  )

tm_shape(counties) + 
  tm_polygons("vals_int", style = "pretty")

tm_shape(counties) + 
  tm_polygons("vals_cont")

[images: output of the two tm_polygons() calls above]

mtennekes commented 4 years ago

That's a very nice example @zross. It illustrates another problem:

pretty(runif(100, min = 0, max = 10))
#> [1]  0  2  4  6  8 10
pretty(1L:10L)
#> [1]  0  2  4  6  8 10

When I opened this issue, I thought that changing the labels at the righthand-side of the intervals would be enough (e.g. from 0-10, 10-20 to 0-9, 10-19, etc). However, in this case it would make more sense to have 1-2, 3-4, 5-6, 7-8, 9-10 (given n=5). So pretty is not very useful here.
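
One possible way to get such classes, as a rough sketch. The helper `int_bins()` is hypothetical and is not what tmap or classInt implement; it splits the integer range into bins of equal integer width, and for ranges that do not divide evenly it can return more or fewer than n bins:

```r
# Hypothetical helper: label equal-width integer bins covering range(x)
int_bins <- function(x, n) {
  rng <- range(x)
  # Number of integers per bin, rounded up so the whole range is covered
  width <- ceiling((rng[2] - rng[1] + 1) / n)
  lower <- seq(rng[1], rng[2], by = width)
  upper <- pmin(lower + width - 1, rng[2])
  # Bins holding a single integer are labelled with that integer only
  ifelse(lower == upper, as.character(lower), paste0(lower, "-", upper))
}

int_bins(1:10, n = 5)
```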

Any ideas how to tackle this problem? @rsbivand does classInt offer a method for this?

rsbivand commented 4 years ago

No, pretty() expects that x= is a continuous variable. classIntervals(x, n=5, style="pretty", intervalClosure="right") gives the classes, but not the break labels.

mtennekes commented 4 years ago

library(tmap)
data(World)

# as.count is TRUE for integers if style = pretty, fixed, or log10_pretty

# N (natural numbers, with 0)
World$x <- sample(0:20, size = 177, replace = TRUE)
tm_shape(World) + tm_polygons("x")


# N+ (natural numbers, positive)
World$x <- sample(1:20, size = 177, replace = TRUE)
tm_shape(World) + tm_polygons("x")


# Z (integers)
World$x <- sample(-10:10, size = 177, replace = TRUE)
tm_shape(World) + tm_polygons("x")
#> Variable(s) "x" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.


# show as continuous (old way)
World$x <- sample(1:20, size = 177, replace = TRUE)
tm_shape(World) + tm_polygons("x", as.count = FALSE)


# style: fixed
tm_shape(World) + tm_polygons("x", breaks = c(1, 5, 10, 20))


# scientific notation (decided to use the set notation)
tm_shape(World) + tm_polygons("x", breaks = c(0, 1, 3, 5, 10, 20), 
   legend.format = list(scientific = TRUE))


# style: log10_pretty (continuous)
tm_shape(World) + tm_polygons("pop_est", style = "log10_pretty")


# style: log10_pretty (count)
tm_shape(World) + tm_polygons("pop_est", as.count = TRUE, style = "log10_pretty")

Created on 2020-04-07 by the reprex package (v0.3.0.9001)

mcSamuelDataSci commented 4 years ago

Thank you Martijn, both these enhancements are very helpful for me, exactly as you are implementing them!


mcSamuelDataSci commented 4 years ago

Wonderful!!


rsbivand commented 4 years ago

Re: https://github.com/mtennekes/tmap/issues/406#issuecomment-609428252 classInt 0.4-3 with headtails style on CRAN.

mtennekes commented 4 years ago

Re: #406 (comment) classInt 0.4-3 with headtails style on CRAN.

... and already supported by tmap

data(World)
tm_shape(World) + tm_symbols(col = "pop_est_dens",
    style = "headtails", style.args = list(thr = 1))

mtennekes commented 4 years ago

tmap 3.0 on its way to CRAN