rstudio / leaflet

R Interface to Leaflet Maps
http://rstudio.github.io/leaflet/
Other
810 stars 506 forks source link

colorQuantile function requires refining to decide quantiles when data is not well distributed #94

Open yyz1989 opened 9 years ago

yyz1989 commented 9 years ago

Hi there,

First of all thank you very much for the efforts to develop leaflet for R!

Recently I encountered an issue when trying to decide quantiles for data when coloring the map. The data is not well distributed so it cannot even select 9 quantiles in 306 values. Here is the code.

> vals
  [1]   2   0   2 127   2   3   1   8   7   3   4  16   3   5   4   6   1   3   5   1   1   4   4   1  11   1   0   1   3   2   3   7  14   2   9  16   1   6   5   4   3   5  10   8   7   4   0
 [48]  11   5   4   7   3   2   8   1   3   4   3   3  21  11   6   2  10   5   5   2   2   5   2   5  10  27   0   5   9   6   6   5   4  20   7   4   6   8   3   5  20   4   2   8   3   7  10
 [95]   1   8   7   2   1   9   2   5 137   4   3   0   3   5   6   7   2  17   4   4   4   4   3   2   0   2   5   1   0   3   1   2   0  10  12   3   7   5   0  11   5   4   2   2   1   0   4
[142]   6   2   3   2   7   6   2   1   5   1   9   2   0   4   0   7   3   4   4   0   0   1   0   5   8   1   2   4  10  12  15   2   5   3   2   3   1   5   0   7  67   1   2   5   1   2   0
[189]   3   6   1   7   2   1   4   2   2   8   0   3   3   7   3  16   8   6  18   2  20   5   3  28   2   6   5   5   6   1  10   2  18   7   5   7   8   6   6   7   7  10   4  17   6  12   5
[236]   5   4   4   3   6   8   2   2   0   7   6   4   3   4   4   6   3  39   4   2   1   3   2   3   1   0   6   6   0   4   1  14  13   0  25   2   4   5   5  11   2   2   0  17   6   2   3
[283]   5   7   1   4   3   2  19   2   2   6   1   1   2   1   2   1   1   4   3   0   4   3
> pal <- colorQuantile("YlOrRd", NULL, n = 9)
> pal(vals)
Error in cut.default(x, binsToUse, labels = FALSE, include.lowest = TRUE,  : 
  'breaks' are not unique

Then I looked into the source code of colorQuantile function and tried to debug it step by step.

> probs <- seq(0, 1, length.out = 10)
> probs
 [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667 0.7777778 0.8888889 1.0000000
> bins <- quantile(vals, probs, na.rm = TRUE, names = FALSE)
> bins
 [1]   0   1   2   2   3   4   5   7  10 137

So this is the reason! There are 2 same quantiles causing the problem in cut. If I decrease the number of quantile to 8, then there won't be any problem. However, based on different filter criteria, query results cannot be predicted so it is difficult to predefine the number of quantiles. So what I am doing is as follows:

      probs <- seq(0, 1, length.out = quantileNum + 1)
      bins <- quantile(newObsVal, probs, na.rm = TRUE, names = FALSE)
      while (length(unique(bins)) != length(bins)) {
        quantileNum <- quantileNum - 1
        probs <- seq(0, 1, length.out = quantileNum + 1)
        bins <- quantile(newObsVal, probs, na.rm = TRUE, names = FALSE)
      }

It looks quite ugly and may lead to performance issues, but it works. After some investigation, I also discovered that it seems we can also use .bincode to categorize data:

> bins
 [1]   0   1   2   2   3   4   5   7  10 137
> .bincode(vals, bins, include.lowest = TRUE, right = FALSE)
  [1] 4 1 4 9 4 5 2 8 8 5 6 9 5 7 6 7 2 5 7 2 2 6 6 2 9 2 1 2 5 4 5 8 9 4 8 9 2 7 7 6 5 7 9 8 8 6 1 9 7 6 8 5 4 8 2 5 6 5 5 9 9 7 4 9 7 7 4 4 7 4 7 9 9 1 7 8 7 7 7 6 9 8 6 7 8 5 7 9 6 4 8 5 8 9
 [95] 2 8 8 4 2 8 4 7 9 6 5 1 5 7 7 8 4 9 6 6 6 6 5 4 1 4 7 2 1 5 2 4 1 9 9 5 8 7 1 9 7 6 4 4 2 1 6 7 4 5 4 8 7 4 2 7 2 8 4 1 6 1 8 5 6 6 1 1 2 1 7 8 2 4 6 9 9 9 4 7 5 4 5 2 7 1 8 9 2 4 7 2 4 1
[189] 5 7 2 8 4 2 6 4 4 8 1 5 5 8 5 9 8 7 9 4 9 7 5 9 4 7 7 7 7 2 9 4 9 8 7 8 8 7 7 8 8 9 6 9 7 9 7 7 6 6 5 7 8 4 4 1 8 7 6 5 6 6 7 5 9 6 4 2 5 4 5 2 1 7 7 1 6 2 9 9 1 9 4 6 7 7 9 4 4 1 9 7 4 5
[283] 7 8 2 6 5 4 9 4 4 7 2 2 4 2 4 2 2 6 5 1 6 5

So maybe we can replace cut with .bincode? However I don't know if there is any other side-effect. If we should not, what can I do to get around this issue?

Thank you very much and have a nice day.

Regards, Yang

jcheng5 commented 9 years ago

If a value of 2 is passed in, which bin does it belong to? It looks like the result of .bincode is to simply throw away one of the bins, I'm not sure that's a better result than an error...?

I could maybe imagine performing your loop, with a warning message...

yyz1989 commented 9 years ago

Hi @jcheng5

Thank you very much for your reply. Yeah, using .bincode will keep the problem silent, but I am also not sure about the consequence :smile:. If there is no other way to get around this issue, I also agree to throw out a warning message and try less breaks because this issue is really difficult to detect. It works for 99% of the time but only fails in really exceptional case. I guess at least it should be better to avoid throwing out an error and terminating the program when the problem is not critical. How do you think about it?

Thank you very much and have a nice day.

Regards, Yang

gregmacfarlane commented 6 years ago

I'd like this feature as well. I think if breaks are not unique, I'd be happy to re-allocate on the fly rather than stop the program or try to program around every edge case. I mostly just want to put colors on my map.

richpauloo commented 6 years ago

I'm also running into this issue, and while changing the number of breaks is a workaround, I just wanted to add another voice of support for this feature.

sdk2116 commented 6 years ago

Was this ever figured out?

deidrem commented 3 years ago

Bringing attention to this topic again! Seems to still be an issue.

tomlincr commented 3 years ago

Likewise adding my voice to this.

In the meantime I'm using colorNumeric() from the cases where this is an issue.

jumanbar commented 3 years ago

yeah, it's been a problem for me too.

velicknd commented 2 years ago

Still a problem for me as well.

knharrington commented 10 months ago

I would also like for this issue to throw out a warning instead of an error as well!

michaelpkuhn commented 6 months ago

This workaround uses getJenksBreaks from BAMMtools.

This does not fix the issue with quantile, but Jenks Breaks are a commonly used bin method in GIS programs. Therefore, I think is a reasonable alternative in the meantime. That is, it puts the number of colors that I want on the map.

factor_domain <- cut(
    tb$numericvar, 
    breaks = getJenksBreaks(tb$numericvar, n_breaks), 
    include.lowest = T
)

pal <- colorFactor(
    palette = "YlOrRd",
    domain = factor_domain
)

m <- leaflet(tb) %>%
    addProviderTiles("CartoDB") %>% 
    addPolygons(
        fillColor = ~pal(factor_domain),
        weight = 2,
        opacity = 1,
        color = "white",
        fillOpacity = 0.1
    )

m

Edit:

This is actually easier to do with colorBin.

pal <- colorBin(
    palette = "YlOrRd", 
    domain = tb$numericvar,
    bins = getJenksBreaks(tb$numericvar, 5)
)

m <- leaflet(tb) %>%
    addProviderTiles("CartoDB") %>% 
    addPolygons(fillColor = ~pal(tb$numericvar),
      weight = 2,
      opacity = 1,
      color = "white",
      fillOpacity = 0.1
    )

I also determined that this issue is likely due to the difference between quantile classification, which is popular in GIS software, and the quantile function in R which divides the range into intervals of equal probabilities.

Maybe others did not misinterpret this function, but it is how I found this thread. It might be helpful to add named bin methods to the bin function: quantile_classes (existing behavior for bins = n), jenks, etc.