r-spatial / classInt

Choose Univariate Class Intervals
https://r-spatial.github.io/classInt/
33 stars 9 forks source link

Problems with jenks #7

Closed marcbeldata closed 5 years ago

marcbeldata commented 5 years ago

Hi,

There are several places pointing out this problem.

In classIntervals() using "jenks" style consumes a lot of time (several minutes), compare to the other intervals in the package, wich are immediate. Compare:

classIntervals(values(your_data, n = 5, style = "pretty")

classIntervals(values(your_data, n = 5, style = "jenks")

This is a very common interval in gis, and it's intantaneous in QGIS, ArcGIS or CARTO, for example.

I'm using MacOS and RStudio, last versions.

Thanks.

rsbivand commented 5 years ago

Which other places (always give references with links)? You also need to motivate this choice of interval, other than claiming that it is very common.

Your code is not a reproducible example, but should be. Do you know the Jenks code in ArcGIS, etc. - if so, could you guess why (they sample for large n)? This package provides "jenks" written only in R, as a proof of concept with two for loops (you did read the code), but "fisher" does much the same through two for loops. Neither of them sample, and both will perform poorly where there are no natural breaks, because much more work is needed to find the desired number of classes for obvious reasons.

Is your actual concern that your choice of style argument (possibly through sf::plot() or in tmap is sub-optimal? For data that can be visualised, typically n is not large, so the problem does not arise. If n is large, probably many entities will take less than an on-screen pixel anyway, unless you are using interactive mode in tmap, but I'm guessing here. You do not even say how long your variable is, so n is unknown.

I see:

> library(classInt)
> n <- 1e+06
> set.seed(1)
> x <- runif(n)
> system.time(ci <- classIntervals(x, n=10, style="pretty"))
   user  system elapsed 
  0.069   0.015   0.084 
> system.time(ci <- classIntervals(x, n=10, style="jenks"))
^C
Timing stopped at: 1942 0.947 1955

so, yes, for large n and a uniform variable, "jenks" is not a good idea (nor really is "fisher"). So what do you suggest doing? Anything from the sf and tmap maintainers - @edzer @mtennekes ? Should we block or subsample, or assume that the user will know not to use non-responsive styles for large n (and/or data without natural breaks)?

Is "kmeans" a reasonable fallback?

   user  system elapsed 
  0.913   0.019   0.936 
Warning message:
Quick-TRANSfer stage steps exceeded maximum (= 50000000) 
> ci
style: kmeans
[1.548324e-07,0.1010332)    [0.1010332,0.2020559)    [0.2020559,0.3028755) 
                  100996                   101368                   100628 
   [0.3028755,0.4033903)     [0.4033903,0.503387)     [0.503387,0.6033362) 
                  100200                    99904                   100153 
   [0.6033362,0.7030424)    [0.7030424,0.8023224)    [0.8023224,0.9012859) 
                  100071                    99142                    99292 
   [0.9012859,0.9999984] 
                   98246 
marcbeldata commented 5 years ago

References: in Stack Overflow with a vector of 500 samples

https://stackoverflow.com/questions/5304057/partition-into-classes-jenks-vs-kmeans

I'm using jenks normally in vectors between 1,000 and 20,000 data with no problems in gis software. I have 20 years experience working on GIS, first as an environtmental researcher, currently producing maps for mass media with +1M audience:

https://www.vilaweb.cat/noticies/mapa-resultats-eleccions-27-s-municipis-carrer-per-carrer/

I'mt trying to integrate R in my gis pipe. I'm saying jenks is very common because is generally available in GIS software, I've seen a lot of maps using it and, in my experience, it represents better the differences in data for general audiences. It's not about mathematics, is about test and error in data visualization. I always test several styles and choose the best for the targeted audience (general vs specialist) and data.

Unfortunately ArcGIS is closed code. QGIS is open source, but I'm not a software developer. In QGIS Anita Grasser (TW: @underdarkGIS) could help. I can't provide any direct help, sorry.

I normally use n = 4 or 5, ocassionaly up to 10 intervals. In this example (source):

prop_by_age <- readRDS("03-proportion-by-age.rds")

start_time <- Sys.time()
classIntervals(values(prop_by_age[["age_18_24"]]), n = 5, style = "pretty")

style: pretty
  [0,0.2) [0.2,0.4) [0.4,0.6) [0.6,0.8)   [0.8,1] 
   130770      1775       302       170       138 

end_time <- Sys.time()

end_time - start_time
Time difference of 0.05692697 secs

start_time <- Sys.time()
classIntervals(values(prop_by_age[["age_18_24"]]), n = 5, style = "jenks")
end_time <- Sys.time()
end_time - start_time

With "jenks" I stopped after 1 hour with no output.

Thanks for your explanation. As a proof of concept, then I understand the issue.

rsbivand commented 5 years ago

It seems that QGIS implements "jenks" based on classInt, but inserts a cutoff at n > 3000, using a sampling approach from there:

https://github.com/qgis/QGIS/blob/530397c1683cc787ef90bc27e2d958779ed0d754/src/core/symbology/qgsgraduatedsymbolrenderer.cpp#L720

which refers to: https://stat.ethz.ch/pipermail/r-sig-geo/2006-March/000811.html.

The implementation takes

sample at least maximumSize values or a 10% sample, whichever is larger

but is then criticised for giving different breaks for n > 3000 because of sampling, see for example https://gis.stackexchange.com/questions/311785/qgis-gratuated-symbology-gives-differents-results-on-every-classification-with-n which helped me find the circularity. The timings are:

> n <- 3000
> set.seed(1)
> x <- runif(n)
> system.time(ci <- classIntervals(x, n=10, style="kmeans"))
   user  system elapsed 
  0.004   0.000   0.004 
> system.time(ci <- classIntervals(x, n=10, style="jenks"))
   user  system elapsed 
  8.548   0.001   8.570 
> system.time(ci <- classIntervals(x, n=10, style="fisher"))
   user  system elapsed 
  0.043   0.000   0.044 

with the "fisher" timings for n=3000 OK, I think. For n ~ 20K, QGIS samples 3000 first (3000 > 20K*0.1), but will still degrade for large 10% values. Should we optionally/default modify "fisher" to use the same size sample as QGIS does? Should we (silently?) use "fisher" when the user requests "jenks"? See also the comments in the "jenks" code on undeclared interval closure in the implementations.

See also this useful link (data size I think not mentioned): http://www.irlogi.ie/wp-content/uploads/2016/11/NUIM_ChoroHarmful.pdf

rsbivand commented 5 years ago

This https://github.com/r-spatial/classInt/commit/566be0e929b259555d5d6d29bae4ab9b6525df6e introduces QGIS-like sampling, but permits the user to pass through the sampling limit (3000L default), and the sample proportion used (0.1 default), for "fisher" and "jenks". When using "jenks" with n > sampling limit, the user is also advised to prefer "fisher". The commit also prepares the saved test output for R 3.6, so classInt now depends on R >= 3.6. The R >= 3.6 dependence is the cause of the Travis failure.

edzer commented 5 years ago

I strongly advice against making classInt depend on R >= 3.6; many OSX or windows users can't control the R version on their machine, and would not be able to install binaries of classInt, and by that neither of packages that depend on it (sf and stars come to mind).

I also got the email from BDR this morning; a more user friendly way of dealing with it is to have the tests that depend on sample use the pre-3.6 RNG.

edzer commented 5 years ago

suppressWarnings(RNGversion("3.5.3")) seems to be the incantation; see here and look for sfc.R (which tests sf::st_sample).

rsbivand commented 5 years ago

I've reverted the dependency on R 3.6 for classInt, using set.seed(..., sample.kind="Rounding") in the test when R > 3.5.3 (https://github.com/r-spatial/classInt/commit/ea555be5b8d916865f7de86bec65fa31b10e07e4). I'm assuming there will not be an R 3.5.4. I accept that getting everyone over to 3.6 will take time. Now passes Travis.

Is the change to use sampling for large n OK?

edzer commented 5 years ago

Is the change to use sampling for large n OK?

I think it makes sense.

rsbivand commented 5 years ago

@marcbeldata could you please check using remotes::install_github("r-spatial/classInt") on your workflows, preferring "fisher" to "jenks"? I need to submit to CRAN this week for other reasons.

marcbeldata commented 5 years ago

It's working for me:

jenks

I'll test in more datasets during the following days or next week.

Thanks Roger.

rsbivand commented 5 years ago

OK, thanks! I'm looking to submit on Friday, as the R integer sampler default changes in R 3.6, and affected packages have been asked to adapt to this change to avoid test output changing by mid-May latest. R 3.6.0 is coming on Friday, hence the rush.

If you need to control sampling affecting intervals, you can use set.seed(...) before calling classIntervals(), as you might with "bclust" or "kmeans" styles. I've only applied sampling to "jenks" and "fisher".

rsbivand commented 5 years ago

classInt 0.3-3 submitted to CRAN 2019-04-26 10:19.

rsbivand commented 5 years ago

classInt_0.3-3.tar.gz is on its way to CRAN 2019-04-26 10:33.

edzer commented 5 years ago

The warning for large N now also happens e.g. when breaks = equal, where - I assume - no sampling takes place. Is that intentional?

rsbivand commented 5 years ago

Fix pushed, could you check?

edzer commented 5 years ago

Works:

> classInt::classIntervals(1:1e7, 10, "equal")
style: equal
      [1,1000001) [1000001,2000001) [2000001,3000001) [3000001,4000001) 
          1000000           1000000           1000000           1000000 
  [4000001,5e+06)     [5e+06,6e+06)     [6e+06,7e+06)     [7e+06,8e+06) 
          1000000           1000000           1000000           1000000 
    [8e+06,9e+06)     [9e+06,1e+07] 
          1000000           1000000 
rsbivand commented 5 years ago

Do you need an early release?

edzer commented 5 years ago

I had in mind (changed code, but didn't push yet) to hard-code suppressing the warning, always, in stars::plot.stars and sf::plot.sf

edzer commented 5 years ago

I will revert that, since the default methods (sf: equal, stars: quantile) now don't warn in case of large maps. No early release needed.

rsbivand commented 5 years ago

Let me know when you need it (before next stars or sf release?).

edzer commented 5 years ago

Will do!