Closed marcbeldata closed 5 years ago
Which other places (always give references with links)? You also need to motivate this choice of interval, other than claiming that it is very common.
Your code is not a reproducible example, but should be. Do you know the Jenks code in ArcGIS, etc. - if so, could you guess why (they sample for large n)? This package provides "jenks" written only in R, as a proof of concept with two for loops (you did read the code), but "fisher" does much the same through two for loops. Neither of them sample, and both will perform poorly where there are no natural breaks, because much more work is needed to find the desired number of classes for obvious reasons.
Is your actual concern that your choice of style argument (possibly through sf::plot()
or in tmap is sub-optimal? For data that can be visualised, typically n is not large, so the problem does not arise. If n is large, probably many entities will take less than an on-screen pixel anyway, unless you are using interactive mode in tmap, but I'm guessing here. You do not even say how long your variable is, so n is unknown.
I see:
> library(classInt)
> n <- 1e+06
> set.seed(1)
> x <- runif(n)
> system.time(ci <- classIntervals(x, n=10, style="pretty"))
user system elapsed
0.069 0.015 0.084
> system.time(ci <- classIntervals(x, n=10, style="jenks"))
^C
Timing stopped at: 1942 0.947 1955
so, yes, for large n and a uniform variable, "jenks" is not a good idea (nor really is "fisher"). So what do you suggest doing? Anything from the sf and tmap maintainers - @edzer @mtennekes ? Should we block or subsample, or assume that the user will know not to use non-responsive styles for large n (and/or data without natural breaks)?
Is "kmeans" a reasonable fallback?
user system elapsed
0.913 0.019 0.936
Warning message:
Quick-TRANSfer stage steps exceeded maximum (= 50000000)
> ci
style: kmeans
[1.548324e-07,0.1010332) [0.1010332,0.2020559) [0.2020559,0.3028755)
100996 101368 100628
[0.3028755,0.4033903) [0.4033903,0.503387) [0.503387,0.6033362)
100200 99904 100153
[0.6033362,0.7030424) [0.7030424,0.8023224) [0.8023224,0.9012859)
100071 99142 99292
[0.9012859,0.9999984]
98246
References: in Stack Overflow with a vector of 500 samples
https://stackoverflow.com/questions/5304057/partition-into-classes-jenks-vs-kmeans
I'm using jenks normally in vectors between 1,000 and 20,000 data with no problems in gis software. I have 20 years experience working on GIS, first as an environtmental researcher, currently producing maps for mass media with +1M audience:
https://www.vilaweb.cat/noticies/mapa-resultats-eleccions-27-s-municipis-carrer-per-carrer/
I'mt trying to integrate R in my gis pipe. I'm saying jenks is very common because is generally available in GIS software, I've seen a lot of maps using it and, in my experience, it represents better the differences in data for general audiences. It's not about mathematics, is about test and error in data visualization. I always test several styles and choose the best for the targeted audience (general vs specialist) and data.
Unfortunately ArcGIS is closed code. QGIS is open source, but I'm not a software developer. In QGIS Anita Grasser (TW: @underdarkGIS) could help. I can't provide any direct help, sorry.
I normally use n = 4 or 5, ocassionaly up to 10 intervals. In this example (source):
prop_by_age <- readRDS("03-proportion-by-age.rds")
start_time <- Sys.time()
classIntervals(values(prop_by_age[["age_18_24"]]), n = 5, style = "pretty")
style: pretty
[0,0.2) [0.2,0.4) [0.4,0.6) [0.6,0.8) [0.8,1]
130770 1775 302 170 138
end_time <- Sys.time()
end_time - start_time
Time difference of 0.05692697 secs
start_time <- Sys.time()
classIntervals(values(prop_by_age[["age_18_24"]]), n = 5, style = "jenks")
end_time <- Sys.time()
end_time - start_time
With "jenks" I stopped after 1 hour with no output.
Thanks for your explanation. As a proof of concept, then I understand the issue.
It seems that QGIS implements "jenks" based on classInt, but inserts a cutoff at n > 3000, using a sampling approach from there:
which refers to: https://stat.ethz.ch/pipermail/r-sig-geo/2006-March/000811.html.
The implementation takes
sample at least maximumSize values or a 10% sample, whichever is larger
but is then criticised for giving different breaks for n > 3000 because of sampling, see for example https://gis.stackexchange.com/questions/311785/qgis-gratuated-symbology-gives-differents-results-on-every-classification-with-n which helped me find the circularity. The timings are:
> n <- 3000
> set.seed(1)
> x <- runif(n)
> system.time(ci <- classIntervals(x, n=10, style="kmeans"))
user system elapsed
0.004 0.000 0.004
> system.time(ci <- classIntervals(x, n=10, style="jenks"))
user system elapsed
8.548 0.001 8.570
> system.time(ci <- classIntervals(x, n=10, style="fisher"))
user system elapsed
0.043 0.000 0.044
with the "fisher" timings for n=3000 OK, I think. For n ~ 20K, QGIS samples 3000 first (3000 > 20K*0.1), but will still degrade for large 10% values. Should we optionally/default modify "fisher" to use the same size sample as QGIS does? Should we (silently?) use "fisher" when the user requests "jenks"? See also the comments in the "jenks" code on undeclared interval closure in the implementations.
See also this useful link (data size I think not mentioned): http://www.irlogi.ie/wp-content/uploads/2016/11/NUIM_ChoroHarmful.pdf
This https://github.com/r-spatial/classInt/commit/566be0e929b259555d5d6d29bae4ab9b6525df6e introduces QGIS-like sampling, but permits the user to pass through the sampling limit (3000L default), and the sample proportion used (0.1 default), for "fisher" and "jenks". When using "jenks" with n > sampling limit, the user is also advised to prefer "fisher". The commit also prepares the saved test output for R 3.6, so classInt now depends on R >= 3.6. The R >= 3.6 dependence is the cause of the Travis failure.
I strongly advice against making classInt depend on R >= 3.6; many OSX or windows users can't control the R version on their machine, and would not be able to install binaries of classInt, and by that neither of packages that depend on it (sf and stars come to mind).
I also got the email from BDR this morning; a more user friendly way of dealing with it is to have the tests that depend on sample
use the pre-3.6 RNG.
suppressWarnings(RNGversion("3.5.3"))
seems to be the incantation; see here and look for sfc.R
(which tests sf::st_sample
).
I've reverted the dependency on R 3.6 for classInt, using set.seed(..., sample.kind="Rounding")
in the test when R > 3.5.3 (https://github.com/r-spatial/classInt/commit/ea555be5b8d916865f7de86bec65fa31b10e07e4). I'm assuming there will not be an R 3.5.4. I accept that getting everyone over to 3.6 will take time. Now passes Travis.
Is the change to use sampling for large n OK?
Is the change to use sampling for large n OK?
I think it makes sense.
@marcbeldata could you please check using remotes::install_github("r-spatial/classInt") on your workflows, preferring "fisher" to "jenks"? I need to submit to CRAN this week for other reasons.
It's working for me:
I'll test in more datasets during the following days or next week.
Thanks Roger.
OK, thanks! I'm looking to submit on Friday, as the R integer sampler default changes in R 3.6, and affected packages have been asked to adapt to this change to avoid test output changing by mid-May latest. R 3.6.0 is coming on Friday, hence the rush.
If you need to control sampling affecting intervals, you can use set.seed(...)
before calling classIntervals()
, as you might with "bclust" or "kmeans" styles. I've only applied sampling to "jenks" and "fisher".
classInt 0.3-3 submitted to CRAN 2019-04-26 10:19.
classInt_0.3-3.tar.gz is on its way to CRAN 2019-04-26 10:33.
The warning for large N now also happens e.g. when breaks = equal
, where - I assume - no sampling takes place. Is that intentional?
Fix pushed, could you check?
Works:
> classInt::classIntervals(1:1e7, 10, "equal")
style: equal
[1,1000001) [1000001,2000001) [2000001,3000001) [3000001,4000001)
1000000 1000000 1000000 1000000
[4000001,5e+06) [5e+06,6e+06) [6e+06,7e+06) [7e+06,8e+06)
1000000 1000000 1000000 1000000
[8e+06,9e+06) [9e+06,1e+07]
1000000 1000000
Do you need an early release?
I had in mind (changed code, but didn't push yet) to hard-code suppressing the warning, always, in stars::plot.stars
and sf::plot.sf
I will revert that, since the default methods (sf: equal, stars: quantile) now don't warn in case of large maps. No early release needed.
Let me know when you need it (before next stars or sf release?).
Will do!
Hi,
There are several places pointing out this problem.
In classIntervals() using "jenks" style consumes a lot of time (several minutes), compare to the other intervals in the package, wich are immediate. Compare:
classIntervals(values(your_data, n = 5, style = "pretty")
classIntervals(values(your_data, n = 5, style = "jenks")
This is a very common interval in gis, and it's intantaneous in QGIS, ArcGIS or CARTO, for example.
I'm using MacOS and RStudio, last versions.
Thanks.