kadyb / rgugik

Download datasets from Polish Head Office of Geodesy and Cartography
https://kadyb.github.io/rgugik/
Other
33 stars 4 forks source link

Remove WHERE sql arg from "orto_request"? #13

Closed kadyb closed 4 years ago

kadyb commented 4 years ago

I implemented WHERE SQL clause in orto_request. However, this is much slower than filtering on R client side.

library("rgugik")
library("sf")

polygon_path = system.file("datasets/search_area.gpkg", package = "rgugik")
polygon = read_sf(polygon_path)

# no filtering
start = Sys.time()
req_df = orto_request(polygon)
Sys.time() - start
#> Time difference of 0.326297 secs

# example 1 SQL
start = Sys.time()
req_df = orto_request(polygon, where = "kolor LIKE 'CIR'")
Sys.time() - start
#> Time difference of 4.729147 secs

# example 1 R
start = Sys.time()
req_df = orto_request(polygon)
req_df = req_df[req_df$kolor == "CIR", ]
Sys.time() - start
#> Time difference of 0.53417 secs

# example 2 SQL
start = Sys.time()
req_df = orto_request(polygon, where = "piksel <= 0.25 AND akt_rok >= 2016")
Sys.time() - start
#> Time difference of 43.1518 secs

# example 2 R
start = Sys.time()
req_df = orto_request(polygon)
req_df = req_df[req_df$piksel <= 0.25 & req_df$akt_rok >= 2016, ]
Sys.time() - start
#> Time difference of 0.384141 secs

Removing the WHERE clause will have three advantages:

  1. The user will not be able to use the slower method and probably the GUGiK server will be less overloaded.
  2. The SQL database attributes are in Polish, so an English user may have problems with that. Using filtering in R, everything will be in English.
  3. Simplification of this function.

@Nowosad what do you think?

Nowosad commented 4 years ago

Thanks. It is great that you created the benchmarks. Based on them, my opinion is:

  1. To remove the WHERE clause.
  2. To add some other arguments (e.g. year, pixel) - the orto_request() could internally filter the database based on the provided arguments.
kadyb commented 4 years ago

WHERE clause removed from orto_request in cf0c9c0 and dbc64a7.

kadyb commented 4 years ago

It's strange, but for elevation data, the SQL and R side query are equally fast (no time difference). This is probably a bug on the GUGiK services.