kadyb / rgugik

Download datasets from Polish Head Office of Geodesy and Cartography
https://kadyb.github.io/rgugik/

Limits for `geocodePL_get` #41

Closed BERENZ closed 4 years ago

BERENZ commented 4 years ago

Is there a limit on the number of queries, or a required interval between them, when geocoding with geocodePL_get? I tried to find this information on the GUGiK website, but without success.

kadyb commented 4 years ago

I have not found such information anywhere either. There are probably no such restrictions. However, in my experience, GUGiK's servers and services are problematic. I think the safe solution would be to set some interval (maybe 1 s?) between requests.

BTW: At this moment the geocodePL_get() function needs some output improvements (#11).

BERENZ commented 4 years ago

Ok, I understand. Maybe you could contact GUGIK's staff to ask about the limitations?

BTW, would it be possible for geocodePL_get() to return an sf object instead of a list? That would be super useful for speeding up processing and merging with other data.

kadyb commented 4 years ago

OK, I will write a message asking if there are limits on the number of requests and the time between them.

Yes. This is a very good idea. There is definitely room for improvement. Currently we don't have time to do it, but I will definitely keep it in mind in the future.

Edit: I sent the email.

BERENZ commented 4 years ago

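OK, so here is a small proposal that combines the results of geocodePL_get into an sf object.

library(rgugik)
library(sf)

output <- geocodePL_get(address = "Marki")

# A single result comes back as a flat list of attributes (each of length 1);
# multiple results come back as a list of such lists.
if (sapply(output, length)[1] == 1) {
  # one result: bind the attributes into a one-row data frame
  df <- as.data.frame(do.call(cbind, output), stringsAsFactors = FALSE)
  df$geometry_wkt <- NULL
  df <- st_as_sf(x = df, coords = c("x", "y"), crs = 2180)
} else {
  # several results: build one row per result, then stack the rows
  df <- lapply(output, FUN = function(x) as.data.frame(do.call(cbind, x), stringsAsFactors = FALSE))
  df <- do.call('rbind', df)
  df$geometry_wkt <- NULL
  df <- st_as_sf(x = df, coords = c("x", "y"), crs = 2180)
}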

Here's how it works

  1. For multiple results
> output <- geocodePL_get(address = "Marki") ## list of 10
> df[,1:5]

Simple feature collection with 10 features and 5 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 469003.1 ymin: 193553.4 xmax: 710402 ymax: 631605
CRS:            EPSG:2180
    city  teryt    simc  voivodeship            county                  geometry
1  Marki 100103 0538774      łódzkie      bełchatowski POINT (523435.6 398347.3)
2  Marki 120702 0960993  małopolskie powiat limanowski POINT (576498.3 199686.8)
3  Marki 120709 0453724  małopolskie powiat limanowski POINT (583279.3 195401.8)
4  Marki 120711 0467212  małopolskie powiat limanowski POINT (597554.8 204842.4)
5  Marki 121508 0994934  małopolskie             suski   POINT (537196 193553.4)
6  Marki 143402 0920901  mazowieckie        wołomiński POINT (644467.9 498763.1)
7  Marki 160804 0143432     opolskie     powiat oleski POINT (469003.1 358536.7)
8  Marki 160804 0143366     opolskie            oleski POINT (469243.6 358790.1)
9  Marki 182001 0787721 podkarpackie      tarnobrzeski     POINT (685860 289770)
10 Marki 200602 0397167    podlaskie         kolneński     POINT (710402 631605)
  2. For one result
> output <- geocodePL_get(address = "Marki, Andersa")
> df[,1:5]

Simple feature collection with 1 feature and 5 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 643949.4 ymin: 499656.9 xmax: 643949.4 ymax: 499656.9
CRS:            EPSG:2180
   street  teryt    simc  ulic  city                  geometry
1 Andersa 143402 0920901 00285 Marki POINT (643949.4 499656.9)
  3. Also works for the other objects listed in the documentation
> output <- geocodePL_get(rail_crossing = "001 018 478")
> df[,1:5]

Simple feature collection with 1 feature and 4 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 620704.5 ymin: 478258.4 xmax: 620704.5 ymax: 478258.4
CRS:            EPSG:2180
          operator category            phone    mobile phone                  geometry
1 PKP PLK WARSZAWA        A +48 22 473 37 34 +48 600 084 183 POINT (620704.5 478258.4)

EDIT: if you like this proposal, I can prepare a PR updating geocodePL_get.R and test-geocodePL_get.R.

EDIT2: I don't know how to use the geometry_wkt element, which holds the geometry itself; using it may be a better idea than coords = c("x", "y").

kadyb commented 4 years ago

I looked at your code (but I didn't test it). Maybe we can simplify it?

output = geocodePL_get(address = "Marki")
df_output = do.call(rbind.data.frame, output)
# use "geometry_wkt"
df_output = sf::st_as_sf(df_output, wkt = "geometry_wkt", crs = 2180)

Also, in geocodePL_get.R, we can remove

if (length(output) == 1) {
  output = output[[1]]
}

so a nested list will always be returned. Then we can drop the length condition (in your code) or just use rbind.data.frame.

One question: what if a column (attribute) is empty (NULL)? Will the function even work? The next point is that we should select only the relevant columns at the end (#11). One more thing: there will probably be some duplicated code, so we should create a helper function.
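To make the NULL question concrete, here is a minimal base-R sketch of one way to row-bind results while turning empty attributes into NA. It assumes each result is a named list of scalar values; `results_to_df` is a hypothetical helper name, not something that exists in rgugik, and all values are coerced to character for simplicity.

```r
# Hypothetical helper (not part of rgugik): row-bind a nested list of results,
# replacing NULL or zero-length attributes with NA.
results_to_df <- function(results) {
  # every attribute name that appears in at least one record
  fields <- unique(unlist(lapply(results, names)))
  rows <- lapply(results, function(rec) {
    vals <- lapply(fields, function(f) {
      v <- rec[[f]]
      # a missing or empty attribute becomes NA instead of breaking the bind
      if (is.null(v) || length(v) == 0) NA_character_ else as.character(v)[1]
    })
    names(vals) <- fields
    as.data.frame(vals, stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Usage with made-up records: the second one has an empty "x" attribute
records <- list(
  list(city = "Marki", x = "644467.9", y = "498763.1"),
  list(city = "Kolno", x = NULL, y = "631605")
)
df_records <- results_to_df(records)
```

Because every row is built over the same set of fields, the final rbind cannot fail on mismatched columns; whether rows with NA coordinates should be kept or dropped is a separate decision.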

BERENZ commented 4 years ago

If you simplify it that way, a query that returns only one result gives incorrect output, see below:

> output <- geocodePL_get(address = "Marki, Andersa")
> df_output <- do.call(rbind.data.frame, output)
> df_output
1 Andersa
2 143402
3 0920901
4 00285
5 Marki
6 643949.3987
7 499656.945800001
8 LINESTRING(643691.7537 499759.7709,643714.492 499753.1515,643768.427 499731.363399999,643801.4207 499717.6074,643827.3306 499706.843599999,643949.3987 499656.945800001,644044.1973 499614.359099999,644077.5194 499600.2992,644169.6761 499559.555500001,644200.1808 499546.196699999,644271.0002 499515.1812,644276.6037 499513.287)
9 1
10 1
11 {Marki,143402}
> df_output = sf::st_as_sf(df_output, wkt = "geometry_wkt", crs = 2180)
Error in `[[<-.data.frame`(`*tmp*`, wkt, value = list()) : 
  replacement has 0 rows, data has 11

As for the NULL results, maybe they can be checked before applying these lines?

EDIT: I noticed that geocodePL_get(rail_crossing = "001 018 478") will give results without geometry_wkt so we cannot use wkt = "geometry_wkt" in sf::st_as_sf.
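A possible way around this, sketched under assumptions: use geometry_wkt when the column is present and fall back to the x/y coordinates otherwise. `to_sf` is a hypothetical helper name, and the sketch assumes the data frame always has x and y columns when geometry_wkt is missing.

```r
library(sf)

# Hypothetical helper (not part of rgugik): prefer the WKT geometry when it is
# available, otherwise build points from the x/y columns.
to_sf <- function(df, crs = 2180) {
  if ("geometry_wkt" %in% names(df) && !anyNA(df$geometry_wkt)) {
    sf::st_as_sf(df, wkt = "geometry_wkt", crs = crs)
  } else {
    # coordinates may arrive as character strings, so convert them first
    df$x <- as.numeric(df$x)
    df$y <- as.numeric(df$y)
    sf::st_as_sf(df, coords = c("x", "y"), crs = crs)
  }
}

# Usage with a made-up record that has no geometry_wkt column
crossing <- data.frame(category = "A", x = "620704.5", y = "478258.4",
                       stringsAsFactors = FALSE)
crossing_sf <- to_sf(crossing)
```

Note that the two branches can yield different geometries for the same record: in the "Marki, Andersa" example the WKT is a LINESTRING for the street, while x/y is a single point on it.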

kadyb commented 4 years ago

I think we should remove

if (length(output) == 1) {
  output = output[[1]]
}

in the source code and then use rbind.data.frame, because the output will then always be a nested list. But I could be wrong.
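Until that unwrapping is removed, the same effect could be approximated on the caller's side. This is only a sketch: `as_nested` is a hypothetical name, and it assumes a single result is a flat list whose first element is not itself a list.

```r
# Hypothetical helper (not part of rgugik): wrap a flat single-result list so
# that the output is always a list of records.
as_nested <- function(output) {
  if (length(output) > 0 && !is.list(output[[1]])) list(output) else output
}

# Usage with made-up data shaped like the outputs in this thread
single <- list(street = "Andersa", city = "Marki")
multi <- list(list(city = "Marki"), list(city = "Marki"))

length(as_nested(single))  # one record
length(as_nested(multi))   # two records, returned unchanged
```

With the shape normalized like this, the code that builds the data frame no longer needs a separate branch for the one-result case.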

kadyb commented 4 years ago

You can check for NULLs after

output = jsonlite::fromJSON(prepared_URL)[["results"]]

kadyb commented 4 years ago

> EDIT: I noticed that geocodePL_get(rail_crossing = "001 018 478") will give results without geometry_wkt so we cannot use wkt = "geometry_wkt" in sf::st_as_sf.

There is probably a geometry_wkt attribute; we are just not returning it in the output currently. https://github.com/kadyb/rgugik/blob/5e01945990da277cea72772194d9d5397faa6a36/R/geocodePL_get.R#L69-L71

BERENZ commented 4 years ago

OK, I will come back with some improvements by the end of this week.

kadyb commented 4 years ago

Response from GUGiK:

In response to your question, I would like to inform you that the service has an IP address blocking mechanism, which is triggered when a mass of requests is sent to the service's source server. This restriction is intended to protect the service at the application level against an excessive number of requests sent by a user, in particular DDoS attacks.

Should such a block be imposed on your IP address, please follow the instructions in the message that is displayed.

So we don't know what the limit is, but I think we can assume that there should be a 1 second delay between requests. If the limit is exceeded, the function will stop working (there will be an error in fromJSON()).
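A minimal sketch of what such spacing could look like on the user's side. The names `geocode_with_delay` and `geocode_fun` are hypothetical and not part of rgugik, and the 1 s default is only the guess from this thread, not a documented limit.

```r
# Hypothetical wrapper (not part of rgugik): call a geocoding function once per
# address, pausing between requests and keeping going if one request fails.
geocode_with_delay <- function(addresses, geocode_fun, delay = 1) {
  results <- vector("list", length(addresses))
  names(results) <- addresses
  for (i in seq_along(addresses)) {
    if (i > 1) Sys.sleep(delay)  # wait between consecutive requests
    value <- tryCatch(
      geocode_fun(addresses[i]),
      error = function(e) {
        # e.g. the error raised by fromJSON() when the IP has been blocked
        warning("Request failed for '", addresses[i], "': ",
                conditionMessage(e), call. = FALSE)
        NULL
      }
    )
    # assign via list() so that a NULL result does not delete the element
    results[i] <- list(value)
  }
  results
}

# Possible usage (needs network access and the rgugik package):
# res <- geocode_with_delay(
#   c("Marki", "Marki, Andersa"),
#   function(a) rgugik::geocodePL_get(address = a)
# )
```

Failed requests are kept as NULL entries under the address name, so they can be retried later instead of aborting the whole batch.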

kadyb commented 4 years ago

Fixed in https://github.com/kadyb/rgugik/pull/43.