ropensci / osmdata

R package for downloading OpenStreetMap data
https://docs.ropensci.org/osmdata
317 stars 46 forks source link

available_tags does not work on some features (aeroway, barrier,...) #207

Closed espinielli closed 4 years ago

espinielli commented 4 years ago

Due to the way the feature tables in the OSM Features page are constructed the following line in available_tags() does not always work

        tags <- rvest::html_nodes (pg, sprintf("a[title^='Tag:%s']", feature))

In fact the tables for aeroway, barrier ... are constructed differently (they are autogenerated) compared to the others and hence the a[title^='Tag:%s'] fails to retrieve all the tags.

The tags for aeroway are 12 in the Feature Page, but only 1 is returned:

library(osmdata)
#> Warning: package 'osmdata' was built under R version 3.6.2
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
available_tags("aeroway")
#> [1] "hangar"

Created on 2020-02-12 by the reprex package (v0.3.0)

available_tags fails on:

  1. aeroway
  2. barrier
  3. craft
  4. emergency
  5. geological
  6. historic
  7. leisure
  8. man_made/man made
  9. military
  10. natural
  11. office
  12. place
  13. public_transport/public transport
  14. sport
  15. telecom
  16. tourism
  17. waterway
espinielli commented 4 years ago

Probably it is best to retrieve available keys and tags via the taginfo API:

keys responses are paged with a max of 999 with a total attribute indicating the overall amount of entries. The sequence of the remaining page numbers is 2:(total %% 999) and each page can be retrieved after the first and combined together.

espinielli commented 4 years ago

The following experimental code could be used to retrieve all keys:

library(stringr)
library(jsonlite)
library(httr)
library(dplyr)

pagenum <- 1
url <- "https://taginfo.openstreetmap.org/api/4/keys/all?page={pagenum}&rp=999"
u <- str_glue(url)
g <- httr::GET(u)
if (httr::status_code(g) == 200) {
  c <- httr::content(g, as = "text") %>% jsonlite::fromJSON()
  p <- c$data %>% as_tibble()
  total <- c$total
  pages <- 2:(total %% 999)
  str_glue(url, pagenum = pages) %>% 
    purrr::map_dfr(.f = function(u) {
      httr::GET(u) %>% 
        httr::content(as = "text") %>%
        jsonlite::fromJSON() %>%
        `[[`("data") %>%
        as_tibble()
    }) %>% 
    bind_rows(p)
} 

TODO

ISSUES

mpadge commented 4 years ago

Thanks @espinielli, that's a very good catch there. The available_...() hasn't been updated for way too long, and you've offered a brilliant extension. If you are up to a PR, I'd happily incorporate your suggestions, but noting that the original intention of those functions was to reflect the "official" lists on https://wiki.openstreetmap.org/wiki/Map_Features, rather than the actual complete lists as recorded in the entire database, and as provided by the taginfo service. I still think this is a sensible approach according to a working hypothesis that these lists of tags are most commonly used to re-map any "rogue" tags into standard values. That said, the ability to extract all tags used in the entire database is also obviously cool and potentially important in some cases, so how about keeping that original functionality by default, but allowing for an additional argument to both functions, taginfo = FALSE, which enables full values when TRUE? You could then do a PR for the TRUE case.

That would then need the following:

  1. An explanation of what the taginfo parameter means and does
  2. An indication that the extraction may take quite some time.
  3. A reduction of dependencies from the way you've currently written the otherwise excellent code. This is really only stringr and purrr. The former is easy to remove, and the latter would have the advantage that you'd be able to insert a utils::txtProgressBar within a standard loop, which would be really helpful because, yes, it really does take quite some time.
  4. A trivial modification for value extraction via
    url <- sprintf ("https://taginfo.openstreetmap.org/api/4/key/values?key=%s&page={pagenum}&rp=999", value)

Your initial issue can then be solved by re-writing the code for taginfo = FALSE to something like the following:

library (magrittr)
url_ftrs <- "https://wiki.openstreetmap.org/wiki/Map_Features"
pg <- xml2::read_html (httr::GET (url_ftrs))
taglists <- rvest::html_nodes (pg, "div[class='taglist']") %>%
    rvest::html_attr ("data-taginfo-taglist-tags")
taglists <- lapply (taglists, function (i) {
                        temp <- strsplit (i, "=") [[1]]
                        res <- NULL
                        if (length (temp) == 2) {
                            res <- strsplit (temp [2], ",")
                            names (res) <- temp [1]
                        }
                        return (res)    })
taglists [vapply (taglists, is.null, logical (1))] <- NULL
head (taglists)
#> [[1]]
#> [[1]]$aeroway
#>  [1] "aerodrome"     "apron"         "gate"          "hangar"       
#>  [5] "helipad"       "heliport"      "navigationaid" "runway"       
#>  [9] "spaceport"     "taxiway"       "terminal"      "windsock"     
#> 
#> 
#> [[2]]
#> [[2]]$barrier
#>  [1] "cable_barrier"  "city_wall"      "ditch"          "fence"         
#>  [5] "guard_rail"     "handrail"       "hedge"          "kerb"          
#>  [9] "retaining_wall" "wall"          
#> 
#> 
#> [[3]]
#> [[3]]$barrier
#>  [1] "block"                 "bollard"               "border_control"       
#>  [4] "bump_gate"             "bus_trap"              "cattle_grid"          
#>  [7] "chain"                 "cycle_barrier"         "debris"               
#> [10] "entrance"              "full-height_turnstile" "gate"                 
#> [13] "hampshire_gate"        "height_restrictor"     "horse_stile"          
#> [16] "jersey_barrier"        "kissing_gate"          "lift_gate"            
#> [19] "log"                   "motorcycle_barrier"    "rope"                 
#> [22] "sally_port"            "spikes"                "stile"                
#> [25] "sump_buster"           "swing_gate"            "toll_booth"           
#> [28] "turnstile"             "yes"                  
#> 
#> 
#> [[4]]
#> [[4]]$craft
#>  [1] "agricultural_engines"    "atelier"                
#>  [3] "bakery"                  "basket_maker"           
#>  [5] "beekeeper"               "blacksmith"             
#>  [7] "boatbuilder"             "bookbinder"             
#>  [9] "brewery"                 "builder"                
#> [11] "cabinet_maker"           "carpenter"              
#> [13] "carpet_layer"            "caterer"                
#> [15] "chimney_sweeper"         "clockmaker"             
#> [17] "confectionery"           "cooper"                 
#> [19] "dental_technician"       "distillery"             
#> [21] "door_construction"       "dressmaker"             
#> [23] "electronics_repair"      "embroiderer"            
#> [25] "electrician"             "engraver"               
#> [27] "floorer"                 "gardener"               
#> [29] "glaziery"                "grinding_mill"          
#> [31] "handicraft"              "hvac"                   
#> [33] "insulation"              "jeweller"               
#> [35] "joiner"                  "key_cutter"             
#> [37] "locksmith"               "metal_construction"     
#> [39] "mint"                    "musical_instrument"     
#> [41] "oil_mill"                "optician"               
#> [43] "organ_builder"           "painter"                
#> [45] "parquet_layer"           "photographer"           
#> [47] "photographic_laboratory" "piano_tuner"            
#> [49] "plasterer"               "plumber"                
#> [51] "pottery"                 "printer"                
#> [53] "printmaker"              "rigger"                 
#> [55] "roofer"                  "saddler"                
#> [57] "sailmaker"               "sawmill"                
#> [59] "scaffolder"              "sculptor"               
#> [61] "shoemaker"               "signmaker"              
#> [63] "stand_builder"           "stonemason"             
#> [65] "sun_protection"          "tailor"                 
#> [67] "tiler"                   "tinsmith"               
#> [69] "toolmaker"               "turner"                 
#> [71] "upholsterer"             "watchmaker"             
#> [73] "water_well_drilling"     "window_construction"    
#> [75] "winery"                 
#> 
#> 
#> [[5]]
#> [[5]]$emergency
#> [1] "ambulance_station"       "defibrillator"          
#> [3] "landing_site"            "emergency_ward_entrance"
#> 
#> 
#> [[6]]
#> [[6]]$emergency
#> [1] "dry_riser_inlet"   "fire_alarm_box"    "fire_extinguisher"
#> [4] "fire_hose"         "fire_hydrant"      "water_tank"       
#> [7] "suction_point"

Created on 2020-02-13 by the reprex package (v0.3.0)

The whole list can then be returned by default from available_tags by simply changing https://github.com/ropensci/osmdata/blob/d0f30d3df0b2eac4b5e5243e8df5a79ab4db3115/R/features.R#L43 to

available_tags = function (feature = NULL)

and only sub-selecting for non-NULL values of feature.


If you do a PR, then I'd be more than happy to add you as an official package contributor, because this definitely adds entirely new functionality. Thanks very much for opening the issue, and for already thinking so deeply about an excellent solution :+1: :rocket:

espinielli commented 4 years ago

@mpadge I like and subscribe to your proposal to extend the available_() functions with the extra taginfo argument.

So I would say the work has two parts:

  1. fix broken available_() functions so as to return what is the official lists of keys and taga as per https://wiki.openstreetmap.org/wiki/Map_Features, along the lines of the code you proposed above
  2. extend available_() function with the extra taginfo as you proposed

If ok with you I would create a separate issue for the extension and keep this one for the fix (or you can do it 😄).

If ok, I would give a try for the relevant PRs for both.

mpadge commented 4 years ago

Thanks for all the great input and the PRs @espinielli. Just an FYI that i've broken an :muscle: and will be out of action for a little while. I'll get back onto this as soon as I return to normal capabilities.

espinielli commented 4 years ago

no rush, look after your health first and good luck!

On Mon, Feb 17, 2020 at 7:42 PM mark padgham notifications@github.com wrote:

Thanks for all the great input and the PRs @espinielli https://github.com/espinielli. Just an FYI that i've broken an 💪 and will be out of action for a little while. I'll get back onto this as soon as I return to normal capabilities.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ropensci/osmdata/issues/207?email_source=notifications&email_token=AAGZWLCOWXNQQGGCN6OTG6LRDLLA3A5CNFSM4KT5SQI2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEL7LEZA#issuecomment-587117156, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGZWLCXHLH4CPBPEVSWC7TRDLLA3ANCNFSM4KT5SQIQ .

mpadge commented 4 years ago

@espinielli update: slowly getting there and hoping to be on to your issues & PRs next week :smile: