ropensci / osmdata

R package for downloading OpenStreetMap data
https://docs.ropensci.org/osmdata
314 stars 45 forks source link

add C++ implementation of osmdata_data_frame #298

Closed mpadge closed 1 year ago

mpadge commented 1 year ago

@jmaspons Here is a C++ implemtation of osmdata_data_frame. The following code includes the final R function which can replace your current code for xml_to_df() if you think this is okay, and if you approve this PR.

library(xml2)
# devtools::load_all (export_all = TRUE)
library (osmdata)
#> ℹ Loading osmdata
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
q0 <- opq (bbox = c(-0.27, 51.47, -0.20, 51.50))
q1 <- add_osm_feature (q0, key = "name", value = "River Thames")

f <- test_path ("export.osm")
if (!file.exists (f)) {
    doc <- osmdata_xml (q1, filename = f)
}
doc <- xml2::read_xml (f)

impl <- list ()
impl$preallocate_sel <- function (doc) {
    xml_to_df (doc)
}
impl$get_osmdata_df <- function (doc) {
    xml_to_df_cpp (doc)
}
impl <- lapply(impl, compiler::cmpfun)

mb <- microbenchmark::microbenchmark (
    impl$preallocate_sel (doc),
    impl$get_osmdata_df (doc),
    times = 20L,
    check = NULL # Not equal, because 'preallocate_sel' does not return 'osm_id' column
)
ggplot2::autoplot (mb)
#> Coordinate system already present. Adding new coordinate system, which will
#> replace the existing one.

Created on 2022-12-30 with reprex v2.0.2

mpadge commented 1 year ago

@jmaspons A couple of notes:

  1. I'll request that you review this code once you've commented in here, and had a go yourself to see if you find any issues.
  2. The main difference to your xml_to_df() function is that yours has the explicit code for timestamps, changesets, and the other stuff related to (a)diff queries. I haven't actually run the new version, but the C++ code should simply pick all of those values up anytime they are present in the doc, so (I'm pretty sure ...) this version does not need that code.
  3. The test coverage can all be fixed up by removing the body of your xml_to_df, and replacing with body of xml_to_df_cpp.

Let me know what you think. Thanks :+1:

jmaspons commented 1 year ago

Cool, less carbon to the atmosphere :) @mpadge ! I'm reviewing the code and will response asap, but it's probably ok as all tests pass

The behaviour you describe in your 2nd point about the timestamps, changesets, & Co. , is true for all other osmdata_* implemented in c++? If not, would be a good improvement. If yes, we can fix the opq docs that state that out="meta" only works for osmdata_xml and osmdata_data_frame, and perhaps add some tests.

mpadge commented 1 year ago

@jmaspons Thanks so much for your review of the code. Note that the above commit restores all of the C++ dimnames construction lines. I had already accepted your suggestions to remove all of those lines, but forgot that the restructre_kv_mat function, which is called at the end of all functions here, actually needs and uses that information: https://github.com/ropensci/osmdata/blob/e9dedbb79d0995f259caa3ea3e79648f19515158/src/convert-osm-rcpp.cpp#L149-L150 So all has now been reinstated.

mpadge commented 1 year ago

@jmaspons And the above commit acually reverts back to your suggestion for all C++ internal functions to directly return Rcpp::DataFrame objects. There was a reason why those were all passed by reference in the other (sf/sc) versions, but I can't remember what that was, and in this case, it's definitely cleaner to do it as you suggested, so thanks.

jmaspons commented 1 year ago

Good! I'm testing the correctness of the implementation. For the basic case without metadata nor adiff queries it produces the same results as a version in main except for the column order (suggested but more work in the c++ side required), row names (suggested above) and row order (it's ok as is).

For queries with metadata, it should be extracted from the attributes of each object, as for the osmid (eg. <way id="1056516701" version="1" timestamp="2022-05-01T21:51:18Z" changeset="120436630" uid="508" user="Welshie">). Then I rename attributes prepending `osm` to the metadata columns. Perhaps is a good idea to prepend also a _ to reduce the chances of clashes with key names, at least in the c++ party (it already happens with osm_id)

Adiff queries also require more work, but perhaps we can skip this for now, as the xml structure is quite different from regular queries.

Here is the code I used for testing:

## Correctness ----
resR <- impl$preallocate_sel (doc)
resCpp <- impl$get_osmdata_df (doc)

# all.equal(resR, resCpp)
setequal(names(resR), names(resCpp))
#> TRUE

# Fix differences
resCpp <- resCpp[, names(resR)]
resCpp <- resCpp[order(resCpp$osm_type, resCpp$osm_id), ]
resR <- resR[order(resR$osm_type, resR$osm_id), ]
rownames(resCpp) <- NULL
rownames(resR) <- NULL

all.equal(resR, resCpp)
#> TRUE

### For meta queries ----

qmeta <- opq (bbox = c(-0.27, 51.47, -0.20, 51.50), out = "meta")
qmeta <- add_osm_feature (qmeta, key = "name", value = "River Thames")

f <- test_path ("export_meta.osm")
if (!file.exists (f)) {
    doc_meta <- osmdata_xml (qmeta, filename = f)
}
doc_meta <- xml2::read_xml (f)

resR <- impl$preallocate_sel (doc_meta)
resCpp <- impl$get_osmdata_df (doc_meta)

# all.equal(resR, resCpp)
# setequal(names(resR), names(resCpp))
setdiff(names(resR), names(resCpp))
#> [1] "osm_version"   "osm_timestamp" "osm_changeset" "osm_uid"       "osm_user"  
setdiff(names(resCpp), names(resR))
#> character(0)

### For adiff queries ----

qadiff <- opq (bbox = c(-0.27, 51.47, -0.20, 51.50), out = "tags", adiff = TRUE,
               datetime = "2020-01-01T00:00:00Z")
qadiff <- add_osm_feature (qadiff, key = "name", value = "River Thames")

f <- test_path ("export_adiff.osm")
if (!file.exists (f)) {
    doc_adiff <- osmdata_xml (qadiff, filename = f)
}
doc_adiff <- xml2::read_xml (f)

resR <- osmdata:::xml_adiff_to_df (doc_adiff, datetime_from = "2020-01-01T00:00:00Z", datetime_to = "20223-01-01T00:00:00Z")
resCpp <- impl$get_osmdata_df (doc_adiff)

# all.equal(resR, resCpp)
# setequal(names(resR), names(resCpp))
setdiff(names(resR), names(resCpp))
#> [1] "adiff_action"  "adiff_date"    "adiff_visible"
setdiff(names(resCpp), names(resR))
#> character(0)
mpadge commented 1 year ago

@jmaspons Finally got some time to work on this again. The metadata can be extracted in C++ for no extra overhead, so i want to do all now, then same for adiff data, then i'll merge this. The above commit starts by extracting metadata for nodes, providing a template for the rest.

mpadge commented 1 year ago

@jmaspons The above commits get us this far:

devtools::load_all (export_all = TRUE)
#> ℹ Loading osmdata
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright

impl <- list ()
impl$preallocate_sel <- function (doc) {
    xml_to_df (doc)
}
impl$get_osmdata_df <- function (doc) {
    xml_to_df_cpp (doc)
}
impl <- lapply(impl, compiler::cmpfun)

### No metadata -----

q0 <- opq (bbox = c(-0.27, 51.47, -0.20, 51.50))
q1 <- add_osm_feature (q0, key = "name", value = "River Thames")

f <- test_path ("export.osm")
if (!file.exists (f)) {
    doc <- osmdata_xml (q1, filename = f)
}
doc <- xml2::read_xml (f)

resR <- impl$preallocate_sel (doc)
resCpp <- impl$get_osmdata_df (doc)

waldo::compare (resR, resCpp)
#> ✔ No differences

### With  metadata -----

qmeta <- opq (bbox = c(-0.27, 51.47, -0.20, 51.50), out = "meta")
qmeta <- add_osm_feature (qmeta, key = "name", value = "River Thames")

f <- test_path ("export_meta.osm")
if (!file.exists (f)) {
    doc_meta <- osmdata_xml (qmeta, filename = f)
}
doc_meta <- xml2::read_xml (f)

resR <- impl$preallocate_sel (doc_meta)
resCpp <- impl$get_osmdata_df (doc_meta)

waldo::compare (resR, resCpp)
#> ✔ No differences

Created on 2023-01-16 with reprex v2.0.2

TODO:


Edit: Can't do adiff, because those data are indeed categorically different, and would require a separate implementation of the entire XmlData class used to extract all data from the OSM files: https://github.com/ropensci/osmdata/blob/9491714de1d7d439539a420b01f0aa56701b647d/src/osmdata.h#L129

The current R code from @jmaspons is plenty efficient enough for adiff data anyway, so it'll just be left as R code for now.