Closed mpadge closed 1 year ago
@jmaspons A couple of notes:
xml_to_df()
function is that yours has the explicit code for timestamps, changesets, and the other stuff related to (a)diff queries. I haven't actually run the new version, but the C++ code should simply pick all of those values up anytime they are present in the doc
, so (I'm pretty sure ...) this version does not need that code.xml_to_df
, and replacing with body of xml_to_df_cpp
.Let me know what you think. Thanks :+1:
Cool, less carbon to the atmosphere :) @mpadge ! I'm reviewing the code and will response asap, but it's probably ok as all tests pass
The behaviour you describe in your 2nd point about the timestamps, changesets, & Co. , is true for all other osmdata_* implemented in c++? If not, would be a good improvement. If yes, we can fix the opq
docs that state that out="meta"
only works for osmdata_xml
and osmdata_data_frame
, and perhaps add some tests.
@jmaspons Thanks so much for your review of the code. Note that the above commit restores all of the C++ dimnames
construction lines. I had already accepted your suggestions to remove all of those lines, but forgot that the restructre_kv_mat
function, which is called at the end of all functions here, actually needs and uses that information: https://github.com/ropensci/osmdata/blob/e9dedbb79d0995f259caa3ea3e79648f19515158/src/convert-osm-rcpp.cpp#L149-L150
So all has now been reinstated.
@jmaspons And the above commit acually reverts back to your suggestion for all C++ internal functions to directly return Rcpp::DataFrame
objects. There was a reason why those were all passed by reference in the other (sf/sc) versions, but I can't remember what that was, and in this case, it's definitely cleaner to do it as you suggested, so thanks.
Good! I'm testing the correctness of the implementation. For the basic case without metadata nor adiff queries it produces the same results as a version in main except for the column order (suggested but more work in the c++ side required), row names (suggested above) and row order (it's ok as is).
For queries with metadata, it should be extracted from the attributes of each object, as for the osmid (eg. <way id="1056516701" version="1" timestamp="2022-05-01T21:51:18Z" changeset="120436630" uid="508" user="Welshie">
). Then I rename attributes prepending `osm` to the metadata columns. Perhaps is a good idea to prepend also a _ to reduce the chances of clashes with key names, at least in the c++ party (it already happens with osm_id)
Adiff queries also require more work, but perhaps we can skip this for now, as the xml structure is quite different from regular queries.
Here is the code I used for testing:
## Correctness ----
resR <- impl$preallocate_sel (doc)
resCpp <- impl$get_osmdata_df (doc)
# all.equal(resR, resCpp)
setequal(names(resR), names(resCpp))
#> TRUE
# Fix differences
resCpp <- resCpp[, names(resR)]
resCpp <- resCpp[order(resCpp$osm_type, resCpp$osm_id), ]
resR <- resR[order(resR$osm_type, resR$osm_id), ]
rownames(resCpp) <- NULL
rownames(resR) <- NULL
all.equal(resR, resCpp)
#> TRUE
### For meta queries ----
qmeta <- opq (bbox = c(-0.27, 51.47, -0.20, 51.50), out = "meta")
qmeta <- add_osm_feature (qmeta, key = "name", value = "River Thames")
f <- test_path ("export_meta.osm")
if (!file.exists (f)) {
doc_meta <- osmdata_xml (qmeta, filename = f)
}
doc_meta <- xml2::read_xml (f)
resR <- impl$preallocate_sel (doc_meta)
resCpp <- impl$get_osmdata_df (doc_meta)
# all.equal(resR, resCpp)
# setequal(names(resR), names(resCpp))
setdiff(names(resR), names(resCpp))
#> [1] "osm_version" "osm_timestamp" "osm_changeset" "osm_uid" "osm_user"
setdiff(names(resCpp), names(resR))
#> character(0)
### For adiff queries ----
qadiff <- opq (bbox = c(-0.27, 51.47, -0.20, 51.50), out = "tags", adiff = TRUE,
datetime = "2020-01-01T00:00:00Z")
qadiff <- add_osm_feature (qadiff, key = "name", value = "River Thames")
f <- test_path ("export_adiff.osm")
if (!file.exists (f)) {
doc_adiff <- osmdata_xml (qadiff, filename = f)
}
doc_adiff <- xml2::read_xml (f)
resR <- osmdata:::xml_adiff_to_df (doc_adiff, datetime_from = "2020-01-01T00:00:00Z", datetime_to = "20223-01-01T00:00:00Z")
resCpp <- impl$get_osmdata_df (doc_adiff)
# all.equal(resR, resCpp)
# setequal(names(resR), names(resCpp))
setdiff(names(resR), names(resCpp))
#> [1] "adiff_action" "adiff_date" "adiff_visible"
setdiff(names(resCpp), names(resR))
#> character(0)
@jmaspons Finally got some time to work on this again. The metadata can be extracted in C++ for no extra overhead, so i want to do all now, then same for adiff data, then i'll merge this. The above commit starts by extracting metadata for nodes, providing a template for the rest.
@jmaspons The above commits get us this far:
devtools::load_all (export_all = TRUE)
#> ℹ Loading osmdata
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
impl <- list ()
impl$preallocate_sel <- function (doc) {
xml_to_df (doc)
}
impl$get_osmdata_df <- function (doc) {
xml_to_df_cpp (doc)
}
impl <- lapply(impl, compiler::cmpfun)
### No metadata -----
q0 <- opq (bbox = c(-0.27, 51.47, -0.20, 51.50))
q1 <- add_osm_feature (q0, key = "name", value = "River Thames")
f <- test_path ("export.osm")
if (!file.exists (f)) {
doc <- osmdata_xml (q1, filename = f)
}
doc <- xml2::read_xml (f)
resR <- impl$preallocate_sel (doc)
resCpp <- impl$get_osmdata_df (doc)
waldo::compare (resR, resCpp)
#> ✔ No differences
### With metadata -----
qmeta <- opq (bbox = c(-0.27, 51.47, -0.20, 51.50), out = "meta")
qmeta <- add_osm_feature (qmeta, key = "name", value = "River Thames")
f <- test_path ("export_meta.osm")
if (!file.exists (f)) {
doc_meta <- osmdata_xml (qmeta, filename = f)
}
doc_meta <- xml2::read_xml (f)
resR <- impl$preallocate_sel (doc_meta)
resCpp <- impl$get_osmdata_df (doc_meta)
waldo::compare (resR, resCpp)
#> ✔ No differences
Created on 2023-01-16 with reprex v2.0.2
Edit: Can't do adiff, because those data are indeed categorically different, and would require a separate implementation of the entire XmlData
class used to extract all data from the OSM files:
https://github.com/ropensci/osmdata/blob/9491714de1d7d439539a420b01f0aa56701b647d/src/osmdata.h#L129
The current R code from @jmaspons is plenty efficient enough for adiff data anyway, so it'll just be left as R code for now.
@jmaspons Here is a C++ implemtation of
osmdata_data_frame
. The following code includes the final R function which can replace your current code forxml_to_df()
if you think this is okay, and if you approve this PR.Created on 2022-12-30 with reprex v2.0.2