ropensci / stplanr

Sustainable transport planning with R
https://docs.ropensci.org/stplanr
Other
417 stars 66 forks source link

Refactor rnet_breakup_vertices #451

Closed agila5 closed 3 years ago

agila5 commented 3 years ago

Hi! This is just the first draft and I will add more docs/examples/tests tomorrow (or in the free time ASAP). ATM I have just one question. I had to move data.table and sfheaders from Suggests to Imports, any objection?

Also please check the second commit, related to route().

agila5 commented 3 years ago

Some simple benchmarks. Current approach:

# install stplanr
remotes::install_github("ropensci/stplanr")

# get data
iow <- osmextract::oe_get("Isle of Wight", quiet = TRUE)
nrow(iow)
#> [1] 45422

# current approach
system.time(({
  iow_breakup <- stplanr::rnet_breakup_vertices(iow)
}))
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#>    user  system elapsed 
#>  210.87    4.30  247.69
nrow(iow_breakup)
#> [1] 80944

Created on 2020-12-02 by the reprex package (v0.3.0)

New approach:

# install stplanr
remotes::install_github("ropensci/stplanr", "6ae27d0b81370fe9b5838fe879cc983db929d55c")

# get data
iow <- osmextract::oe_get("Isle of Wight", quiet = TRUE)
nrow(iow)
#> [1] 45422

# current approach
system.time(({
  iow_breakup <- stplanr::rnet_breakup_vertices(iow)
}))
#> Splitting the input object at the internal points that are also boundaries.
#> Splitting rnet object at the duplicated internal points.
#>    user  system elapsed 
#>    7.11    0.66    8.73
nrow(iow_breakup)
#> [1] 80957

Created on 2020-12-02 by the reprex package (v0.3.0)

The outputs are not exactly identical for the reasons documented also in https://github.com/ropensci/stplanr/issues/416#issuecomment-727600757 (i.e. linestrings that intersect themselves aren't split with the current approach, so it's not surprising that the new approach returns more lines). I graphically compared the examples reported in the help page, and they look identical.

TODOs

I plan to finish the PR on Saturday.

Robinlovelace commented 3 years ago
247.69/8.73
#> [1] 28.37228

Created on 2020-12-03 by the reprex package (v0.3.0)

You just made my pkg 30 times faster :exploding_head: :tada: :tada: :tada:

Robinlovelace commented 3 years ago

Great work @agila5. All your suggestions look good, look forward to Sunday :+1: Only comment for now:

Add some tests at the beginning of my_st_split (since it doesn't make any sense to force a data.table input) and write docs for my_st_split (and maybe a new name sweat_smile )

How about fst_split()?

agila5 commented 3 years ago

Unfortunately (well, not really), I think this now outdated.

Robinlovelace commented 3 years ago

You still got it 30 times faster, which may have inspired the other approach : )

agila5 commented 3 years ago

Fix #416. Next step: wait for @mpadge's comments and Rcpp approach (no time pressure!) and add some tests.

Robinlovelace commented 3 years ago

Looking good @agila5, ready to test/merge I guess :tada:

agila5 commented 3 years ago

Looking good @agila5, ready to test/merge I guess 🎉

IMO, yes it's ready to be merged. Clearly feel free to test it locally if you want. I think the only failing test is unrelated to the PR.

Robinlovelace commented 3 years ago

Hi @agila5 I've tested it with the following reproducible code:

# Aim: test rnet_breakup_vertices

remotes::install_github("ropensci/stplanr")
library(stplanr)

rnet = osm_net_example
system.time(({
  rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
}))

nrow(rnet_processed1)
rnet_processed1$n = seq(nrow(rnet_processed1))
mapview::mapview(rnet_processed1["n"])

remotes::install_github("agila5/stplanr", "refactor_rnet_breakup")
library(stplanr)

rnet = osm_net_example
system.time(({
  rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
}))

nrow(rnet_processed1)
rnet_processed1$n = seq(nrow(rnet_processed1))
mapview::mapview(rnet_processed1["n"])

I think there is a small problem.

Robinlovelace commented 3 years ago

Result of first test with old function:

remotes::install_github("ropensci/stplanr")
#> Skipping install of 'stplanr' from a github remote, the SHA1 (784f64dd) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(stplanr)

rnet = osm_net_example
system.time(({
  rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
}))
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#>    user  system elapsed 
#>   0.189   0.004   0.196

nrow(rnet_processed1)
#> [1] 129
rnet_processed1$n = seq(nrow(rnet_processed1))
mapview::mapview(rnet_processed1["n"])

Created on 2020-12-08 by the reprex package (v0.3.0)

Robinlovelace commented 3 years ago

Result with new function:

remotes::install_github("agila5/stplanr", "refactor_rnet_breakup")
#> Skipping install of 'stplanr' from a github remote, the SHA1 (463ff935) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(stplanr)

rnet = osm_net_example
system.time(({
  rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
}))
#>    user  system elapsed 
#>   0.199   0.012   0.212

nrow(rnet_processed1)
#> [1] 129
rnet_processed1$n = seq(nrow(rnet_processed1))
mapview::mapview(rnet_processed1["n"])

Created on 2020-12-08 by the reprex package (v0.3.0)

agila5 commented 3 years ago

Fixed! It was just a typo at the end.

# updates
remotes::install_github("ropensci/stplanr", "286a667f5e04ad102b070cd0d91ae836bf8e4905")
#> Using github PAT from envvar GITHUB_PAT
#> Skipping install of 'stplanr' from a github remote, the SHA1 (286a667f) has not changed since last install.
#>   Use `force = TRUE` to force installation

# packages
library(stplanr)

# data
rnet = osm_net_example
system.time(({
  rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
}))
#>    user  system elapsed 
#>    0.20    0.05    0.33

rnet_processed1$n = seq(nrow(rnet_processed1))
mapview::mapview(rnet_processed1["n"])

Created on 2020-12-08 by the reprex package (v0.3.0)

Robinlovelace commented 3 years ago

Final test:

> nrow(rnet)
[1] 45464
> system.time(({
+   rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
+ }))
   user  system elapsed 
  1.641   0.008   1.481 
> nrow(rnet_processed1)
[1] 81040
mpadge commented 3 years ago

I'll bury this here just for a comparative reference, starting with modified C++ code now that I see how you're approaching the re-assembly to linestring objects:

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::DataFrame getids (Rcpp::List g) {
    const int n = g.size ();

    int npoints = 0;

    std::unordered_set <std::string> endset, midset, middupset;
    Rcpp::List mids (n);
    for (Rcpp::NumericMatrix i: g) {
        Rcpp::List dimnames = i.attr ("dimnames");
        Rcpp::CharacterVector rownames = dimnames (0);

        npoints += rownames.size ();

        endset.emplace (rownames (0));
        endset.emplace (rownames (rownames.size () - 1));

        rownames.erase (0);
        rownames.erase (rownames.size () - 1);
        for (auto j: rownames)
        {
            std::string js = Rcpp::as <std::string> (j);
            if (midset.find (js) != midset.end ())
                middupset.emplace (j);
            midset.emplace (js);
        }
    }

    std::unordered_set <std::string> mid_is_end;
    for (auto i: midset) {
        if (endset.find (i) != endset.end ())
            mid_is_end.emplace (i);
    }

    // work out how many points in total:
    int index = 0;
    for (Rcpp::NumericMatrix gi: g) {
        Rcpp::List dimnames = gi.attr ("dimnames");
        Rcpp::CharacterVector rownames = dimnames (0);
        for (int j = 0; j < rownames.size (); j++)
        {
            std::string js = Rcpp::as <std::string> (rownames (j));
            index++;
            if (mid_is_end.find (js) != mid_is_end.end ())
                index++;
        }
    }

    std::vector <int> geom_num_vec (index);
    std::vector <double> x (index), y (index);

    size_t geom_num = 1;
    index = 0; // index into output vectors
    for (Rcpp::NumericMatrix gi: g) {
        Rcpp::List dimnames = gi.attr ("dimnames");
        Rcpp::CharacterVector rownames = dimnames (0);
        for (int j = 0; j < gi.nrow (); j++)
        {
            std::string js = Rcpp::as <std::string> (rownames (j));
            geom_num_vec [index] = geom_num;
            x [index] = gi (j, 0);
            y [index] = gi (j, 1);
            index++;
            if (mid_is_end.find (js) != mid_is_end.end ())
            {
                geom_num_vec [index] = geom_num++;
                x [index] = gi (j, 0);
                y [index] = gi (j, 1);
                index++;
            }
        }
        geom_num++;
    }

    Rcpp::DataFrame res =
        Rcpp::DataFrame::create (
                Rcpp::Named ("x") = x,
                Rcpp::Named ("y") = y,
                Rcpp::Named ("geom") = geom_num_vec,
                Rcpp::_["stringsAsFactors"] = false);

    return res;
    }

Saving that file as "breakup.cpp" then gives the following:

library (stplanr)
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.2
library(Rcpp)
sourceCpp ("breakup.cpp")

breakup_cpp <- function (rnet) {
    dat <- getids (rnet$geometry)

    new_linestring <- sfheaders::sfc_linestring(
        dat,
        linestring_id = "geom"
    )
    new_linestring_sfc <- sf::st_sfc(
        new_linestring,
        crs = sf::st_crs(rnet),
        precision = sf::st_precision(rnet)
    )
    return (new_linestring_sfc)
}

fname <- "ms-streetnet.Rds"
if (!file.exists (fname)) {
    library (dodgr)
    net <- dodgr_streetnet ("Münster Germany")
    saveRDS (net, file = fname)
} else {
    rnet <- readRDS (fname)
}

bench::mark(check = FALSE,
            stplr = {rnet_breakup_vertices (rnet)},
            rcid = {breakup_cpp(rnet)},
            time_unit = "ms")
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 stplr      2847.  2847.     0.351        NA     5.97
#> 2 rcid       1214.  1214.     0.824        NA     2.47

nrow (rnet_breakup_vertices (rnet)); length (breakup_cpp (rnet))
#> [1] 84438
#> [1] 92022

Created on 2020-12-08 by the reprex package (v0.3.0)

The final numbers are the crucial point: Using OSM IDs to ascertain whether vertices are identical yields more geoms than using coordinates, because the latter invariably rounds some values to be numerically identical even though the vertices are actually different. Those rounding error conflate non-identical vertices and ultimately give you fewer unique geometries than what there rightfully are. Happy to PR that code if you like

Robinlovelace commented 3 years ago

Many thanks for this faster and seemingly more accurate implementation.

Happy to PR that code if you like

That would be very much appreciated. Would be good to identify exactly what is different, including identifying specific linestrings that are split in the C++ implementation but not in the R one.

mpadge commented 3 years ago

identifying specific linestrings that are split in the C++ implementation but not in the R one.

Yeah, that's obviously important. I don't have time to look further at the mo, but feel free to explore a bit. How about I open a PR and we can start examining there? Or would a new issue be better for future reference?

Robinlovelace commented 3 years ago

Yeah, that's obviously important. I don't have time to look further at the mo, but feel free to explore a bit. How about I open a PR and we can start examining there? Or would a new issue be better for future reference?

Both would be awesome: the issue being that the current implementation misses breaks and the PR fixing it I guess. Very grateful for your input on this!

agila5 commented 3 years ago

Hi @mpadge and, again, thank you very much. I plan to compare the two implementations in a few days (friday or weekend, probably).

mpadge commented 3 years ago

Cool, thanks Andrea - i'll wait to hear back from you first then, and we can proceed after that.