luukvdmeer / sfnetworks

Tidy Geospatial Networks in R
https://luukvdmeer.github.io/sfnetworks/
Other
338 stars 20 forks source link

Use c++ code for testing coordinate equality of nodes #130

Open agila5 opened 3 years ago

agila5 commented 3 years ago

Describe the bug There is a bug in create_nodes_from_edges() when the input coordinates are really really really close but not exactly "identical" (I'm not even sure how to define "identical" here).

Reproducible example

# packages
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(sfnetworks)

# data
degenerate_sfc <- st_sfc(
  st_linestring(rbind(
    c(
      -59.81698520799989893248493899591267108917236328125, 
      -13.6634482960000003259892764617688953876495361328125
    ), 
    c(
      -59.81698520799988472163022379390895366668701171875, 
      -13.6634482960000003259892764617688953876495361328125
    )
  )), 
  st_linestring(rbind(c(1, 2), c(3, 4))),
  crs = 4326
)

# doesn't work
as_sfnetwork(degenerate_sfc)
#> Error in add_vertices(gr, nrow(nodes) - gorder(gr)): At type_indexededgelist.c:369 : cannot add negative number of vertices, Invalid value

The problem occurs in create_nodes_from_edges() and, more precisely, here: https://github.com/luukvdmeer/sfnetworks/blob/eba28b706ae0a5476b946b67c5a25a7fc1591006/R/utils.R#L68-L72 In fact,

my_nodes <- sfnetworks:::linestring_boundary_points(degenerate_sfc)
(my_indices <- match(my_nodes, unique(my_nodes)))
#> [1] 1 1 3 4

# while 
my_nodes[!duplicated(my_indices)]
#> Geometry set for 3 features 
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: -59.81699 ymin: -13.66345 xmax: 3 ymax: 4
#> geographic CRS: WGS 84
#> POINT (-59.81699 -13.66345)
#> POINT (1 2)
#> POINT (3 4)

has only three rows. Please note that the last line of code was taken from https://github.com/luukvdmeer/sfnetworks/blob/eba28b706ae0a5476b946b67c5a25a7fc1591006/R/utils.R#L81 Anyway, that problem implies that the IDs used by tidygraph are wrong, creating an error. Hence, sooner or later I think we should remove the code that compares the equality of numerical values (which is also the problem flagged by @mpadge here).

I know that it sounds like a pathological example, but I found the problem while running some R code for a GIS SE question.

Expected behavior No error.

R Session Info

Session info ``` r devtools::session_info() #> - Session info --------------------------------------------------------------- #> setting value #> version R version 3.6.3 (2020-02-29) #> os Windows 10 x64 #> system x86_64, mingw32 #> ui RTerm #> language (EN) #> collate Italian_Italy.1252 #> ctype Italian_Italy.1252 #> tz Europe/Berlin #> date 2021-02-21 #> #> - Packages ------------------------------------------------------------------- #> package * version date lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0) #> callr 3.5.1 2020-10-13 [1] CRAN (R 3.6.3) #> class 7.3-17 2020-04-26 [1] CRAN (R 3.6.3) #> classInt 0.4-3 2020-04-07 [1] CRAN (R 3.6.3) #> cli 2.3.0 2021-01-31 [1] CRAN (R 3.6.3) #> colorspace 2.0-0 2020-11-11 [1] CRAN (R 3.6.3) #> crayon 1.4.1 2021-02-08 [1] CRAN (R 3.6.3) #> DBI 1.1.1 2021-01-15 [1] CRAN (R 3.6.3) #> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0) #> devtools 2.3.2 2020-09-18 [1] CRAN (R 3.6.3) #> digest 0.6.27 2020-10-24 [1] CRAN (R 3.6.3) #> dplyr 1.0.4 2021-02-02 [1] CRAN (R 3.6.3) #> e1071 1.7-4 2020-10-14 [1] CRAN (R 3.6.3) #> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 3.6.3) #> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0) #> fs 1.5.0 2020-07-31 [1] CRAN (R 3.6.3) #> generics 0.1.0 2020-10-31 [1] CRAN (R 3.6.3) #> ggplot2 3.3.2 2020-06-19 [1] CRAN (R 3.6.3) #> glue 1.4.2 2020-08-27 [1] CRAN (R 3.6.3) #> gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0) #> highr 0.8 2019-03-20 [1] CRAN (R 3.6.0) #> htmltools 0.5.0 2020-06-16 [1] CRAN (R 3.6.3) #> igraph 1.2.6 2020-10-06 [1] CRAN (R 3.6.3) #> KernSmooth 2.23-18 2020-10-29 [1] CRAN (R 3.6.3) #> knitr 1.30 2020-09-22 [1] CRAN (R 3.6.3) #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 3.6.3) #> lwgeom 0.2-5 2020-06-12 [1] CRAN (R 3.6.3) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 3.6.3) #> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0) #> munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0) #> pillar 1.4.7 2020-11-20 [1] CRAN (R 3.6.3) #> pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 3.6.3) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.1) #> pkgload 1.1.0 2020-05-29 [1] CRAN (R 3.6.3) #> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.2) #> processx 3.4.4 2020-09-03 [1] CRAN (R 3.6.3) #> ps 1.4.0 2020-10-07 [1] CRAN (R 3.6.3) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 3.6.3) #> R6 2.5.0 2020-10-28 [1] CRAN (R 3.6.3) #> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 3.6.3) #> remotes 2.2.0 2020-07-21 [1] CRAN (R 3.6.3) #> rlang 0.4.10 2020-12-30 [1] CRAN (R 3.6.3) #> rmarkdown 2.7 2021-02-19 [1] CRAN (R 3.6.3) #> rprojroot 2.0.2 2020-11-15 [1] CRAN (R 3.6.3) #> scales 1.1.1 2020-05-11 [1] CRAN (R 3.6.3) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0) #> sf * 0.9-7 2021-01-06 [1] CRAN (R 3.6.3) #> sfheaders 0.4.0 2020-12-01 [1] CRAN (R 3.6.3) #> sfnetworks * 0.5.0 2021-02-21 [1] local #> stringi 1.5.3 2020-09-09 [1] CRAN (R 3.6.3) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0) #> testthat 3.0.0 2020-10-31 [1] CRAN (R 3.6.3) #> tibble 3.0.6 2021-01-29 [1] CRAN (R 3.6.3) #> tidygraph 1.2.0 2020-05-12 [1] CRAN (R 3.6.3) #> tidyr 1.1.2 2020-08-27 [1] CRAN (R 3.6.3) #> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.3) #> units 0.6-7 2020-06-13 [1] CRAN (R 3.6.3) #> usethis 2.0.0 2020-12-10 [1] CRAN (R 3.6.3) #> vctrs 0.3.6 2020-12-17 [1] CRAN (R 3.6.3) #> withr 2.3.0 2020-09-22 [1] CRAN (R 3.6.3) #> xfun 0.19 2020-10-30 [1] CRAN (R 3.6.3) #> yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.2) #> #> [1] C:/Users/Utente/Documents/R/win-library/3.6 #> [2] C:/Program Files/R/R-3.6.3/library ```
luukvdmeer commented 3 years ago

Hence, sooner or later I think we should remove the code that compares the equality of numerical values

Yes, I would even put it high on the priority list ;)

agila5 commented 3 years ago

Hi everyone. If you are willing to introduce a data.table dependency, then I want to propose the following solution.

Load packages

library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(data.table)

Create a simple example

my_sfc <- st_sfc(
  st_linestring(rbind(c(-1, 0), c(0, 0))), 
  st_linestring(rbind(c(1, 0), c(0, 0))), 
  st_linestring(rbind(c(0, 0), c(0, 1))),
  st_linestring(rbind(c(0, 0), c(0, -1))), 
  st_linestring(rbind(c(0, 1), c(1, 1))), 
  st_linestring(rbind(c(1, 1), c(1, 0))), 
  st_linestring(rbind(c(1, 0), c(1, -1))), 
  st_linestring(rbind(c(0, -1), c(1, -1)))
)

and plot it

par(mar = rep(0, 4))
plot(my_sfc, reset = FALSE)
plot(st_cast(my_sfc, "POINT"), pch = 16, add = TRUE)

Extract the boundary points

nodes <- sfnetworks:::linestring_boundary_points(my_sfc)

Extract their coordinates

nodes_coordinates <- sfheaders::sfc_to_df(nodes)

Convert to data.table format

setDT(nodes_coordinates)

Extract unique coordinates

cols_union <- intersect(colnames(nodes_coordinates), c("x", "y", "z", "m"))
unique_coordinates <- unique(nodes_coordinates, by = cols_union)

Extract the ID of the row numbers

unique_coordinates[nodes_coordinates, which = TRUE, on = cols_union]
#>  [1] 1 2 3 2 2 4 2 5 4 6 6 3 3 7 5 7

The output is identical to the current procedure:

match(nodes, unique(nodes))
#>  [1] 1 2 3 2 2 4 2 5 4 6 6 3 3 7 5 7

Summarise the previous steps into a function

my_match_with_unique <- function(nodes) {
  nodes_coordinates <- sfheaders::sfc_to_df(nodes)
  setDT(nodes_coordinates)
  cols_union <- intersect(colnames(nodes_coordinates), c("x", "y", "z", "m"))
  unique_coordinates <- unique(nodes_coordinates, by = cols_union)
  unique_coordinates[nodes_coordinates, which = TRUE, on = cols_union]
}

and test it again

my_match_with_unique(nodes)
#>  [1] 1 2 3 2 2 4 2 5 4 6 6 3 3 7 5 7

Define the degenerate data used in the previous example

degenerate_sfc <- st_sfc(
  st_linestring(rbind(
    c(
      -59.81698520799989893248493899591267108917236328125, 
      -13.6634482960000003259892764617688953876495361328125
    ), 
    c(
      -59.81698520799988472163022379390895366668701171875, 
      -13.6634482960000003259892764617688953876495361328125
    )
  )), 
  st_linestring(rbind(c(1, 2), c(3, 4))),
  crs = 4326
)

Extract the nodes

nodes <- sfnetworks:::linestring_boundary_points(degenerate_sfc)

Compare new code

my_match_with_unique(nodes)
#> [1] 1 2 3 4

Current (wrong) code

match(nodes, unique(nodes))
#> [1] 1 1 3 4

Test it again with roxel data

nodes <- sfnetworks:::linestring_boundary_points(sfnetworks::roxel)
all(my_match_with_unique(nodes) == match(nodes, unique(nodes)))
#> [1] TRUE

A slightly more difficult example:

iow <- osmextract::oe_get("Isle of Wight")
#> The input place was matched with: Isle of Wight
#> The chosen file was already detected in the download directory. Skip downloading.
#> The corresponding gpkg file was already detected. Skip vectortranslate operations.
#> Reading layer `lines' from data source `C:\Users\Utente\Documents\osm-data\geofabrik_isle-of-wight-latest.gpkg' using driver `GPKG'
#> Simple feature collection with 45621 features and 9 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: -5.401978 ymin: 43.35489 xmax: -0.175775 ymax: 50.89601
#> geographic CRS: WGS 84
nodes <- sfnetworks:::linestring_boundary_points(iow)
bench::mark(
  new = my_match_with_unique(nodes), 
  current = match(nodes, unique(nodes)), 
  check = TRUE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new          68.7ms  76.37ms    12.5      8.69MB    1.79 
#> 2 current       1.08s    1.08s     0.923    6.74MB    0.923

Even more difficult

greater_london <- osmextract::oe_get("Greater London")
#> The input place was matched with: Greater London
#> The chosen file was already detected in the download directory. Skip downloading.
#> The corresponding gpkg file was already detected. Skip vectortranslate operations.
#> Reading layer `lines' from data source `C:\Users\Utente\Documents\osm-data\geofabrik_greater-london-latest.gpkg' using driver `GPKG'
#> Simple feature collection with 423239 features and 9 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: -0.6125036 ymin: 51.22684 xmax: 0.399669 ymax: 51.78736
#> geographic CRS: WGS 84
nodes <- sfnetworks:::linestring_boundary_points(greater_london)
bench::mark(
  new = my_match_with_unique(nodes), 
  current = match(nodes, unique(nodes)), 
  check = TRUE
)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new        729.49ms 729.49ms     1.37     76.9MB        0
#> 2 current       8.59s    8.59s     0.116    57.1MB        0

Created on 2021-03-13 by the reprex package (v1.0.0)

To summarise:

  1. It seems that the new approach works for this type of problems and it is also slightly faster than the current code (totally irrelevant in all cases but really large spatial networks)
  2. Introduce new data.table dependency.

I know that it may create more problems than benefits, but I don't know how to replicate the following code using regular data.frame or Matrix objects:

unique_coordinates[nodes_coordinates, which = TRUE, on = cols_union]

EDIT: Add more details

I also tested a combination of igraph and sf::st_equals_exact setting par = .Machine$double.eps but the problem is that running sf::st_equals_exact on a lot of nodes is quite slow. The same examples presented before run in 100/1000x the current time.

A more general question: I think that the problem of matching and extracting unique sfg objects is much more general than our problems but, AFAIK, there are no ad-hoc methods in sf. Hence, I was thinking of creating a new issue there asking for feedback. I will never be able to code the new functions myself with Rcpp or similar stuff so, for the moment, I would simply propose the solution detailed above.

luukvdmeer commented 3 years ago

Thanks @agila5! Interesting approach, did not think before of doing it like that. Personally I am a bit reluctant to include the data.table dependency, since we already have quite some heavy dependencies with the tidyverse packages, either directly (dplyr and tibble) or indirectly through tidygraph. But when it adds large advantages, also for other parts of the code, we should of course consider it either way.

I have been thinking about this and came up with an idea of using st_equals inside a custom st_match function. The idea would be to find for each point the set indices of the points that equal it, and then take the smallest of these indices. Then, we use match as before, but on the integer indices instead of the coordinate values themselves. Something like:

st_match = function(x) {
  idxs = do.call("c", lapply(st_equals(x), `[`, 1))
  match(idxs, unique(idxs))
}

Some checks and benchmarks. Note that the runtime of st_match is comparable (even a bit faster) to our current implementation, but slower than the data.table solution:

nodes <- sfnetworks:::linestring_boundary_points(my_sfc)
my_match_with_unique(nodes)  
# [1] 1 2 3 2 2 4 2 5 4 6 6 3 3 7 5 7 
st_match(nodes)  
# [1] 1 2 3 2 2 4 2 5 4 6 6 3 3 7 5 7

nodes <- sfnetworks:::linestring_boundary_points(degenerate_sfc)
my_match_with_unique(nodes)
# [1] 1 2 3 4
st_match(nodes)
# [1] 1 2 3 4

nodes <- sfnetworks:::linestring_boundary_points(sfnetworks::roxel)
all(my_match_with_unique(nodes) == st_match(nodes))
# [1] TRUE

iow <- osmextract::oe_get("Isle of Wight")
#> The input place was matched with: Isle of Wight
#> The chosen file was already detected in the download directory. Skip downloading.
#> The corresponding gpkg file was already detected. Skip vectortranslate operations.
#> Reading layer `lines' from data source `C:\Users\Utente\Documents\osm-data\geofabrik_isle-of-wight-latest.gpkg' using driver `GPKG'
#> Simple feature collection with 45621 features and 9 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: -5.401978 ymin: 43.35489 xmax: -0.175775 ymax: 50.89601
#> geographic CRS: WGS 84
nodes <- sfnetworks:::linestring_boundary_points(iow)
bench::mark(
  dt = my_match_with_unique(nodes), 
  st = st_match(nodes),
  base = match(nodes, unique(nodes)), 
  check = TRUE
 )
# A tibble: 3 x 13
#  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr
#  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
# 1 dt           46.3ms   51.4ms     19.7     8.75MB     9.84    10
# 2 st          685.4ms  685.4ms      1.46    8.23MB     4.38     1
# 3 base        847.6ms  847.6ms      1.18    5.76MB     2.36     1
# … with 6 more variables: n_gc <dbl>, total_time <bch:tm>,
#   result <list>, memory <list>, time <list>, gc <list>

greater_london <- osmextract::oe_get("Greater London")
#> The input place was matched with: Greater London
#> The chosen file was already detected in the download directory. Skip downloading.
#> The corresponding gpkg file was already detected. Skip vectortranslate operations.
#> Reading layer `lines' from data source `C:\Users\Utente\Documents\osm-data\geofabrik_greater-london-latest.gpkg' using driver `GPKG'
#> Simple feature collection with 423239 features and 9 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: -0.6125036 ymin: 51.22684 xmax: 0.399669 ymax: 51.78736
#> geographic CRS: WGS 84
nodes <- sfnetworks:::linestring_boundary_points(greater_london)
bench::mark(
  dt = my_match_with_unique(nodes), 
  st = st_match(nodes),
  base = match(nodes, unique(nodes)), 
  check = TRUE
 )
# A tibble: 3 x 13
#  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr
#  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int>
# 1 dt         955.89ms 955.89ms     1.05     81.3MB    1.05      1
# 2 st             7.6s     7.6s     0.132    75.4MB    0.527     1
# 3 base          7.97s    7.97s     0.126      51MB    0.251     1
# … with 6 more variables: n_gc <dbl>, total_time <bch:tm>,
#   result <list>, memory <list>, time <list>, gc <list>

What I like about the st_equals solution:

What I like about the data.table solution:

I think we need to balance the pros and cons. Any ideas on that?

agila5 commented 3 years ago

Hi @luukvdmeer. Great ideas and great solution (as always). I would adopt your approach since it adds no extra dependency and (I'm pretty sure) nobody cares about 6 seconds of extra computing time when working with such a large spatial network.

mpadge commented 3 years ago

Here's a C++ version that has 3 important advantages:

  1. No extra deps at all
  2. Faster than any other option
  3. Most importantly, has the absolutely necessary explicit precision parameter to control the precision used to determine equality

The source code is this, presumed to be in a local file src.cpp:

#include <utility> // for std::pair
#include <Rcpp.h>

typedef std::pair <long int, long int> nodePair_t;

struct pair_hash {
    inline std::size_t operator() (const nodePair_t & n) const {
        return n.first * 31. + n.second;
    }
};

// [[Rcpp::export]]
Rcpp::IntegerVector cpp_match (Rcpp::List g, double precision = 1e-10) {

    const long int prec_int = round (1 / precision);

    Rcpp::IntegerVector index (g.size ());

    std::unordered_map <nodePair_t, int, pair_hash> nodeMap;

    int count = 0;
    int n = 0;
    int index_i = 0;
    for (auto i: g) {

        const std::vector <double> xy = Rcpp::as <std::vector <double>> (i);
        nodePair_t nodes = std::make_pair (
                round (xy [0] * prec_int),
                round (xy [1] * prec_int));

        if (nodeMap.find (nodes) == nodeMap.end ()) {
            n = count;
            nodeMap.emplace (nodes, count++);
        } else {
            n = nodeMap.at (nodes);
        }

        // store n+1 for direct 1-based R indexing:
        index (index_i++) = n + 1;
    }

    return index;
}

And here is the benchmarking against the current solutions:

library (sfnetworks)
library(sf)
#> Linking to GEOS 3.8.1, GDAL 3.0.4, PROJ 6.3.2
library (data.table)
library(Rcpp)
library (osmextract)

sourceCpp ("src.cpp")

iow <- osmextract::oe_get("Isle of Wight")
nodes <- sfnetworks:::linestring_boundary_points(iow)

my_match_with_unique <- function(nodes) {
  nodes_coordinates <- sfheaders::sfc_to_df(nodes)
  setDT(nodes_coordinates)
  cols_union <- intersect(colnames(nodes_coordinates), c("x", "y", "z", "m"))
  unique_coordinates <- unique(nodes_coordinates, by = cols_union)
  unique_coordinates[nodes_coordinates, which = TRUE, on = cols_union]
}
st_match = function(x) {
  idxs = do.call("c", lapply(st_equals(x), `[`, 1))
  match(idxs, unique(idxs))
}

bench::mark(
  dt = my_match_with_unique(nodes), 
  st = st_match (nodes),
  base = match(nodes, unique(nodes)), 
  cpp = cpp_match (nodes),
  check = TRUE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 x 6
#>   expression            min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>       <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dt                 45.7ms     49ms     19.6         NA     7.83
#> 2 st                750.3ms    750ms      1.33        NA     4.00
#> 3 base              706.5ms    707ms      1.42        NA     0   
#> 4 cpp                46.1ms     47ms     21.2         NA     0

# and most importrantly:
identical (st_match (nodes), cpp_match (nodes))
#> TRUE

Created on 2021-03-15 by the reprex package (v1.0.0)

Also a clear demonstration of how amazing data.table really is!! But the data.table solution should not be implemented because it still can't control for numeric precision in equality of floating-point numbers. st_match can via st_equals, but is way slower than straight C++.

@luukvdmeer You currently have no src/ directory, and no C++ code at all. If you want to include this, it would minimally require, I would suggest, one additional cpp11 dependency, which is very lightweight. I strongly suspect you'll eventually want to rewrite functionality in C++ regardless, so think adding a dep like that would be unavoidable in the long run. Let me know if you want a PR along these lines.

agila5 commented 3 years ago

Hi @mpadge. I don't understand the C++ code, but thank you very much for your assistance!

mpadge commented 3 years ago

Sorry @agila5, to clarify: save the C++ chunk above as a local file called src.cpp, then just run the following to load that file into your workspace:

library (Rcpp)
sourceCpp ("src.cpp")

You can then call it just like the other functions, and use it in benchmarking. Don't worry about the details of how it works, but please do mess around with the precision value! LIke this:

library (sfnetworks)
library(sf)
#> Linking to GEOS 3.8.1, GDAL 3.0.4, PROJ 6.3.2
library(Rcpp)
library (osmextract)

sourceCpp ("src.cpp")

iow <- osmextract::oe_get("Isle of Wight")
nodes <- sfnetworks:::linestring_boundary_points(iow)

precision <- 10 ^ -(1:10)
nunique <- vapply (precision, function (i)
                   length (unique (cpp_match (nodes, precision = i))),
                   integer (1))
x <- data.frame (precision = precision,
                 n = nunique)
library (ggplot2)
ggplot (x, aes (x = precision, y = n)) +
    geom_line () +
    scale_x_log10 () +
    scale_y_log10 ()

Created on 2021-03-15 by the reprex package (v1.0.0)

luukvdmeer commented 3 years ago

Thanks @mpadge that looks great! One question I have: the default precision you use (i.e. 1e-10), do you know if this is a commonly used default for such equality checks, e.g. also used in st_equals (I guess they should also internally set some precision right?)

mpadge commented 3 years ago

There is no specific reason, just a generic value; lots of routines use around 1e-12 because you want values at least thousands of times larger than .Machine$double.eps, which is often around 1e-16. sf itself is a kind of look-to-the-future system in which data will specify their own intrinsic precision (see st_precision(), for example). This is all part of the WKB spec, and all works perfectly as long as data do indeed specify their own precision. In the meantime, sf has a default value here:

https://github.com/r-spatial/sf/blob/3c0dd42fcf4d1a9a4a606b5af4f5bf7800b34d55/R/geom-predicates.R#L35

which happens to perfectly sensibly be 0.0. And that means extracting st_precision() from any data which do not explicitly set that value - notably including all OSM data -- simply returns zero. sf then rightly uses that value, and so effectively presumes geometries are specified at (arbitrary and unreliable) machine precision. What that really means is that it is up to users of st_equals to confirm themselves what they are doing. And what they should be doing is using st_equals_exact() with an explicit value for par. And that returns to general computational practices of setting it at least thousands of times greater than eps, so defaulting to something like O(e-10 -- e-12). The above graph is meant to show that it really doesn't matter what the value is, as long as it is something less than e-6 or so.

These considerations nevertheless suggest that it would be good to first check st_precision() of the input data, and to take that value where it is > 0, otherwise to default to 1e-12 or whatever.

agila5 commented 3 years ago

Good morning @mpadge, and (again) thank you very much for your time, help, and explanations. I have just one more consideration regarding the precision argument. I tested the C++ code using projected coordinates (i.e. not 4326), and I think that, in those cases, we should adopt a different default value:

# packages
library(ggplot2)

# load cpp function
Rcpp::sourceCpp("C:/Users/Utente/Desktop/src.cpp")

# OSM data
iow <- sf::st_transform(osmextract::oe_get("Isle of Wight"), 27700)
nodes <- sfnetworks:::linestring_boundary_points(iow)

precision <- 10 ^ -(1:12)
nunique <- vapply (precision, function (i) {
    length(unique (cpp_match (nodes, precision = i)))
  },
  integer (1)
)

x <- data.frame (
  precision = precision,
  n = nunique
)
ggplot (x, aes (x = precision, y = n)) +
  geom_line () +
  scale_x_log10 () +
  scale_y_log10 ()

Created on 2021-03-16 by the reprex package (v1.0.0)

Maybe it's possible to modify the C++ code and merge the two scenarios (I don't know), but I just wanted to highlight that (at the moment) the default value of precision should take into account the unit of measurement of input data (meter or degree)

mpadge commented 3 years ago

That can be handled straight in R by something like the following lines, which estimate how many digits are used to represent the whole-number part of coordinates, as well as the maximal possible number that can be used to store integer values.

n <- vapply (nodes [[1]], function (i)
             nchar (as.character (round (i))),
             integer (1))
nmax <- nchar (as.character (.Machine$integer.max))

For 4326, n is (2, 2), whereas for 27700, it is (6, 5). nmax is 10, and it would generally suffice to then set precision = 10 ^ (-(nmax - n + extra_bit)), where extra_bit might be 1 or 2 just to make sure. With extra_bit = 2, you'd then get precision = 1e-10 for 4326, and 1e-6 for 27700. Note that the C++ code uses a long int which does not exist in R, and is much bigger than R's .Machine$integer.max. This is nevertheless still machine dependent, and @agila5, I'd guess somehow your machine is doing something strange with that, because n should never increase like in your result. I just get a perfectly flat line from your code, with precision having no effect. You are clearly seeing an overflow effect, where n increases because your values for small precision start getting rounded to some upper limit of your machine representation of integer values. This is definitely a machine effect on your side, and illustrates precisely the kind of reason why explicitly considering and controlling precision is really important. The above code snippets should do the job.

agila5 commented 3 years ago

This is nevertheless still machine dependent, and @agila5, I'd guess somehow your machine is doing something strange with that, because n should never increase like in your result. I just get a perfectly flat line from your code, with precision having no effect.

I think it must be related to the fact that I use Windows on my laptop because when I run the same code on a virtual machine (running ubuntu 16.04) I observe the same behaviour that you describe (i.e. a flat line going down for smaller values of precision).

agila5 commented 3 years ago

Just for "fun", I tried to test those problems a little bit more (since I really don't want to update sfnetworks with something that makes my laptop useless 😅), and these are the results. It's a long comment, sorry about that. I add as much verbose output and details as possible (in case they are somehow useful).

On Ubuntu 16.04 I observe the same behaviour that you describe, i.e. C++ code uses a long int which does not exist in R and is much bigger than R's .Machine$integer.max.

# load cpp function
Rcpp::sourceCpp(
  code = '
  #include <Rcpp.h>
  // [[Rcpp::export]]
  int cpp_match (double precision = 1e-10) {

  const long int prec_int = round (1 / precision);
  Rcpp::Rcout << "prec_int is equal to: " << prec_int << std::endl;
  return prec_int;
  }', 
  verbose = TRUE
)
#> 
#> Generated extern "C" functions 
#> --------------------------------------------------------
#> 
#> 
#> #include <Rcpp.h>
#> // cpp_match
#> int cpp_match(double precision);
#> RcppExport SEXP sourceCpp_1_cpp_match(SEXP precisionSEXP) {
#> BEGIN_RCPP
#>     Rcpp::RObject rcpp_result_gen;
#>     Rcpp::RNGScope rcpp_rngScope_gen;
#>     Rcpp::traits::input_parameter< double >::type precision(precisionSEXP);
#>     rcpp_result_gen = Rcpp::wrap(cpp_match(precision));
#>     return rcpp_result_gen;
#> END_RCPP
#> }
#> 
#> Generated R functions 
#> -------------------------------------------------------
#> 
#> `.sourceCpp_1_DLLInfo` <- dyn.load('/tmp/RtmpmMDLp0/sourceCpp-x86_64-pc-linux-gnu-1.0.6/sourcecpp_83071fec14b1/sourceCpp_2.so')
#> 
#> cpp_match <- Rcpp:::sourceCppFunction(function(precision = 1e-10) {}, FALSE, `.sourceCpp_1_DLLInfo`, 'sourceCpp_1_cpp_match')
#> 
#> rm(`.sourceCpp_1_DLLInfo`)
#> 
#> Building shared library
#> --------------------------------------------------------
#> 
#> DIR: /tmp/RtmpmMDLp0/sourceCpp-x86_64-pc-linux-gnu-1.0.6/sourcecpp_83071fec14b1
#> 
#> /usr/lib/R/bin/R CMD SHLIB -o 'sourceCpp_2.so' 'file830749bf89dd.cpp'

res <- vapply(
  X = 10 ^ (-(0:12)), 
  FUN = cpp_match, 
  FUN.VALUE = integer(1)
)
#> prec_int is equal to: 1
#> prec_int is equal to: 10
#> prec_int is equal to: 100
#> prec_int is equal to: 1000
#> prec_int is equal to: 10000
#> prec_int is equal to: 100000
#> prec_int is equal to: 1000000
#> prec_int is equal to: 10000000
#> prec_int is equal to: 100000000
#> prec_int is equal to: 1000000000
#> prec_int is equal to: 10000000000
#> prec_int is equal to: 100000000000
#> prec_int is equal to: 1000000000000

res
#>  [1]          1         10        100       1000      10000     100000
#>  [7]    1000000   10000000  100000000 1000000000 1410065408 1215752192
#> [13] -727379968

Created on 2021-03-17 by the reprex package (v0.3.0)

Session info ``` r devtools::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 3.6.3 (2020-02-29) #> os Ubuntu 16.04.6 LTS #> system x86_64, linux-gnu #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Europe/Rome #> date 2021-03-17 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> assertthat 0.2.1 2019-03-21 [3] CRAN (R 3.5.3) #> backports 1.1.10 2020-09-15 [1] CRAN (R 3.6.3) #> callr 3.4.4 2020-09-07 [1] CRAN (R 3.6.3) #> cli 2.2.0 2020-11-20 [1] CRAN (R 3.6.3) #> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.3) #> desc 1.2.0 2018-05-01 [3] CRAN (R 3.5.0) #> devtools 2.2.2 2020-02-17 [3] CRAN (R 3.6.2) #> digest 0.6.27 2020-10-24 [1] CRAN (R 3.6.3) #> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 3.6.3) #> evaluate 0.14 2019-05-28 [3] CRAN (R 3.6.0) #> fansi 0.4.1 2020-01-08 [3] CRAN (R 3.6.2) #> fs 1.3.2 2020-03-05 [3] CRAN (R 3.6.3) #> glue 1.4.2 2020-08-27 [1] CRAN (R 3.6.3) #> highr 0.8 2019-03-20 [3] CRAN (R 3.5.3) #> htmltools 0.5.0 2020-06-16 [1] CRAN (R 3.6.3) #> knitr 1.30 2020-09-22 [1] CRAN (R 3.6.3) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 3.6.3) #> memoise 1.1.0 2017-04-21 [3] CRAN (R 3.5.0) #> pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 3.6.3) #> pkgload 1.1.0 2020-05-29 [1] CRAN (R 3.6.3) #> prettyunits 1.1.1 2020-01-24 [3] CRAN (R 3.6.2) #> processx 3.4.4 2020-09-03 [1] CRAN (R 3.6.3) #> ps 1.3.4 2020-08-11 [1] CRAN (R 3.6.3) #> R6 2.5.0 2020-10-28 [1] CRAN (R 3.6.3) #> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 3.6.3) #> remotes 2.1.1 2020-02-15 [3] CRAN (R 3.6.2) #> rlang 0.4.9 2020-11-26 [1] CRAN (R 3.6.3) #> rmarkdown 2.5 2020-10-21 [1] CRAN (R 3.6.3) #> rprojroot 1.3-2 2018-01-03 [3] CRAN (R 3.5.0) #> sessioninfo 1.1.1 2018-11-05 [3] CRAN (R 3.5.1) #> stringi 1.5.3 2020-09-09 [1] CRAN (R 3.6.3) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.3) #> testthat 2.3.2 2020-03-02 [3] CRAN (R 3.6.3) #> usethis 1.6.1 2020-04-29 [1] CRAN (R 3.6.3) #> withr 2.3.0 2020-09-22 [1] CRAN (R 3.6.3) #> xfun 0.21 2021-02-10 [1] CRAN (R 3.6.3) #> yaml 2.2.1 2020-02-01 [3] CRAN (R 3.6.2) #> #> [1] /home/dsuser/R/x86_64-pc-linux-gnu-library/3.6 #> [2] /usr/local/lib/R/site-library #> [3] /usr/lib/R/site-library #> [4] /usr/lib/R/library ```

On the other hand, on Windows:

# load cpp function
Rcpp::sourceCpp(
  code = '
  #include <Rcpp.h>
  // [[Rcpp::export]]
  int cpp_match (double precision = 1e-10) {

  const long int prec_int = round (1 / precision);
  Rcpp::Rcout << "prec_int is equal to: " << prec_int << std::endl;
  return prec_int;
  }', 
  verbose = TRUE
)
#> 
#> Generated extern "C" functions 
#> --------------------------------------------------------
#> 
#> 
#> #include <Rcpp.h>
#> // cpp_match
#> int cpp_match(double precision);
#> RcppExport SEXP sourceCpp_1_cpp_match(SEXP precisionSEXP) {
#> BEGIN_RCPP
#>     Rcpp::RObject rcpp_result_gen;
#>     Rcpp::RNGScope rcpp_rngScope_gen;
#>     Rcpp::traits::input_parameter< double >::type precision(precisionSEXP);
#>     rcpp_result_gen = Rcpp::wrap(cpp_match(precision));
#>     return rcpp_result_gen;
#> END_RCPP
#> }
#> 
#> Generated R functions 
#> -------------------------------------------------------
#> 
#> `.sourceCpp_1_DLLInfo` <- dyn.load('C:/Users/Utente/AppData/Local/Temp/RtmpSUEJy1/sourceCpp-x86_64-w64-mingw32-1.0.6/sourcecpp_2bd017893460/sourceCpp_2.dll')
#> 
#> cpp_match <- Rcpp:::sourceCppFunction(function(precision = 1e-10) {}, FALSE, `.sourceCpp_1_DLLInfo`, 'sourceCpp_1_cpp_match')
#> 
#> rm(`.sourceCpp_1_DLLInfo`)
#> 
#> Building shared library
#> --------------------------------------------------------
#> 
#> DIR: C:/Users/Utente/AppData/Local/Temp/RtmpSUEJy1/sourceCpp-x86_64-w64-mingw32-1.0.6/sourcecpp_2bd017893460
#> 
#> C:/PROGRA~1/R/R-40~1.4/bin/x64/R CMD SHLIB -o "sourceCpp_2.dll" "file2bd053da4388.cpp"

res <- vapply(
  X = 10 ^ (-(0:12)), 
  FUN = cpp_match, 
  FUN.VALUE = integer(1)
)
#> prec_int is equal to: 1
#> prec_int is equal to: 10
#> prec_int is equal to: 100
#> prec_int is equal to: 1000
#> prec_int is equal to: 10000
#> prec_int is equal to: 100000
#> prec_int is equal to: 1000000
#> prec_int is equal to: 10000000
#> prec_int is equal to: 100000000
#> prec_int is equal to: 1000000000
#> prec_int is equal to: -2147483648
#> prec_int is equal to: -2147483648
#> prec_int is equal to: -2147483648

res
#>  [1]          1         10        100       1000      10000     100000
#>  [7]    1000000   10000000  100000000 1000000000         NA         NA
#> [13]         NA

Created on 2021-03-17 by the reprex package (v1.0.0)

Session info ``` r sessioninfo::session_info() #> - Session info --------------------------------------------------------------- #> setting value #> version R version 4.0.4 (2021-02-15) #> os Windows 10 x64 #> system x86_64, mingw32 #> ui RTerm #> language (EN) #> collate Italian_Italy.1252 #> ctype Italian_Italy.1252 #> tz Europe/Berlin #> date 2021-03-17 #> #> - Packages ------------------------------------------------------------------- #> package * version date lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.4) #> backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.3) #> cli 2.3.1 2021-02-23 [1] CRAN (R 4.0.4) #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.4) #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.4) #> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.4) #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.4) #> fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.4) #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.4) #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.4) #> highr 0.8 2019-03-20 [1] CRAN (R 4.0.4) #> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.4) #> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.4) #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.4) #> pillar 1.5.1 2021-03-05 [1] CRAN (R 4.0.4) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.4) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.4) #> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.4) #> reprex 1.0.0 2021-01-27 [1] CRAN (R 4.0.4) #> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.4) #> rmarkdown 2.7 2021-02-19 [1] CRAN (R 4.0.4) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.4) #> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.3) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.4) #> styler 1.3.2 2020-02-23 [1] CRAN (R 4.0.4) #> tibble 3.1.0 2021-02-25 [1] CRAN (R 4.0.4) #> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.4) #> vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.4) #> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.4) #> xfun 0.22 2021-03-11 [1] CRAN (R 4.0.4) #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.3) #> #> [1] C:/Users/Utente/Documents/R/win-library/4.0 #> [2] C:/Program Files/R/R-4.0.4/library ```

Unfortunately, that also implies that I cannot use cpp_match on my laptop when precision is greater or equal to 10^-9. For example (with CRS: 4326):

# load cpp function
Rcpp::sourceCpp("src.cpp") # the same code that you shared

# load data
roads <- sfnetworks::roxel

# extract nodes
nodes <- sfnetworks:::linestring_boundary_points(roads)

table(cpp_match(nodes), useNA = "ifany")
#> 
#>    1 
#> 1702

Created on 2021-03-17 by the reprex package (v1.0.0)

Does it make sense to use long long int or unsigned long long int? I really don't know anything about C types, but I read on the wiki page that they should support bigger numbers (stored as exact integers). For example (on Windows):

# load cpp function
Rcpp::sourceCpp(
  code = '
  #include <Rcpp.h>
  // [[Rcpp::export]]
  int cpp_match (double precision = 1e-10) {

  const unsigned long long int prec_int = round (1 / precision);
  Rcpp::Rcout << "prec_int is equal to: " << prec_int << std::endl;
  return prec_int;
  }', 
  verbose = TRUE
)
#> 
#> Generated extern "C" functions 
#> --------------------------------------------------------
#> 
#> 
#> #include <Rcpp.h>
#> // cpp_match
#> int cpp_match(double precision);
#> RcppExport SEXP sourceCpp_1_cpp_match(SEXP precisionSEXP) {
#> BEGIN_RCPP
#>     Rcpp::RObject rcpp_result_gen;
#>     Rcpp::RNGScope rcpp_rngScope_gen;
#>     Rcpp::traits::input_parameter< double >::type precision(precisionSEXP);
#>     rcpp_result_gen = Rcpp::wrap(cpp_match(precision));
#>     return rcpp_result_gen;
#> END_RCPP
#> }
#> 
#> Generated R functions 
#> -------------------------------------------------------
#> 
#> `.sourceCpp_1_DLLInfo` <- dyn.load('C:/Users/Utente/AppData/Local/Temp/RtmpE7A6zT/sourceCpp-x86_64-w64-mingw32-1.0.6/sourcecpp_39b04576172f/sourceCpp_2.dll')
#> 
#> cpp_match <- Rcpp:::sourceCppFunction(function(precision = 1e-10) {}, FALSE, `.sourceCpp_1_DLLInfo`, 'sourceCpp_1_cpp_match')
#> 
#> rm(`.sourceCpp_1_DLLInfo`)
#> 
#> Building shared library
#> --------------------------------------------------------
#> 
#> DIR: C:/Users/Utente/AppData/Local/Temp/RtmpE7A6zT/sourceCpp-x86_64-w64-mingw32-1.0.6/sourcecpp_39b04576172f
#> 
#> C:/PROGRA~1/R/R-40~1.4/bin/x64/R CMD SHLIB -o "sourceCpp_2.dll" "file39b067833a46.cpp"

res <- vapply(
  X = 10 ^ (-(0:16)), 
  FUN = cpp_match, 
  FUN.VALUE = integer(1)
)
#> prec_int is equal to: 1
#> prec_int is equal to: 10
#> prec_int is equal to: 100
#> prec_int is equal to: 1000
#> prec_int is equal to: 10000
#> prec_int is equal to: 100000
#> prec_int is equal to: 1000000
#> prec_int is equal to: 10000000
#> prec_int is equal to: 100000000
#> prec_int is equal to: 1000000000
#> prec_int is equal to: 10000000000
#> prec_int is equal to: 100000000000
#> prec_int is equal to: 1000000000000
#> prec_int is equal to: 10000000000000
#> prec_int is equal to: 100000000000000
#> prec_int is equal to: 1000000000000000
#> prec_int is equal to: 10000000000000000

res
#>  [1]           1          10         100        1000       10000      100000
#>  [7]     1000000    10000000   100000000  1000000000  1410065408  1215752192
#> [13]  -727379968  1316134912   276447232 -1530494976  1874919424

Created on 2021-03-17 by the reprex package (v1.0.0)

Session info ``` r sessioninfo::session_info() #> - Session info --------------------------------------------------------------- #> setting value #> version R version 4.0.4 (2021-02-15) #> os Windows 10 x64 #> system x86_64, mingw32 #> ui RTerm #> language (EN) #> collate Italian_Italy.1252 #> ctype Italian_Italy.1252 #> tz Europe/Berlin #> date 2021-03-17 #> #> - Packages ------------------------------------------------------------------- #> package * version date lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.4) #> backports 1.2.1 2020-12-09 [1] CRAN (R 4.0.3) #> cli 2.3.1 2021-02-23 [1] CRAN (R 4.0.4) #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.4) #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.4) #> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.4) #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.4) #> fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.4) #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.4) #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.4) #> highr 0.8 2019-03-20 [1] CRAN (R 4.0.4) #> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.4) #> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.4) #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.4) #> pillar 1.5.1 2021-03-05 [1] CRAN (R 4.0.4) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.4) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.4) #> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.4) #> reprex 1.0.0 2021-01-27 [1] CRAN (R 4.0.4) #> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.4) #> rmarkdown 2.7 2021-02-19 [1] CRAN (R 4.0.4) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.4) #> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.3) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.4) #> styler 1.3.2 2020-02-23 [1] CRAN (R 4.0.4) #> tibble 3.1.0 2021-02-25 [1] CRAN (R 4.0.4) #> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.4) #> vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.4) #> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.4) #> xfun 0.22 2021-03-11 [1] CRAN (R 4.0.4) #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.3) #> #> [1] C:/Users/Utente/Documents/R/win-library/4.0 #> [2] C:/Program Files/R/R-4.0.4/library ```

I have no idea about the drawbacks.

mpadge commented 3 years ago

Wow, thanks @agila5, those results on windows are really ... interesting. The short answer to your Q is, Yes, you can easily just use long long int instead of long int, and there are no drawbacks at all. There shouldn't even be any speed drawbacks.

A slight longer answer: if you really wanted to ensure that all was truly okay back over in R, you could determine a precision value like this:

xymax <- max (abs (st_coordinates (nodes)))
precision <- .Machine$integer.max / xymax
precision <- nchar (as.character (round (precision)))
extra_bit <- 2 # or whatever
precision <- 1 / 10 ^ (precision - extra_bit)

But using long long int would mean you shouldn't even need that. These long int / long long int things only have to exist in C++ to determine whether nodal coordinates are "equal" or not, and have nothing to do with R, so it doesn't matter that your values returned to R are complete rubbish, as long as the C++ representations are good.

luukvdmeer commented 3 years ago

Just to fix the bug for the time being I am now using st_match instead of match. This takes the "high priority" of this issue. Still interested to add the C++ version, but need more time to look into it (which I am currently short of!).

mpadge commented 3 years ago

No worries @luukvdmeer, time is precious. As said, the C++ code is all there, so happy to PR when you've got time to review it - just let me know.