Open agila5 opened 3 years ago
Hence, sooner or later I think we should remove the code that compares the equality of numerical values
Yes, I would even put it high on the priority list ;)
Hi everyone. If you are willing to introduce a data.table
dependency, then I want to propose the following solution.
Load packages
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(data.table)
Create a simple example
my_sfc <- st_sfc(
st_linestring(rbind(c(-1, 0), c(0, 0))),
st_linestring(rbind(c(1, 0), c(0, 0))),
st_linestring(rbind(c(0, 0), c(0, 1))),
st_linestring(rbind(c(0, 0), c(0, -1))),
st_linestring(rbind(c(0, 1), c(1, 1))),
st_linestring(rbind(c(1, 1), c(1, 0))),
st_linestring(rbind(c(1, 0), c(1, -1))),
st_linestring(rbind(c(0, -1), c(1, -1)))
)
and plot it
par(mar = rep(0, 4))
plot(my_sfc, reset = FALSE)
plot(st_cast(my_sfc, "POINT"), pch = 16, add = TRUE)
Extract the boundary points
nodes <- sfnetworks:::linestring_boundary_points(my_sfc)
Extract their coordinates
nodes_coordinates <- sfheaders::sfc_to_df(nodes)
Convert to data.table
format
setDT(nodes_coordinates)
Extract unique coordinates
cols_union <- intersect(colnames(nodes_coordinates), c("x", "y", "z", "m"))
unique_coordinates <- unique(nodes_coordinates, by = cols_union)
Extract the ID of the row numbers
unique_coordinates[nodes_coordinates, which = TRUE, on = cols_union]
#> [1] 1 2 3 2 2 4 2 5 4 6 6 3 3 7 5 7
The output is identical to the current procedure:
match(nodes, unique(nodes))
#> [1] 1 2 3 2 2 4 2 5 4 6 6 3 3 7 5 7
Summarise the previous steps into a function
my_match_with_unique <- function(nodes) {
nodes_coordinates <- sfheaders::sfc_to_df(nodes)
setDT(nodes_coordinates)
cols_union <- intersect(colnames(nodes_coordinates), c("x", "y", "z", "m"))
unique_coordinates <- unique(nodes_coordinates, by = cols_union)
unique_coordinates[nodes_coordinates, which = TRUE, on = cols_union]
}
and test it again
my_match_with_unique(nodes)
#> [1] 1 2 3 2 2 4 2 5 4 6 6 3 3 7 5 7
Define the degenerate data used in the previous example
degenerate_sfc <- st_sfc(
st_linestring(rbind(
c(
-59.81698520799989893248493899591267108917236328125,
-13.6634482960000003259892764617688953876495361328125
),
c(
-59.81698520799988472163022379390895366668701171875,
-13.6634482960000003259892764617688953876495361328125
)
)),
st_linestring(rbind(c(1, 2), c(3, 4))),
crs = 4326
)
Extract the nodes
nodes <- sfnetworks:::linestring_boundary_points(degenerate_sfc)
Compare new code
my_match_with_unique(nodes)
#> [1] 1 2 3 4
Current (wrong) code
match(nodes, unique(nodes))
#> [1] 1 1 3 4
Test it again with roxel
data
nodes <- sfnetworks:::linestring_boundary_points(sfnetworks::roxel)
all(my_match_with_unique(nodes) == match(nodes, unique(nodes)))
#> [1] TRUE
A slightly more difficult example:
iow <- osmextract::oe_get("Isle of Wight")
#> The input place was matched with: Isle of Wight
#> The chosen file was already detected in the download directory. Skip downloading.
#> The corresponding gpkg file was already detected. Skip vectortranslate operations.
#> Reading layer `lines' from data source `C:\Users\Utente\Documents\osm-data\geofabrik_isle-of-wight-latest.gpkg' using driver `GPKG'
#> Simple feature collection with 45621 features and 9 fields
#> geometry type: LINESTRING
#> dimension: XY
#> bbox: xmin: -5.401978 ymin: 43.35489 xmax: -0.175775 ymax: 50.89601
#> geographic CRS: WGS 84
nodes <- sfnetworks:::linestring_boundary_points(iow)
bench::mark(
new = my_match_with_unique(nodes),
current = match(nodes, unique(nodes)),
check = TRUE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 new 68.7ms 76.37ms 12.5 8.69MB 1.79
#> 2 current 1.08s 1.08s 0.923 6.74MB 0.923
Even more difficult
greater_london <- osmextract::oe_get("Greater London")
#> The input place was matched with: Greater London
#> The chosen file was already detected in the download directory. Skip downloading.
#> The corresponding gpkg file was already detected. Skip vectortranslate operations.
#> Reading layer `lines' from data source `C:\Users\Utente\Documents\osm-data\geofabrik_greater-london-latest.gpkg' using driver `GPKG'
#> Simple feature collection with 423239 features and 9 fields
#> geometry type: LINESTRING
#> dimension: XY
#> bbox: xmin: -0.6125036 ymin: 51.22684 xmax: 0.399669 ymax: 51.78736
#> geographic CRS: WGS 84
nodes <- sfnetworks:::linestring_boundary_points(greater_london)
bench::mark(
new = my_match_with_unique(nodes),
current = match(nodes, unique(nodes)),
check = TRUE
)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 new 729.49ms 729.49ms 1.37 76.9MB 0
#> 2 current 8.59s 8.59s 0.116 57.1MB 0
Created on 2021-03-13 by the reprex package (v1.0.0)
To summarise:
data.table
dependency. I know that it may create more problems than benefits, but I don't know how to replicate the following code using regular data.frame
or Matrix
objects:
unique_coordinates[nodes_coordinates, which = TRUE, on = cols_union]
EDIT: Add more details
I also tested a combination of igraph
and sf::st_equals_exact
setting par = .Machine$double.eps
but the problem is that running sf::st_equals_exact
on a lot of nodes is quite slow. The same examples presented before run in 100/1000x the current time.
A more general question: I think that the problem of matching and extracting unique sfg
objects is much more general than our problems but, AFAIK, there are no ad-hoc methods in sf
. Hence, I was thinking of creating a new issue there asking for feedback. I will never be able to code the new functions myself with Rcpp or similar stuff so, for the moment, I would simply propose the solution detailed above.
Thanks @agila5! Interesting approach, did not think before of doing it like that. Personally I am a bit reluctant to include the data.table
dependency, since we already have quite some heavy dependencies with the tidyverse packages, either directly (dplyr and tibble) or indirectly through tidygraph. But when it adds large advantages, also for other parts of the code, we should of course consider it either way.
I have been thinking about this and came up with an idea of using st_equals
inside a custom st_match
function. The idea would be to find for each point the set indices of the points that equal it, and then take the smallest of these indices. Then, we use match
as before, but on the integer indices instead of the coordinate values themselves. Something like:
st_match = function(x) {
idxs = do.call("c", lapply(st_equals(x), `[`, 1))
match(idxs, unique(idxs))
}
Some checks and benchmarks. Note that the runtime of st_match
is comparable (even a bit faster) to our current implementation, but slower than the data.table solution:
nodes <- sfnetworks:::linestring_boundary_points(my_sfc)
my_match_with_unique(nodes)
# [1] 1 2 3 2 2 4 2 5 4 6 6 3 3 7 5 7
st_match(nodes)
# [1] 1 2 3 2 2 4 2 5 4 6 6 3 3 7 5 7
nodes <- sfnetworks:::linestring_boundary_points(degenerate_sfc)
my_match_with_unique(nodes)
# [1] 1 2 3 4
st_match(nodes)
# [1] 1 2 3 4
nodes <- sfnetworks:::linestring_boundary_points(sfnetworks::roxel)
all(my_match_with_unique(nodes) == st_match(nodes))
# [1] TRUE
iow <- osmextract::oe_get("Isle of Wight")
#> The input place was matched with: Isle of Wight
#> The chosen file was already detected in the download directory. Skip downloading.
#> The corresponding gpkg file was already detected. Skip vectortranslate operations.
#> Reading layer `lines' from data source `C:\Users\Utente\Documents\osm-data\geofabrik_isle-of-wight-latest.gpkg' using driver `GPKG'
#> Simple feature collection with 45621 features and 9 fields
#> geometry type: LINESTRING
#> dimension: XY
#> bbox: xmin: -5.401978 ymin: 43.35489 xmax: -0.175775 ymax: 50.89601
#> geographic CRS: WGS 84
nodes <- sfnetworks:::linestring_boundary_points(iow)
bench::mark(
dt = my_match_with_unique(nodes),
st = st_match(nodes),
base = match(nodes, unique(nodes)),
check = TRUE
)
# A tibble: 3 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
# 1 dt 46.3ms 51.4ms 19.7 8.75MB 9.84 10
# 2 st 685.4ms 685.4ms 1.46 8.23MB 4.38 1
# 3 base 847.6ms 847.6ms 1.18 5.76MB 2.36 1
# … with 6 more variables: n_gc <dbl>, total_time <bch:tm>,
# result <list>, memory <list>, time <list>, gc <list>
greater_london <- osmextract::oe_get("Greater London")
#> The input place was matched with: Greater London
#> The chosen file was already detected in the download directory. Skip downloading.
#> The corresponding gpkg file was already detected. Skip vectortranslate operations.
#> Reading layer `lines' from data source `C:\Users\Utente\Documents\osm-data\geofabrik_greater-london-latest.gpkg' using driver `GPKG'
#> Simple feature collection with 423239 features and 9 fields
#> geometry type: LINESTRING
#> dimension: XY
#> bbox: xmin: -0.6125036 ymin: 51.22684 xmax: 0.399669 ymax: 51.78736
#> geographic CRS: WGS 84
nodes <- sfnetworks:::linestring_boundary_points(greater_london)
bench::mark(
dt = my_match_with_unique(nodes),
st = st_match(nodes),
base = match(nodes, unique(nodes)),
check = TRUE
)
# A tibble: 3 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
# 1 dt 955.89ms 955.89ms 1.05 81.3MB 1.05 1
# 2 st 7.6s 7.6s 0.132 75.4MB 0.527 1
# 3 base 7.97s 7.97s 0.126 51MB 0.251 1
# … with 6 more variables: n_gc <dbl>, total_time <bch:tm>,
# result <list>, memory <list>, time <list>, gc <list>
What I like about the st_equals solution:
What I like about the data.table solution:
I think we need to balance the pros and cons. Any ideas on that?
Hi @luukvdmeer. Great ideas and great solution (as always). I would adopt your approach since it adds no extra dependency and (I'm pretty sure) nobody cares about 6 seconds of extra computing time when working with such a large spatial network.
Here's a C++ version that has 3 important advantages:
precision
parameter to control the precision used to determine equalityThe source code is this, presumed to be in a local file src.cpp
:
#include <utility> // for std::pair
#include <Rcpp.h>
typedef std::pair <long int, long int> nodePair_t;
struct pair_hash {
inline std::size_t operator() (const nodePair_t & n) const {
return n.first * 31. + n.second;
}
};
// [[Rcpp::export]]
Rcpp::IntegerVector cpp_match (Rcpp::List g, double precision = 1e-10) {
const long int prec_int = round (1 / precision);
Rcpp::IntegerVector index (g.size ());
std::unordered_map <nodePair_t, int, pair_hash> nodeMap;
int count = 0;
int n = 0;
int index_i = 0;
for (auto i: g) {
const std::vector <double> xy = Rcpp::as <std::vector <double>> (i);
nodePair_t nodes = std::make_pair (
round (xy [0] * prec_int),
round (xy [1] * prec_int));
if (nodeMap.find (nodes) == nodeMap.end ()) {
n = count;
nodeMap.emplace (nodes, count++);
} else {
n = nodeMap.at (nodes);
}
// store n+1 for direct 1-based R indexing:
index (index_i++) = n + 1;
}
return index;
}
And here is the benchmarking against the current solutions:
library (sfnetworks)
library(sf)
#> Linking to GEOS 3.8.1, GDAL 3.0.4, PROJ 6.3.2
library (data.table)
library(Rcpp)
library (osmextract)
sourceCpp ("src.cpp")
iow <- osmextract::oe_get("Isle of Wight")
nodes <- sfnetworks:::linestring_boundary_points(iow)
my_match_with_unique <- function(nodes) {
nodes_coordinates <- sfheaders::sfc_to_df(nodes)
setDT(nodes_coordinates)
cols_union <- intersect(colnames(nodes_coordinates), c("x", "y", "z", "m"))
unique_coordinates <- unique(nodes_coordinates, by = cols_union)
unique_coordinates[nodes_coordinates, which = TRUE, on = cols_union]
}
st_match = function(x) {
idxs = do.call("c", lapply(st_equals(x), `[`, 1))
match(idxs, unique(idxs))
}
bench::mark(
dt = my_match_with_unique(nodes),
st = st_match (nodes),
base = match(nodes, unique(nodes)),
cpp = cpp_match (nodes),
check = TRUE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 4 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 dt 45.7ms 49ms 19.6 NA 7.83
#> 2 st 750.3ms 750ms 1.33 NA 4.00
#> 3 base 706.5ms 707ms 1.42 NA 0
#> 4 cpp 46.1ms 47ms 21.2 NA 0
# and most importrantly:
identical (st_match (nodes), cpp_match (nodes))
#> TRUE
Created on 2021-03-15 by the reprex package (v1.0.0)
Also a clear demonstration of how amazing data.table
really is!! But the data.table
solution should not be implemented because it still can't control for numeric precision in equality of floating-point numbers. st_match
can via st_equals
, but is way slower than straight C++.
@luukvdmeer You currently have no src/
directory, and no C++ code at all. If you want to include this, it would minimally require, I would suggest, one additional cpp11
dependency, which is very lightweight. I strongly suspect you'll eventually want to rewrite functionality in C++ regardless, so think adding a dep like that would be unavoidable in the long run. Let me know if you want a PR along these lines.
Hi @mpadge. I don't understand the C++ code, but thank you very much for your assistance!
Sorry @agila5, to clarify: save the C++ chunk above as a local file called src.cpp
, then just run the following to load that file into your workspace:
library (Rcpp)
sourceCpp ("src.cpp")
You can then call it just like the other functions, and use it in benchmarking. Don't worry about the details of how it works, but please do mess around with the precision
value! LIke this:
library (sfnetworks)
library(sf)
#> Linking to GEOS 3.8.1, GDAL 3.0.4, PROJ 6.3.2
library(Rcpp)
library (osmextract)
sourceCpp ("src.cpp")
iow <- osmextract::oe_get("Isle of Wight")
nodes <- sfnetworks:::linestring_boundary_points(iow)
precision <- 10 ^ -(1:10)
nunique <- vapply (precision, function (i)
length (unique (cpp_match (nodes, precision = i))),
integer (1))
x <- data.frame (precision = precision,
n = nunique)
library (ggplot2)
ggplot (x, aes (x = precision, y = n)) +
geom_line () +
scale_x_log10 () +
scale_y_log10 ()
Created on 2021-03-15 by the reprex package (v1.0.0)
Thanks @mpadge that looks great! One question I have: the default precision you use (i.e. 1e-10
), do you know if this is a commonly used default for such equality checks, e.g. also used in st_equals
(I guess they should also internally set some precision right?)
There is no specific reason, just a generic value; lots of routines use around 1e-12
because you want values at least thousands of times larger than .Machine$double.eps
, which is often around 1e-16
. sf
itself is a kind of look-to-the-future system in which data will specify their own intrinsic precision (see st_precision()
, for example). This is all part of the WKB spec, and all works perfectly as long as data do indeed specify their own precision. In the meantime, sf
has a default value here:
which happens to perfectly sensibly be 0.0. And that means extracting st_precision()
from any data which do not explicitly set that value - notably including all OSM data -- simply returns zero. sf
then rightly uses that value, and so effectively presumes geometries are specified at (arbitrary and unreliable) machine precision. What that really means is that it is up to users of st_equals
to confirm themselves what they are doing. And what they should be doing is using st_equals_exact()
with an explicit value for par
. And that returns to general computational practices of setting it at least thousands of times greater than eps
, so defaulting to something like O(e-10 -- e-12). The above graph is meant to show that it really doesn't matter what the value is, as long as it is something less than e-6 or so.
These considerations nevertheless suggest that it would be good to first check st_precision()
of the input data, and to take that value where it is > 0, otherwise to default to 1e-12 or whatever.
Good morning @mpadge, and (again) thank you very much for your time, help, and explanations. I have just one more consideration regarding the precision
argument. I tested the C++ code using projected coordinates (i.e. not 4326), and I think that, in those cases, we should adopt a different default value:
# packages
library(ggplot2)
# load cpp function
Rcpp::sourceCpp("C:/Users/Utente/Desktop/src.cpp")
# OSM data
iow <- sf::st_transform(osmextract::oe_get("Isle of Wight"), 27700)
nodes <- sfnetworks:::linestring_boundary_points(iow)
precision <- 10 ^ -(1:12)
nunique <- vapply (precision, function (i) {
length(unique (cpp_match (nodes, precision = i)))
},
integer (1)
)
x <- data.frame (
precision = precision,
n = nunique
)
ggplot (x, aes (x = precision, y = n)) +
geom_line () +
scale_x_log10 () +
scale_y_log10 ()
Created on 2021-03-16 by the reprex package (v1.0.0)
Maybe it's possible to modify the C++ code and merge the two scenarios (I don't know), but I just wanted to highlight that (at the moment) the default value of precision
should take into account the unit of measurement of input data (meter
or degree
)
That can be handled straight in R by something like the following lines, which estimate how many digits are used to represent the whole-number part of coordinates, as well as the maximal possible number that can be used to store integer values.
n <- vapply (nodes [[1]], function (i)
nchar (as.character (round (i))),
integer (1))
nmax <- nchar (as.character (.Machine$integer.max))
For 4326, n
is (2, 2), whereas for 27700, it is (6, 5). nmax
is 10, and it would generally suffice to then set precision
= 10 ^ (-(nmax - n + extra_bit))
, where extra_bit
might be 1 or 2 just to make sure. With extra_bit = 2
, you'd then get precision = 1e-10
for 4326, and 1e-6
for 27700. Note that the C++ code uses a long int
which does not exist in R, and is much bigger than R's .Machine$integer.max
. This is nevertheless still machine dependent, and @agila5, I'd guess somehow your machine is doing something strange with that, because n
should never increase like in your result. I just get a perfectly flat line from your code, with precision having no effect. You are clearly seeing an overflow effect, where n
increases because your values for small precision start getting rounded to some upper limit of your machine representation of integer values. This is definitely a machine effect on your side, and illustrates precisely the kind of reason why explicitly considering and controlling precision is really important. The above code snippets should do the job.
This is nevertheless still machine dependent, and @agila5, I'd guess somehow your machine is doing something strange with that, because n should never increase like in your result. I just get a perfectly flat line from your code, with precision having no effect.
I think it must be related to the fact that I use Windows on my laptop because when I run the same code on a virtual machine (running ubuntu 16.04) I observe the same behaviour that you describe (i.e. a flat line going down for smaller values of precision).
Just for "fun", I tried to test those problems a little bit more (since I really don't want to update sfnetworks
with something that makes my laptop useless 😅), and these are the results. It's a long comment, sorry about that. I add as much verbose output and details as possible (in case they are somehow useful).
On Ubuntu 16.04 I observe the same behaviour that you describe, i.e. C++ code uses a long int which does not exist in R and is much bigger than R's .Machine$integer.max
.
# load cpp function
Rcpp::sourceCpp(
code = '
#include <Rcpp.h>
// [[Rcpp::export]]
int cpp_match (double precision = 1e-10) {
const long int prec_int = round (1 / precision);
Rcpp::Rcout << "prec_int is equal to: " << prec_int << std::endl;
return prec_int;
}',
verbose = TRUE
)
#>
#> Generated extern "C" functions
#> --------------------------------------------------------
#>
#>
#> #include <Rcpp.h>
#> // cpp_match
#> int cpp_match(double precision);
#> RcppExport SEXP sourceCpp_1_cpp_match(SEXP precisionSEXP) {
#> BEGIN_RCPP
#> Rcpp::RObject rcpp_result_gen;
#> Rcpp::RNGScope rcpp_rngScope_gen;
#> Rcpp::traits::input_parameter< double >::type precision(precisionSEXP);
#> rcpp_result_gen = Rcpp::wrap(cpp_match(precision));
#> return rcpp_result_gen;
#> END_RCPP
#> }
#>
#> Generated R functions
#> -------------------------------------------------------
#>
#> `.sourceCpp_1_DLLInfo` <- dyn.load('/tmp/RtmpmMDLp0/sourceCpp-x86_64-pc-linux-gnu-1.0.6/sourcecpp_83071fec14b1/sourceCpp_2.so')
#>
#> cpp_match <- Rcpp:::sourceCppFunction(function(precision = 1e-10) {}, FALSE, `.sourceCpp_1_DLLInfo`, 'sourceCpp_1_cpp_match')
#>
#> rm(`.sourceCpp_1_DLLInfo`)
#>
#> Building shared library
#> --------------------------------------------------------
#>
#> DIR: /tmp/RtmpmMDLp0/sourceCpp-x86_64-pc-linux-gnu-1.0.6/sourcecpp_83071fec14b1
#>
#> /usr/lib/R/bin/R CMD SHLIB -o 'sourceCpp_2.so' 'file830749bf89dd.cpp'
res <- vapply(
X = 10 ^ (-(0:12)),
FUN = cpp_match,
FUN.VALUE = integer(1)
)
#> prec_int is equal to: 1
#> prec_int is equal to: 10
#> prec_int is equal to: 100
#> prec_int is equal to: 1000
#> prec_int is equal to: 10000
#> prec_int is equal to: 100000
#> prec_int is equal to: 1000000
#> prec_int is equal to: 10000000
#> prec_int is equal to: 100000000
#> prec_int is equal to: 1000000000
#> prec_int is equal to: 10000000000
#> prec_int is equal to: 100000000000
#> prec_int is equal to: 1000000000000
res
#> [1] 1 10 100 1000 10000 100000
#> [7] 1000000 10000000 100000000 1000000000 1410065408 1215752192
#> [13] -727379968
Created on 2021-03-17 by the reprex package (v0.3.0)
On the other hand, on Windows:
# load cpp function
Rcpp::sourceCpp(
code = '
#include <Rcpp.h>
// [[Rcpp::export]]
int cpp_match (double precision = 1e-10) {
const long int prec_int = round (1 / precision);
Rcpp::Rcout << "prec_int is equal to: " << prec_int << std::endl;
return prec_int;
}',
verbose = TRUE
)
#>
#> Generated extern "C" functions
#> --------------------------------------------------------
#>
#>
#> #include <Rcpp.h>
#> // cpp_match
#> int cpp_match(double precision);
#> RcppExport SEXP sourceCpp_1_cpp_match(SEXP precisionSEXP) {
#> BEGIN_RCPP
#> Rcpp::RObject rcpp_result_gen;
#> Rcpp::RNGScope rcpp_rngScope_gen;
#> Rcpp::traits::input_parameter< double >::type precision(precisionSEXP);
#> rcpp_result_gen = Rcpp::wrap(cpp_match(precision));
#> return rcpp_result_gen;
#> END_RCPP
#> }
#>
#> Generated R functions
#> -------------------------------------------------------
#>
#> `.sourceCpp_1_DLLInfo` <- dyn.load('C:/Users/Utente/AppData/Local/Temp/RtmpSUEJy1/sourceCpp-x86_64-w64-mingw32-1.0.6/sourcecpp_2bd017893460/sourceCpp_2.dll')
#>
#> cpp_match <- Rcpp:::sourceCppFunction(function(precision = 1e-10) {}, FALSE, `.sourceCpp_1_DLLInfo`, 'sourceCpp_1_cpp_match')
#>
#> rm(`.sourceCpp_1_DLLInfo`)
#>
#> Building shared library
#> --------------------------------------------------------
#>
#> DIR: C:/Users/Utente/AppData/Local/Temp/RtmpSUEJy1/sourceCpp-x86_64-w64-mingw32-1.0.6/sourcecpp_2bd017893460
#>
#> C:/PROGRA~1/R/R-40~1.4/bin/x64/R CMD SHLIB -o "sourceCpp_2.dll" "file2bd053da4388.cpp"
res <- vapply(
X = 10 ^ (-(0:12)),
FUN = cpp_match,
FUN.VALUE = integer(1)
)
#> prec_int is equal to: 1
#> prec_int is equal to: 10
#> prec_int is equal to: 100
#> prec_int is equal to: 1000
#> prec_int is equal to: 10000
#> prec_int is equal to: 100000
#> prec_int is equal to: 1000000
#> prec_int is equal to: 10000000
#> prec_int is equal to: 100000000
#> prec_int is equal to: 1000000000
#> prec_int is equal to: -2147483648
#> prec_int is equal to: -2147483648
#> prec_int is equal to: -2147483648
res
#> [1] 1 10 100 1000 10000 100000
#> [7] 1000000 10000000 100000000 1000000000 NA NA
#> [13] NA
Created on 2021-03-17 by the reprex package (v1.0.0)
Unfortunately, that also implies that I cannot use cpp_match
on my laptop when precision
is greater or equal to 10^-9
. For example (with CRS: 4326):
# load cpp function
Rcpp::sourceCpp("src.cpp") # the same code that you shared
# load data
roads <- sfnetworks::roxel
# extract nodes
nodes <- sfnetworks:::linestring_boundary_points(roads)
table(cpp_match(nodes), useNA = "ifany")
#>
#> 1
#> 1702
Created on 2021-03-17 by the reprex package (v1.0.0)
Does it make sense to use long long int
or unsigned long long int
? I really don't know anything about C types, but I read on the wiki page that they should support bigger numbers (stored as exact integers). For example (on Windows):
# load cpp function
Rcpp::sourceCpp(
code = '
#include <Rcpp.h>
// [[Rcpp::export]]
int cpp_match (double precision = 1e-10) {
const unsigned long long int prec_int = round (1 / precision);
Rcpp::Rcout << "prec_int is equal to: " << prec_int << std::endl;
return prec_int;
}',
verbose = TRUE
)
#>
#> Generated extern "C" functions
#> --------------------------------------------------------
#>
#>
#> #include <Rcpp.h>
#> // cpp_match
#> int cpp_match(double precision);
#> RcppExport SEXP sourceCpp_1_cpp_match(SEXP precisionSEXP) {
#> BEGIN_RCPP
#> Rcpp::RObject rcpp_result_gen;
#> Rcpp::RNGScope rcpp_rngScope_gen;
#> Rcpp::traits::input_parameter< double >::type precision(precisionSEXP);
#> rcpp_result_gen = Rcpp::wrap(cpp_match(precision));
#> return rcpp_result_gen;
#> END_RCPP
#> }
#>
#> Generated R functions
#> -------------------------------------------------------
#>
#> `.sourceCpp_1_DLLInfo` <- dyn.load('C:/Users/Utente/AppData/Local/Temp/RtmpE7A6zT/sourceCpp-x86_64-w64-mingw32-1.0.6/sourcecpp_39b04576172f/sourceCpp_2.dll')
#>
#> cpp_match <- Rcpp:::sourceCppFunction(function(precision = 1e-10) {}, FALSE, `.sourceCpp_1_DLLInfo`, 'sourceCpp_1_cpp_match')
#>
#> rm(`.sourceCpp_1_DLLInfo`)
#>
#> Building shared library
#> --------------------------------------------------------
#>
#> DIR: C:/Users/Utente/AppData/Local/Temp/RtmpE7A6zT/sourceCpp-x86_64-w64-mingw32-1.0.6/sourcecpp_39b04576172f
#>
#> C:/PROGRA~1/R/R-40~1.4/bin/x64/R CMD SHLIB -o "sourceCpp_2.dll" "file39b067833a46.cpp"
res <- vapply(
X = 10 ^ (-(0:16)),
FUN = cpp_match,
FUN.VALUE = integer(1)
)
#> prec_int is equal to: 1
#> prec_int is equal to: 10
#> prec_int is equal to: 100
#> prec_int is equal to: 1000
#> prec_int is equal to: 10000
#> prec_int is equal to: 100000
#> prec_int is equal to: 1000000
#> prec_int is equal to: 10000000
#> prec_int is equal to: 100000000
#> prec_int is equal to: 1000000000
#> prec_int is equal to: 10000000000
#> prec_int is equal to: 100000000000
#> prec_int is equal to: 1000000000000
#> prec_int is equal to: 10000000000000
#> prec_int is equal to: 100000000000000
#> prec_int is equal to: 1000000000000000
#> prec_int is equal to: 10000000000000000
res
#> [1] 1 10 100 1000 10000 100000
#> [7] 1000000 10000000 100000000 1000000000 1410065408 1215752192
#> [13] -727379968 1316134912 276447232 -1530494976 1874919424
Created on 2021-03-17 by the reprex package (v1.0.0)
I have no idea about the drawbacks.
Wow, thanks @agila5, those results on windows are really ... interesting. The short answer to your Q is, Yes, you can easily just use long long int
instead of long int
, and there are no drawbacks at all. There shouldn't even be any speed drawbacks.
A slight longer answer: if you really wanted to ensure that all was truly okay back over in R, you could determine a precision value like this:
xymax <- max (abs (st_coordinates (nodes)))
precision <- .Machine$integer.max / xymax
precision <- nchar (as.character (round (precision)))
extra_bit <- 2 # or whatever
precision <- 1 / 10 ^ (precision - extra_bit)
But using long long int
would mean you shouldn't even need that. These long int
/ long long int
things only have to exist in C++ to determine whether nodal coordinates are "equal" or not, and have nothing to do with R, so it doesn't matter that your values returned to R are complete rubbish, as long as the C++ representations are good.
Just to fix the bug for the time being I am now using st_match
instead of match
. This takes the "high priority" of this issue. Still interested to add the C++ version, but need more time to look into it (which I am currently short of!).
No worries @luukvdmeer, time is precious. As said, the C++ code is all there, so happy to PR when you've got time to review it - just let me know.
Describe the bug There is a bug in
create_nodes_from_edges()
when the input coordinates are really really really close but not exactly "identical" (I'm not even sure how to define "identical" here).Reproducible example
The problem occurs in
create_nodes_from_edges()
and, more precisely, here: https://github.com/luukvdmeer/sfnetworks/blob/eba28b706ae0a5476b946b67c5a25a7fc1591006/R/utils.R#L68-L72 In fact,has only three rows. Please note that the last line of code was taken from https://github.com/luukvdmeer/sfnetworks/blob/eba28b706ae0a5476b946b67c5a25a7fc1591006/R/utils.R#L81 Anyway, that problem implies that the IDs used by
tidygraph
are wrong, creating an error. Hence, sooner or later I think we should remove the code that compares the equality of numerical values (which is also the problem flagged by @mpadge here).I know that it sounds like a pathological example, but I found the problem while running some R code for a GIS SE question.
Expected behavior No error.
R Session Info
Session info
``` r devtools::session_info() #> - Session info --------------------------------------------------------------- #> setting value #> version R version 3.6.3 (2020-02-29) #> os Windows 10 x64 #> system x86_64, mingw32 #> ui RTerm #> language (EN) #> collate Italian_Italy.1252 #> ctype Italian_Italy.1252 #> tz Europe/Berlin #> date 2021-02-21 #> #> - Packages ------------------------------------------------------------------- #> package * version date lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0) #> callr 3.5.1 2020-10-13 [1] CRAN (R 3.6.3) #> class 7.3-17 2020-04-26 [1] CRAN (R 3.6.3) #> classInt 0.4-3 2020-04-07 [1] CRAN (R 3.6.3) #> cli 2.3.0 2021-01-31 [1] CRAN (R 3.6.3) #> colorspace 2.0-0 2020-11-11 [1] CRAN (R 3.6.3) #> crayon 1.4.1 2021-02-08 [1] CRAN (R 3.6.3) #> DBI 1.1.1 2021-01-15 [1] CRAN (R 3.6.3) #> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0) #> devtools 2.3.2 2020-09-18 [1] CRAN (R 3.6.3) #> digest 0.6.27 2020-10-24 [1] CRAN (R 3.6.3) #> dplyr 1.0.4 2021-02-02 [1] CRAN (R 3.6.3) #> e1071 1.7-4 2020-10-14 [1] CRAN (R 3.6.3) #> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 3.6.3) #> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0) #> fs 1.5.0 2020-07-31 [1] CRAN (R 3.6.3) #> generics 0.1.0 2020-10-31 [1] CRAN (R 3.6.3) #> ggplot2 3.3.2 2020-06-19 [1] CRAN (R 3.6.3) #> glue 1.4.2 2020-08-27 [1] CRAN (R 3.6.3) #> gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0) #> highr 0.8 2019-03-20 [1] CRAN (R 3.6.0) #> htmltools 0.5.0 2020-06-16 [1] CRAN (R 3.6.3) #> igraph 1.2.6 2020-10-06 [1] CRAN (R 3.6.3) #> KernSmooth 2.23-18 2020-10-29 [1] CRAN (R 3.6.3) #> knitr 1.30 2020-09-22 [1] CRAN (R 3.6.3) #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 3.6.3) #> lwgeom 0.2-5 2020-06-12 [1] CRAN (R 3.6.3) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 3.6.3) #> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0) #> munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0) #> pillar 1.4.7 2020-11-20 [1] CRAN (R 3.6.3) #> pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 3.6.3) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.1) #> pkgload 1.1.0 2020-05-29 [1] CRAN (R 3.6.3) #> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.2) #> processx 3.4.4 2020-09-03 [1] CRAN (R 3.6.3) #> ps 1.4.0 2020-10-07 [1] CRAN (R 3.6.3) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 3.6.3) #> R6 2.5.0 2020-10-28 [1] CRAN (R 3.6.3) #> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 3.6.3) #> remotes 2.2.0 2020-07-21 [1] CRAN (R 3.6.3) #> rlang 0.4.10 2020-12-30 [1] CRAN (R 3.6.3) #> rmarkdown 2.7 2021-02-19 [1] CRAN (R 3.6.3) #> rprojroot 2.0.2 2020-11-15 [1] CRAN (R 3.6.3) #> scales 1.1.1 2020-05-11 [1] CRAN (R 3.6.3) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0) #> sf * 0.9-7 2021-01-06 [1] CRAN (R 3.6.3) #> sfheaders 0.4.0 2020-12-01 [1] CRAN (R 3.6.3) #> sfnetworks * 0.5.0 2021-02-21 [1] local #> stringi 1.5.3 2020-09-09 [1] CRAN (R 3.6.3) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0) #> testthat 3.0.0 2020-10-31 [1] CRAN (R 3.6.3) #> tibble 3.0.6 2021-01-29 [1] CRAN (R 3.6.3) #> tidygraph 1.2.0 2020-05-12 [1] CRAN (R 3.6.3) #> tidyr 1.1.2 2020-08-27 [1] CRAN (R 3.6.3) #> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.3) #> units 0.6-7 2020-06-13 [1] CRAN (R 3.6.3) #> usethis 2.0.0 2020-12-10 [1] CRAN (R 3.6.3) #> vctrs 0.3.6 2020-12-17 [1] CRAN (R 3.6.3) #> withr 2.3.0 2020-09-22 [1] CRAN (R 3.6.3) #> xfun 0.19 2020-10-30 [1] CRAN (R 3.6.3) #> yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.2) #> #> [1] C:/Users/Utente/Documents/R/win-library/3.6 #> [2] C:/Program Files/R/R-3.6.3/library ```