Morpher to remove pseudo-nodes

meltzerdl commented 4 years ago

Great functions in this package! I'm trying to find a way to remove pseudonodes. I want to take a road dataset and create two data outputs based on it: road segments and intersections. However, it's important that pseudonodes are removed in creating both outputs. In my case, I am defining pseudonodes as anywhere only two line segments meet at a node (regardless of whether attributes are the same or different for the two segments). So, I would need to take the roads network, merge segments with pseudonodes, and then produce a set of points at all road intersections (i.e. where 3 or more road segments meet).

I've used stplanr for this in the following way, with success:

Roads <- st_cast(roads_input_data, "MULTILINESTRING") %>% st_combine() %>% as_Spatial() %>% gsection() #from stplanr Roads_output <- st_as_sf(Roads) Nodes_output <- SpatialLinesNetwork(Roads) %>% #from stplanr sln2points() %>% #from stplanr st_as_sf()

However, for large statewide roads datasets this process is so slow as to be impractical and I'm looking for a faster way. If I could run it in parallel, cutting up the dataset by county (or smaller) and then putting the final dataset together, that could speed it up. however, in this case I would still need a way to remove pseudonodes produced by process of cutting up the roads data.

So, ideally it would be nice to have a function to remove pseudonodes; i.e. any node where there are only two line segments meeting, regardless of attributes. Could have an option to select certain attributes must match, as some folks might find that important.

Thank you!

agila5 commented 4 years ago

HI @meltzerdl, I think it's a really interesting problem (somehow linked to https://github.com/ropensci/stplanr/issues/416). I hope I will be able to work a little bit on this problem in the next days and update this comment with something more useful.

luukvdmeer commented 4 years ago

Just to get it clear:

In a situation A --- B --- C, where A, B and C are nodes and --- are edges. Lets say A and C have some more edges attached to them, but B is a pseudo-node (in your definition). Then, you want to remove B, but keep a connection between A and C, right?

EDIT That is, the problem explained here. The given answers there seem to imply that igraph does not have a built-in function for that operation.

meltzerdl commented 4 years ago

Yes, so the idea would be that the two edges (between A and B, and between B and C) would be dissolved into one edge, and node B would be removed.

Interesting, I hadn't seen that explanation about an igraph answer.

luukvdmeer commented 4 years ago

Sorry this took me some time.

From what I found here, in a graph with nodes A, B and C and edges A-B and B-C, the process of removing node B and replacing edges A-B and B-C by a single edge A-C is called smoothing. Does that sound familiar to anyone, or is there a better name?

For now, in the develop branch I implemented a first version of this, in the form of a spatial morpher called to_spatial_smoothed. Its functionality is showcased in the reprex below. It is simply iterating over all pseudo nodes and removing them one by one. It is not optimized for performance or anything, so might be slow on large graphs. Lets first test if it works good!

Note that you do need the development version to run this. Install with remotes::install_github("luukvdmeer/sfnetworks", ref = "develop").

library(sfnetworks)

# Create a very simple network with some connected linestrings in Roxel.
net = as_sfnetwork(roxel[c(4, 5, 8), ], directed = FALSE)
net
#> # An sfnetwork with 4 nodes and 3 edges
#> #
#> # CRS:  EPSG:4326 
#> #
#> # An unrooted tree with spatially explicit edges
#> #
#> # Node Data:     4 x 1 (active)
#> # Geometry type: POINT
#> # Dimension:     XY
#> # Bounding box:  xmin: 7.537614 ymin: 51.94468 xmax: 7.540063 ymax: 51.9475
#>              geometry
#>           <POINT [°]>
#> 1 (7.540063 51.94468)
#> 2  (7.53822 51.94546)
#> 3  (7.537673 51.9475)
#> 4 (7.537614 51.94562)
#> #
#> # Edge Data:     3 x 5
#> # Geometry type: LINESTRING
#> # Dimension:     XY
#> # Bounding box:  xmin: 7.537614 ymin: 51.94468 xmax: 7.540063 ymax: 51.9475
#>    from    to name      type                                            geometry
#>   <int> <int> <fct>     <fct>                                   <LINESTRING [°]>
#> 1     1     2 <NA>      path    (7.540063 51.94468, 7.539696 51.94479, 7.539466…
#> 2     3     4 Welsingh… reside…            (7.537673 51.9475, 7.537614 51.94562)
#> 3     2     4 <NA>      path    (7.53822 51.94546, 7.538131 51.94549, 7.538027 …

# The network has four nodes, of which two are 'pseudo'.
plot(net)


# The 'to_spatial_smoothed' morpher function removes these pseudo nodes.
# A new edge is be created between the remaining two nodes.
# The geometry of that edge is equal to the merged orignal edges.
net_smoothed = tidygraph::convert(net, to_spatial_smoothed)
net_smoothed
#> # An sfnetwork with 2 nodes and 1 edges
#> #
#> # CRS:  EPSG:4326 
#> #
#> # An unrooted tree with spatially explicit edges
#> #
#> # Node Data:     2 x 2 (active)
#> # Geometry type: POINT
#> # Dimension:     XY
#> # Bounding box:  xmin: 7.537672 ymin: 51.94468 xmax: 7.540063 ymax: 51.9475
#>              geometry .tidygraph_node_index
#>           <POINT [°]>                 <int>
#> 1 (7.540063 51.94468)                     1
#> 2  (7.537673 51.9475)                     3
#> #
#> # Edge Data:     1 x 5
#> # Geometry type: LINESTRING
#> # Dimension:     XY
#> # Bounding box:  xmin: 7.537614 ymin: 51.94468 xmax: 7.540063 ymax: 51.9475
#>    from    to                             geometry .tidygraph_edge_… .orig_data 
#>   <int> <int>                     <LINESTRING [°]> <list>            <list>     
#> 1     1     2 (7.537673 51.9475, 7.537614 51.9456… <int [3]>         <tibble [3…

plot(net_smoothed)


# In line with tidygraphs morphers:
# The data of the original edges are stored in a '.orig_data' column.
# This behaviour can be turned off by setting 'store_orignal_data = FALSE'.
net_smoothed %>% 
  activate("edges") %>% 
  tidygraph::pull(".orig_data")
#> [[1]]
#> Simple feature collection with 3 features and 5 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 7.537614 ymin: 51.94468 xmax: 7.540063 ymax: 51.9475
#> CRS:            EPSG:4326
#> # A tibble: 3 x 6
#>    from    to name    type                             geometry .tidygraph_edge…
#>   <int> <int> <fct>   <fct>                    <LINESTRING [°]>            <int>
#> 1     3     4 Welsin… reside… (7.537673 51.9475, 7.537614 51.9…                2
#> 2     1     2 <NA>    path    (7.540063 51.94468, 7.539696 51.…                1
#> 3     2     4 <NA>    path    (7.53822 51.94546, 7.538131 51.9…                3

# An additional parameter that can be set is 'require_equal_attrs'.
# This will only merge edges when their attributes are equal (default = FALSE).
# In our example network they were not.
# So smoothing does not remove any nodes.
tidygraph::convert(net, to_spatial_smoothed, require_equal_attrs = TRUE) %>%
  plot()

^{Created on 2020-09-27 by the reprex package (v0.3.0)}

Robinlovelace commented 4 years ago

Just tried this and I confirm, it works. It looks like a solid implementation to me. I tried the following:

remotes::install_github("luukvdmeer/sfnetworks", ref = "develop")

library(sfnetworks)

# Create a very simple network with some connected linestrings in Roxel.
net = as_sfnetwork(roxel[c(4, 5, 8), ], directed = FALSE)
net
plot(net)

net_smoothed = tidygraph::convert(net, to_spatial_smoothed)
net_smoothed
plot(net_smoothed)

net_smoothed %>% 
  activate("edges") %>% 
  tidygraph::pull(".orig_data")
tidygraph::convert(net, to_spatial_smoothed) %>%
  plot()

Confirmed, without the , require_equal_attrs = TRUE it returns the 'smoothed' output. I think can also be called graph contraction. It seems a number of algorithms implementing the 'contraction hierarchies' technique have been developed: https://en.wikipedia.org/wiki/Contraction_hierarchies

Is that the same thing? I guess 'removal of pseudo-nodes' is a particular type of contraction. @fiftysevendegreesofrad, developer of the sDNA package for spatial network analysis, has written some excellent documentation on the wider topic of network preparation: https://sdna.cardiff.ac.uk/sdna/wp-content/downloads/documentation/manual/sDNA_manual_v3_4_5/network_preparation.html. Would be interested in your thoughts on the terminology Crispin, would you call this pseudo-node removal, smoothing, contraction or something else? Also heads-up as I suspect it may be the kind of thing that you or others have already implemented in a lower level and potentially faster-iterating language such as C++.

fiftysevendegreesofrad commented 4 years ago

Presume we're talking about removing nodes with only 2 links joining. The accepted terminology for this seems to be removing pseudo-nodes these days. Though in sDNA I call it fixing split links (and yes there is a c++ implementation in https://github.com/fiftysevendegreesofrad/sdna_open though it leans heavily on sDNA's internal network representation so may not be that useful to you).

Incidentally one thing to consider in such operations is what happens to data attached to the links - you can choose to handle this externally of course using GIS join operations for example, or assume numerical data is summed or averaged (if the latter, weighted or not by unit length), and then there's the question of what happens to text...

luukvdmeer commented 3 years ago

Updated version of this. Now named to_spatial_smooth. More performant, although still not super fast. In the end you have to iterate over all nodes, there is no way out of that. Runs for roxel (700 nodes and 850 edges) in two seconds. Of course, this includes pasting together linestring geometries of merged edges. When running on roxel with spatially implicit edges (i..e no geometries for edges) it runs twice as fast.

suppressPackageStartupMessages({
  library(sfnetworks)
  library(tidygraph)
  library(igraph)
})

net = as_sfnetwork(roxel)
paste(vcount(net), "nodes and", ecount(net), "edges")
#> [1] "701 nodes and 851 edges"

# Without requiring equal attributes.
smoothed = convert(net, to_spatial_smooth)
paste(vcount(smoothed), "nodes and", ecount(smoothed), "edges")
#> [1] "652 nodes and 802 edges"
plot(net, cex = 0.8, col = "red")
plot(smoothed, add = TRUE)


# With requiring equal attributes.
smoothed = convert(net, to_spatial_smooth, require_equal_attrs = TRUE)
paste(vcount(smoothed), "nodes and", ecount(smoothed), "edges")
#> [1] "675 nodes and 825 edges"
plot(net, cex = 0.8, col = "red")
plot(smoothed, add = TRUE)

^{Created on 2020-11-04 by the reprex package (v0.3.0)}

agila5 commented 3 years ago

Hi @luukvdmeer! Would you consider adding a progress bar to this new morpher? I think it could be created using txtProgressBar(), setting pb = sum(!pseudo_remaining). I'm running this morpher on a slightly big network (15k edges) and it has been running for the last hour. The problem is that I have no idea if it will finish in a few seconds or in a few hours.

If you like the idea I can add a PR during the weekend.

luukvdmeer commented 3 years ago

Yes, if that works would be great! I was already scared for performance on large networks.. Could you try it also with the same network but after removing edge geometries (i.e. spatially implicit edges)? Then we can see if the geometry splitting is the part that really slows things down (and might consider to use your implementation of st_split if that makes a significant difference..?).

EDIT: Sorry, I was confused. There are no geometries splitted in this function. Geometries are concatenated.

agila5 commented 3 years ago

Hi @luukvdmeer, I was thinking about this problem and I have one (absolutely not urgent) question. Here you say that:

In the end you have to iterate over all nodes, there is no way out of that.

I don't understand why you need to iterate over all nodes instead of applying something like:

Calculate the degree of all nodes and then filter only the nodes with degree = 2 (or use the "mode" argument for directed networks);
Find the edges that are incident to those nodes and merge them (maybe add extra criteria according to other attributes);
Rebuild the sfnetworks structure.

I did not try to implement the algorithm described above (so maybe there are some obvious drawbacks, I'm not sure), but I don't understand why you need to check the nodes one by one. I just want to discuss these ideas with you before spending several days on something that you have already explored.

EDIT: Actually I just checked the code more carefully and that was a stupid question, sorry. If possible I will try to improve the step 2 of the algorithm (i.e. merging the edges).

idshklein commented 3 years ago

I think this works faster in osmnx. here is the function they are using (I didn't delve into it), maybe it can be used as inspiration: https://github.com/gboeing/osmnx/blob/7e46237dc888f864c052f5daf10348e64cc2a642/osmnx/simplification.py#L195

luukvdmeer commented 3 years ago

I did not try to implement the algorithm described above (so maybe there are some obvious drawbacks, I'm not sure)

I reasoned this would not work good in situation where there are chains of pseudo nodes. In a situation where we have A -- b -- C -- d -- E, with upper-case letters being 'junction' nodes and lower-case letters being pseudo nodes, it is indeed easy to just find and remove the pseudo nodes b and d separately and then create the new edges A -> C after removing b and C -> E after removing d. But when C is also a pseudo node, i.e. we have A -- b -- c -- d -- E, we have to identify that b, c and d form a chain of pseudo nodes and only add one new edge A -> E after removing them. Currently this is implemented in a way that we first process b and replace A -> b and b -> c by a new edge A -> c, then process c and replace A -> c and c -> d by a new edge A -> d, and finally process d and replace A -> d and d -> E by a new edge A -> E. But this iterative processing is not so fast..

I totally agree there must be more efficient ways to do this than the current 'dirty' implementation. Maybe do the looping only to find the indices of pseudo nodes and group them in chains, and then do the concatenation of geometries in one go afterwards. Something like that. But I did not give it enough thought yet.

Getting inspired by osmnx sounds like a good idea!

luukvdmeer commented 3 years ago

This is now part of the new version as the to_spatial_smooth() morpher. See this vignette for explanation.

It does exactly what is has to do, but I'll leave this issue open because we definitely need to improve its speed.

idshklein commented 3 years ago

I did not try to implement the algorithm described above (so maybe there are some obvious drawbacks, I'm not sure)

I reasoned this would not work good in situation where there are chains of pseudo nodes. In a situation where we have A -- b -- C -- d -- E, with upper-case letters being 'junction' nodes and lower-case letters being pseudo nodes, it is indeed easy to just find and remove the pseudo nodes b and d separately and then create the new edges A -> C after removing b and C -> E after removing d. But when C is also a pseudo node, i.e. we have A -- b -- c -- d -- E, we have to identify that b, c and d form a chain of pseudo nodes and only add one new edge A -> E after removing them. Currently this is implemented in a way that we first process b and replace A -> b and b -> c by a new edge A -> c, then process c and replace A -> c and c -> d by a new edge A -> d, and finally process d and replace A -> d and d -> E by a new edge A -> E. But this iterative processing is not so fast..

I totally agree there must be more efficient ways to do this than the current 'dirty' implementation. Maybe do the looping only to find the indices of pseudo nodes and group them in chains, and then do the concatenation of geometries in one go afterwards. Something like that. But I did not give it enough thought yet.

Getting inspired by osmnx sounds like a good idea!

btw, practically, I'm using python's osmnx to download the data I need and simplify pseudo nodes (it works really fast, didn't compare) and sfnetworks to analyze it, using reticulate in the middle. Maybe this should be in discussions, but a function to convert an osmnx network to sfnetworks may be required. Will try to provide script

Robinlovelace commented 3 years ago

@idshklein have you tried downloading networks with osmextract? Should be fast.

luukvdmeer commented 3 years ago

Ping @agila5 I managed to get this +/- 20 times faster. Can be tested!

agila5 commented 3 years ago

Hi @luukvdmeer! Thank you very much for the update, I will test the new version sometimes during the next days. Meanwhile, happy new year 🎉 🥳

loreabad6 commented 3 years ago

Some testing on the data I plan to use for my coming talk in Spanish on sfnetworks:

library(sfnetworks)
library(tidygraph)
library(igraph)

load('E:/Personal_projects/Rspatialnetworks/sfnetworks-REspacialES/data/cuenca.rda')

(net = as_sfnetwork(cuenca))
#> # A sfnetwork with 12283 nodes and 8999 edges
#> #
#> # CRS:  EPSG:4326 
#> #
#> # A directed multigraph with 3646 components with spatially explicit edges
#> #
#> # Node Data:     12,283 x 1 (active)
#> # Geometry type: POINT
#> # Dimension:     XY
#> # Bounding box:  xmin: -79.0708 ymin: -2.935008 xmax: -78.91145 ymax: -2.833441
#>                geometry
#>             <POINT [°]>
#> 1   (-79.028 -2.906154)
#> 2  (-79.02803 -2.90633)
#> 3 (-78.95973 -2.893506)
#> 4 (-78.96173 -2.893512)
#> 5 (-79.00801 -2.891764)
#> 6  (-79.00924 -2.89848)
#> # ... with 12,277 more rows
#> #
#> # Edge Data:     8,999 x 9
#> # Geometry type: LINESTRING
#> # Dimension:     XY
#> # Bounding box:  xmin: -79.08642 ymin: -2.960194 xmax: -78.90349 ymax:
#> #   -2.779669
#>    from    to name  type  lanes oneway sidewalk surface
#>   <int> <int> <chr> <chr> <chr> <chr>  <chr>    <chr>  
#> 1     1     2 <NA>  resi~ <NA>  no     <NA>     <NA>   
#> 2     3     4 <NA>  trun~ 1     yes    right    asphalt
#> 3     5     6 Juan~ resi~ 2     yes    both     sett   
#> # ... with 8,996 more rows, and 1 more variable: geometry <LINESTRING [°]>

# Number of pseudo-nodes (note that roxel has 49)
pseudo = degree(net, mode = "in") == 1 & degree(net, mode = "out") == 1
length(pseudo[pseudo])
#> [1] 1541

microbenchmark::microbenchmark(convert(net, to_spatial_smooth), times = 10)
#> Unit: seconds
#>                             expr      min       lq     mean   median       uq
#>  convert(net, to_spatial_smooth) 3.132143 3.406496 3.499677 3.488304 3.659073
#>       max neval
#>  3.711856    10

luukvdmeer commented 3 years ago

In the most recent release some improvements have been made again. There is still a bug left, but this is documented in a separate issue, see #112

Of course, new issues may still arise after the function gets used more often, but I think the current implementation is enough to close this one.

luukvdmeer / sfnetworks

Morpher to remove pseudo-nodes #70