ropensci / stplanr

Sustainable transport planning with R
https://docs.ropensci.org/stplanr
Other
417 stars 66 forks source link

rnet_combine_short_segments #359

Open Robinlovelace opened 4 years ago

Robinlovelace commented 4 years ago

For some types of analysis on networks, very short segments are undesirable. As discussed with @agila5, it would be useful to have a function that could go in the opossite direction of rnet_breakub_vertices() and join short segments (to the shortest segment it touches is plan A).

Robinlovelace commented 4 years ago

Idea: could also create a function that divides segments that are above a threshold length.

mpadge commented 4 years ago

That's a really good idea. I would suggest using either a threshold proportion - so merge the shortest X% of segments, or a maximal threshold of numbers of segments, giving absolute control over edge reduction in a graph. I would also suggest that threshold length is perhaps not so useful, because, among other things, this varies in the OSM world depending on how well-mapped a location is, and that itself is a cultural phenomenon. So the effect of varying length thresholds would -- potentially -- end up being partly determined by cultural phenomena, which is ... non-ideal

Robinlovelace commented 4 years ago

Suggestion: user should specificy minimum threshold length, which they could do some analysis to find.

Robinlovelace commented 4 years ago

Demo of way to split long segments:

library(stplanr)
l = osm_net_example
l$length_units = sf::st_length(l)
l$length_ = as.numeric(l$length_units)
hist(log(l$length))

# biggest 
l_longest = l[which.max(l$length), ]
l_points1 = sf::st_sample(l_longest, size = 2)
l_points2 = sf::st_sample(l_longest, size = 2, type = "regular")
l_points3 = sf::st_sample(l_longest, size = 1, type = "regular")
l_points4 = sf::st_point_on_surface(l_longest)
# l_points2 = sf::st_line_sample(l_longest, n = 3) # fails
# l_points2 = sf::st_segmentize(l_longest, dfMaxLength = l_longest$length_units / 3)
# l_longest_segs = lwgeom::st_geod_segmentize(l_longest, max_seg_length = units::set_units(0.0000001, "rad"))
# l_points4 = line_midpoint(l_longest) # fails

plot(l_longest$geometry)
plot(l_points1, add = T)
plot(l_points2, add = T, col = "red")
plot(l_points3, add = T, col = "blue")
plot(l_points4, add = T, col = "green")
l_split = lwgeom::st_split(x = l_longest, y = l_points4) %>% 
  sf::st_collection_extract(type = "LINESTRING")
plot(l_split, col = 3:4)
agila5 commented 4 years ago

At the moment I'm working on this issue since it's the biggest problem for my statistical model, so this is an update on what I've done. Reprex of the state so far:

# packages
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(ggplot2)
library(stplanr)
library(igraph)
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union
library(lwgeom)
#> Linking to liblwgeom 2.5.0dev r16016, GEOS 3.6.1, PROJ 4.9.3
library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

# user defined function
read_from_gpkg <- function(path, options = c("-f", "GPKG", "lines")) {
  gpkg_file <- paste0(tempfile(), ".gpkg")
  gdal_utils(
    util = "vectortranslate",
    source = path, 
    destination = gpkg_file, 
    options = options
  )
  res <- st_read(gpkg_file, quiet = TRUE)
  names(res)[which(names(res) == "geom")] <- "geometry"
  st_geometry(res) <- "geometry"
  res
}

# download .osm.pbf for greater london
download.file("https://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf", "greater-london-latest.osm.pbf", mode = "wb")

and, following https://github.com/ITSLeeds/geofabric/issues/12,

# read .osm.pbf data
greater_london <- read_from_gpkg(
  path = "greater-london-latest.osm.pbf",
  options = c(
    "-f", "GPKG", 
    "-select", "highway", # SQL select
    "-where", "highway IN ('motorway', 'motorway_link', 'trunk', 'trunk_link', 
    'primary', 'primary_link', 'secondary', 'secondary_link', 'tertiary', 
    'tertiary_link')", # SQL where
    "-t_srs", "EPSG:27700", # st_transform
    "lines"
  )
)
greater_london
#> Simple feature collection with 34356 features and 1 field
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 503245.6 ymin: 155814.2 xmax: 561650.5 ymax: 201530.4
#> epsg (SRID):    27700
#> proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
#> First 10 features:
#>       highway                       geometry
#> 1     primary LINESTRING (526262.5 191956...
#> 2     primary LINESTRING (525236.4 190755...
#> 3     primary LINESTRING (526324.4 192518...
#> 4     primary LINESTRING (524912.4 190235...
#> 5     primary LINESTRING (527283.9 198571...
#> 6       trunk LINESTRING (518638.6 181066...
#> 7  trunk_link LINESTRING (518536.3 182623...
#> 8       trunk LINESTRING (523237.5 187566...
#> 9       trunk LINESTRING (528953.5 191787...
#> 10      trunk LINESTRING (530435.7 192081...

# plot
ggplot(greater_london) + 
  geom_sf()

and it seems reasonable I think. Now I need to clean the network a little bit since I cannot combine short segments belonging to different unconnected subgraphs (but it's not important for this comment).

greater_london <- stplanr::rnet_breakup_vertices(greater_london)
#> Splitting rnet object at the intersection points between nodes and internal vertexes
#> Splitting rnet object at the duplicated internal vertexes

# remove isolated subgraphs
greater_london_graph <- graph_from_adj_list(st_touches(greater_london))
greater_london_membership <- components(greater_london_graph)$membership
table(greater_london_membership) # wel...
#> greater_london_membership
#>     1     2     3     4     5     6     7     8 
#> 36580     3     1     1     1     1     1     1

greater_london <- greater_london[greater_london_membership == 1, ]

Now I have to check the segments length

segments_length <- st_length(greater_london)
round(quantile(segments_length, probs = seq(0, 1, 0.1)), 3)
#> Units: [m]
#>       0%      10%      20%      30%      40%      50%      60%      70% 
#>    0.060   11.369   18.092   25.631   34.809   48.423   71.383  113.273 
#>      80%      90%     100% 
#>  192.918  351.544 6313.991
hist(segments_length, breaks = 50)

and, IMO, it make sense to set the max LINESTRING distance at 350m. So now I have to split long segments

example <- greater_london[which.max(segments_length), ]
example <- st_segmentize(example, dfMaxLength = units::set_units(350, "m"))
st_length(example)
#> 6313.991 [m]

and these are the points where to split the long segment

split_points <- st_line_sample(example, density = 1/350) %>% st_cast("POINT")

# plot
ggplot(example) + 
  geom_sf() + 
  geom_sf(data = split_points)

So I can split now

result <- st_collection_extract(st_split(example, split_points), "LINESTRING")
result
#> Simple feature collection with 1 feature and 1 field
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 551703.4 ymin: 192970.9 xmax: 556335.9 ymax: 197179.5
#> epsg (SRID):    27700
#> proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
#>      highway                       geometry
#> 171 motorway LINESTRING (556335.9 192970...

st_geometry(result) == st_geometry(example) # doesn't work, why? 
#> [1] TRUE

but it's clear that it doesn't work. I think that the problem lies here

st_intersects(example, split_points)
#> Sparse geometry binary predicate list of length 1, where the predicate was `intersects'
#>  1: (empty)

as suggested here and here. I read from @Robinlovelace previous comment that there exists a function called st_point_on_surface() but the problem is that I can't understand how to use it (and there is no clear documentation neither in sf help page or in other SO question). For example

split_point_on_surface <- st_point_on_surface(example)
#> Warning in st_point_on_surface.sf(example): st_point_on_surface assumes
#> attributes are constant over geometries of x
st_intersects(example, split_point_on_surface)
#> Sparse geometry binary predicate list of length 1, where the predicate was `intersects'
#>  1: 1

but I don't know how that point is chosen and how to select the points in such a way that the max length is approx 350. Any suggestion?

Now I tried with several approaches. The first one uses st_nearest_points() function

split_points_nearest <- st_nearest_points(example, split_points %>% st_cast("POINT"))

but still

st_intersects(example, split_points_nearest)
#> Sparse geometry binary predicate list of length 1, where the predicate was `intersects'
#>  1: (empty)

so I set a precision value thinking that it may help solving the rounding approximation problem

st_precision(greater_london) <- 1
example <- greater_london[which.max(segments_length), ]
st_length(example) # it was 6313.991 before rounding
#> 6314.393 [m]

# points where to split the long segment
split_points <- st_line_sample(example, density = 1/350) %>% st_cast("POINT")

# st_nearst_points approach
split_points_nearest <- st_nearest_points(example, split_points)

but still ...

# lengths(st_intersects(example, split_points_nearest))
#> [1] 0

and even if I split

result <- st_collection_extract(st_split(example, split_points_nearest), "LINESTRING")
result
#> Simple feature collection with 1 feature and 1 field
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 551703 ymin: 192971 xmax: 556336 ymax: 197180
#> epsg (SRID):    27700
#> proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
#>      highway                       geometry
#> 171 motorway LINESTRING (556336 192971, ...

If I decrese the precision value (i.e. I allow for bigger rounding) then

st_precision(greater_london) <- 0.004
example <- greater_london[which.max(segments_length), ]
st_length(example) # it was 6313.991 before rounding
#> 7533.891 [m]

# points where to split the long segment
split_points <- st_line_sample(example, density = 1/350) %>% st_cast("POINT")

# st_nearst_points approach
split_points_nearest <- st_nearest_points(example, split_points)
lengths(st_intersects(example, split_points_nearest)) == length(split_points_nearest)
#> [1] TRUE

but still

result <- st_collection_extract(st_split(example, split_points_nearest), "LINESTRING")
result
#> Simple feature collection with 2 features and 1 field
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 551750 ymin: 193000 xmax: 556250 ymax: 197250
#> epsg (SRID):    27700
#> proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
#>        highway                       geometry
#> 171   motorway LINESTRING (556250 193000, ...
#> 171.1 motorway LINESTRING (554532.7 194467...

Another idea taken from one of the GIS SO post above was to create a small buffer around the split points

split_points_buffer <- st_buffer(
  split_points, 
  dist = units::set_units(1e-9, "m")
)

st_intersects(example, split_points_buffer)
#> Sparse geometry binary predicate list of length 1, where the predicate was `intersects'
#>  1: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ...

but it doesn't work

result <- st_collection_extract(st_split(example, split_points_buffer), "LINESTRING")
result
#> Simple feature collection with 45 features and 1 field
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 551750 ymin: 193000 xmax: 556250 ymax: 197250
#> epsg (SRID):    27700
#> proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
#> First 10 features:
#>        highway                       geometry
#> 171   motorway LINESTRING (556250 193000, ...
#> 171.1 motorway LINESTRING (556250 193171.2...
#> 171.2 motorway LINESTRING (556250 193171.2...
#> 171.3 motorway LINESTRING (556000 193263.7...
#> 171.4 motorway LINESTRING (556000 193263.7...
#> 171.5 motorway LINESTRING (555893.9 193500...
#> 171.6 motorway LINESTRING (555893.9 193500...
#> 171.7 motorway LINESTRING (555609.6 193640...
#> 171.8 motorway LINESTRING (555609.6 193640...
#> 171.9 motorway LINESTRING (555312.5 193750...
round(st_length(result), 2)
#> Units: [m]
#>  [1] 171.22   0.00 342.45   0.00 342.45   0.00 342.45   0.00 342.45   0.00
#> [11] 342.45   0.00 342.45   0.00 342.45   0.00 342.45   0.00 342.45   0.00
#> [21] 342.45   0.00 342.45   0.00 342.45   0.00 342.45   0.00 342.45   0.00
#> [31] 342.45   0.00 342.45   0.00 342.45   0.00 342.45   0.00 342.45   0.00
#> [41] 342.45   0.00 342.45   0.00 171.22

since each buffer intersects the example line in two points. The last idea that I could think of is the following: I look for exactly which of the POINTS of example are nearest to each of the split_point

idx <- st_nearest_feature(split_points, st_cast(example, "POINT"))
#> Warning in st_cast.sf(example, "POINT"): repeating attributes for all sub-
#> geometries for which they may not be constant

and then I split according to these points

splits_points2 <- st_cast(example, "POINT")[idx, ]
#> Warning in st_cast.sf(example, "POINT"): repeating attributes for all sub-
#> geometries for which they may not be constant

result <- st_collection_extract(st_split(example, splits_points2), "LINESTRING")
result
#> Simple feature collection with 1 feature and 1 field
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 551750 ymin: 193000 xmax: 556250 ymax: 197250
#> epsg (SRID):    27700
#> proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
#>      highway                       geometry
#> 171 motorway LINESTRING (556250 193000, ...

It doesn't work and I have no idea why. I think it may be related to the accuracy set

st_precision(greater_london) <- 0
example <- greater_london[which.max(segments_length), ]
split_points <- st_line_sample(example, density = 1/350) %>% st_cast("POINT")
idx <- st_nearest_feature(split_points, st_cast(example, "POINT"))
#> Warning in st_cast.sf(example, "POINT"): repeating attributes for all sub-
#> geometries for which they may not be constant
splits_points2 <- st_cast(example, "POINT")[idx, ]
#> Warning in st_cast.sf(example, "POINT"): repeating attributes for all sub-
#> geometries for which they may not be constant

# and split 
result <- st_collection_extract(st_split(example, splits_points2), "LINESTRING")
result
#> Simple feature collection with 19 features and 1 field
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 551703.4 ymin: 192970.9 xmax: 556335.9 ymax: 197179.5
#> epsg (SRID):    27700
#> proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
#> First 10 features:
#>        highway                       geometry
#> 171   motorway LINESTRING (551945.4 197037...
#> 171.1 motorway LINESTRING (556335.9 192970...
#> 171.2 motorway LINESTRING (556252.9 193087...
#> 171.3 motorway LINESTRING (555973.5 193377...
#> 171.4 motorway LINESTRING (555696.9 193562...
#> 171.5 motorway LINESTRING (555447.6 193708...
#> 171.6 motorway LINESTRING (555069.2 193959...
#> 171.7 motorway LINESTRING (554789.2 194173...
#> 171.8 motorway LINESTRING (554474 194429.6...
#> 171.9 motorway LINESTRING (554248.1 194624...
st_length(result)
#> Units: [m]
#>  [1] 280.3949 143.4941 403.6134 332.8478 288.6384 454.6112 352.1257 406.1746
#>  [9] 298.4489 201.5495 468.2519 383.1759 149.2269 496.1094 314.1785 301.7933
#> [17] 383.5335 500.6116 155.2112

ggplot(result) + 
  geom_sf(col = seq_len(nrow(result)), size = 1)

It seems to work but still I think I'm missing something and I'm not 100% fine with this approach. Any help or suggestion?

Created on 2019-12-16 by the reprex package (v0.3.0)

agila5 commented 4 years ago

I think that this is the best approach so far. At the beginning I repeat the same stuff as before.

# packages
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(ggplot2)
library(stplanr)
library(igraph)
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union
library(lwgeom)
#> Linking to liblwgeom 2.5.0dev r16016, GEOS 3.6.1, PROJ 4.9.3
library(stats19)
#> Data provided under OGL v3.0. Cite the source and link to:
#> www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

# THE SAME STUFF AS BEFORE

# Now I have to check for segments length
segments_length <- st_length(greater_london)
round(quantile(segments_length, probs = seq(0, 1, 0.1)), 3)
#> Units: [m]
#>       0%      10%      20%      30%      40%      50%      60%      70% 
#>    0.060   11.369   18.092   25.631   34.809   48.423   71.383  113.273 
#>      80%      90%     100% 
#>  192.918  351.544 6313.991
hist(segments_length, breaks = 50)

and, IMO, it makes sense to set max distance at 350m. So now I have to split the long segments. A first example is

example <- greater_london[which.max(segments_length), ]
example <- st_segmentize(example, dfMaxLength = units::set_units(350, "m"))
st_length(example)
#> 6313.991 [m]

Now the idea is to use st_linesubstringfunction from lwgeom. The description of that function says: "get substring from linestring" and the parameters are x (the linestring), from(relative distance from the origin, in [0,1]), to (relative distance from the origin in [0, 1]), and tolerance, which is used when to snap nodes to lines (at least that's what I understood from the description). For example

st_linesubstring(example, 0, 0.2)
#> Simple feature collection with 1 feature and 1 field
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 555368.2 ymin: 192970.9 xmax: 556335.9 ymax: 193758.7
#> epsg (SRID):    27700
#> proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
#>      highway                       geometry
#> 171 motorway LINESTRING (556335.9 192970...

The problem is that the following doesn't work

st_linesubstring(example, c(0, 0.1), c(0.1, 0.2))
#> Error in CPL_linesubstring(x, from, to, tolerance): Expecting a single value: [extent=2].

so I need to manually define a sequence of from and to and loop through all the values. The length of the sequence should be such that the distance between one point and the next point is equal to the threshold, i.e.

steps <- seq(0, 1, by = 350 / units::drop_units(st_length(example)))
steps
#>  [1] 0.00000000 0.05543245 0.11086490 0.16629735 0.22172980 0.27716226
#>  [7] 0.33259471 0.38802716 0.44345961 0.49889206 0.55432451 0.60975696
#> [13] 0.66518941 0.72062187 0.77605432 0.83148677 0.88691922 0.94235167
#> [19] 0.99778412

the problem is that the last part of the original LINESTRING is droppped so I need to fix that manually

steps[length(steps)] <- 1

There is a problem here for all the segments that are slighlty longer than the threshold. For example if st_length(x) is 400m then

seq(0, 1, 350 / 400)
#> [1] 0.000 0.875

which should become c(0, 1), but also if st_length(x) = 600

seq(0, 1, 350 / 600)
#> [1] 0.0000000 0.5833333

which also turns into c(0, 1). Is this a big problem? IMO no, but it implies that I should lower the threshold since, in this case, the "real" threshold is 2 times the original threshold.

Anyway now I can create the sequence of from and to

from <- steps[-length(steps)]
to <- steps[-1]

loop through all the values

results <- vector("list", length(from))
for(i in seq_along(from)) {
  results[[i]] <- st_linesubstring(example, from[i], to[i])
}

and recreate back the original LINESTRING

result <- do.call("rbind", results)
st_length(result)
#> Units: [m]
#>  [1] 350.000 350.000 350.000 350.000 350.000 350.000 350.000 350.000 350.000
#> [10] 350.000 350.000 350.000 350.000 350.000 350.000 350.000 350.000 363.991

This is the result which looks good IMO

ggplot(result) + 
  geom_sf(col = seq(1, nrow(result)))

So now I need to exclude the original example LINESTRING from london network data and append to that dataset these new linestrings

greater_london2 <- greater_london[-which.max(segments_length), ]
greater_london2 <- rbind(greater_london2, result)

greater_london2_graph <- graph_from_adj_list(st_touches(greater_london2))
greater_london2_membership <- components(greater_london2_graph)$membership
table(greater_london2_membership)
#> greater_london2_membership
#>     1 
#> 36597

It still preserves the original structure of the graph! So now I will define a function for repeating these procedure for all the "too-long" LINESTRINGS.

my_split_long_linestrings <- function(x, max_length) {
  # define steps length
  steps <- seq(0, 1, by = max_length / units::drop_units(st_length(x)))

  # fix that problem of last LINESTRINGS part
  steps[length(steps)] <- 1

  # now I can create the sequence of from and to 
  from <- steps[-length(steps)]
  to <- steps[-1]

  # and loop through all the values 
  results <- vector("list", length(from))
  for(i in seq_along(from)) {
    results[[i]] <- st_linesubstring(x, from[i], to[i])
  }

  # and recreate back the original LINESTRING
  result <- do.call("rbind", results)
  result
}

For example

my_split_long_linestrings(example, 350)
#> Simple feature collection with 18 features and 1 field
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 551703.4 ymin: 192970.9 xmax: 556335.9 ymax: 197179.5
#> epsg (SRID):    27700
#> proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
#> First 10 features:
#>       highway                       geometry
#> 171  motorway LINESTRING (556335.9 192970...
#> 1711 motorway LINESTRING (556117.7 193243...
#> 1712 motorway LINESTRING (555849.3 193466...
#> 1713 motorway LINESTRING (555548.9 193646...
#> 1714 motorway LINESTRING (555252.7 193832...
#> 1715 motorway LINESTRING (554968.2 194036...
#> 1716 motorway LINESTRING (554691.7 194251...
#> 1717 motorway LINESTRING (554422.1 194474...
#> 1718 motorway LINESTRING (554160.2 194706...
#> 1719 motorway LINESTRING (553914.3 194955...

Now i can apply the same all idea to all "too-long segments"

idx_long_segments <- which(segments_length > units::set_units(350 / 2, "m"))
greater_london_long_segments <- greater_london[idx_long_segments, ]
greater_london_short_segments <- greater_london[-idx_long_segments, ]

results <- vector("list", nrow(greater_london_long_segments))

for(i in seq_len(nrow(greater_london_long_segments))) {
  results[[i]] <- my_split_long_linestrings(greater_london_long_segments[i, ], max_length = 175)
  if(i %% 1000 == 0) print(i)
}
#> [1] 1000
#> [1] 2000
#> [1] 3000
#> [1] 4000
#> [1] 5000
#> [1] 6000
#> [1] 7000

and merge the results

results <- do.call("rbind", results) # takes a while so I will need to rethink these steps
greater_london3 <- rbind(greater_london_short_segments, results)

Now i can check the results

greater_london3
#> Simple feature collection with 44326 features and 1 field
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 503245.6 ymin: 155814.2 xmax: 561650.5 ymax: 201530.4
#> epsg (SRID):    27700
#> proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
#> First 10 features:
#>         highway                       geometry
#> 1       primary LINESTRING (526262.5 191956...
#> 2       primary LINESTRING (525236.4 190755...
#> 3       primary LINESTRING (526324.4 192518...
#> 6         trunk LINESTRING (518638.6 181066...
#> 7    trunk_link LINESTRING (518536.3 182623...
#> 8         trunk LINESTRING (523237.5 187566...
#> 9         trunk LINESTRING (528953.5 191787...
#> 12      primary LINESTRING (525279.4 195890...
#> 12.1    primary LINESTRING (525301.5 195873...
#> 13      primary LINESTRING (524806.2 196292...

ggplot(greater_london3) + 
  geom_sf()

which seems reasonable and

greater_london_graph3 <- graph_from_adj_list(st_touches(greater_london3))
greater_london_membership <- components(greater_london_graph)$membership
table(greater_london_membership)
#> greater_london_membership
#>     1     2     3     4     5     6     7     8 
#> 36580     3     1     1     1     1     1     1

Unfortunately it kinda ruins the graph component and I don't know why (probably some approximation problem). More importantly

segments_length <- st_length(greater_london3)
round(quantile(segments_length, probs = seq(0, 1, 0.1)), 3)
#> Units: [m]
#>      0%     10%     20%     30%     40%     50%     60%     70%     80%     90% 
#>   0.060  12.702  21.213  31.086  45.858  73.414 129.359 175.000 175.000 235.079 
#>    100% 
#> 349.983
hist(segments_length, breaks = 50)

That peak is an obvious consequence of that choice. I don't think that's a problem but, it that case, we can solve that problem choosing a "random" max_length, i.e. max_length = rnorm(1, 350, something).

Created on 2019-12-16 by the reprex package (v0.3.0)

Anyway I will wait for your feedbacks now because maybe I'm missing some obvious solution and tomorrow I will work on the "combine short segments" problem.

Robinlovelace commented 4 years ago

Hi @agila5 thanks for the great reproducible examples and analysis. Without having tested it, I think this solution looks elegant and effective:

my_split_long_linestrings <- function(x, max_length) {
  # define steps length
  steps <- seq(0, 1, by = max_length / units::drop_units(st_length(x)))

  # fix that problem of last LINESTRINGS part
  steps[length(steps)] <- 1

  # now I can create the sequence of from and to 
  from <- steps[-length(steps)]
  to <- steps[-1]

  # and loop through all the values 
  results <- vector("list", length(from))
  for(i in seq_along(from)) {
    results[[i]] <- st_linesubstring(x, from[i], to[i])
  }

  # and recreate back the original LINESTRING
  result <- do.call("rbind", results)
  result
}

The histogram results shows it's working. Is it all still routable? In summary: great work!

OwenTheAnalyst commented 3 years ago

Hello, Fairly new R user still having my mind blown by what can be done. I've found this really helpful example when trying to split any segments of road longer than 200m in a dataset I have from Ordnance Survey.

I've followed the code as above and have ended up with the following steps:

idx_long_segments <- which(segments_length > units::set_units(400 / 2, "m"))
oxford_long_segments <- oxford_wgs[idx_long_segments, ]
oxford_short_segments <- oxford_wgs[-idx_long_segments, ]

results <- vector("list", nrow(oxford_long_segments))

for(i in seq_len(nrow(oxford_long_segments))) {
  results[[i]] <- my_split_long_linestrings(oxford_long_segments[i, ], max_length = 200)
  if(i %% 1000 == 0) print(i)
} 

When running the segment I unfortunately get an error, which I can't understand: Error in CPL_linesubstring(x, from, to, tolerance) : geometry should be of LINE type

I've re-read the example given and the geometry of the line segments is the same as mine - LINESTRING. I've tried reprojecting from OSGB to WGS and this makes no difference either - still get the same error.

Really grateful of any help/advice on how to fix this! Thanks in advance...

agila5 commented 3 years ago

Hi @OwenTheAnalyst! I'm really happy that you are using this code (even if I completely forgot about this project, sorry). Anyway, I'm sorry but it's quite difficult to help you at the moment. Can you please add a reproducible example showing the code and the error?

OwenTheAnalyst commented 3 years ago

Morning @agila5 - thanks so much for the prompt response! In the process of trying to pull together the example I actually found my answer. For some reason the file I had created when importing the shapefile had created a number of rows with lists of geometries rather than a single geometry per row - which was then causing the error as the function is obviously expecting a single linestring per row (I think!). So all sorted now and ran so much quicker than the previous way I had tried to find to sort this out.

I would be really interested in hearing from anyone who managed to complete the original task in this thread - and do the reverse and add together linestrings smaller than x.

Thanks again for the response!

agila5 commented 3 years ago

Hello again @OwenTheAnalyst

In the process of trying to pull together the example I actually found my answer.

That's the best feature of a reprex!

I would be really interested in hearing from anyone who managed to complete the original task in this thread - and do the reverse and add together linestrings smaller than x.

Today I run some tests, and I think that a cleaner solution for this problem could be implemented as follows (using a combination of st_geod_segmentize and st_subdivide):

  1. Load packages
library(sf)
#> Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
library(lwgeom)
#> Linking to liblwgeom 3.0.0beta1 r16016, GEOS 3.9.0, PROJ 7.2.1
  1. Load data
temp_path <- tempfile(fileext = ".osm.pbf")
download.file(
  url = "http://download.openstreetmap.fr/extracts/europe/vatican_city.osm.pbf", 
  destfile = temp_path, 
  mode = "wb"
)
  1. Read-in data
vatican <- st_read(temp_path, layer = "lines")
#> Reading layer `lines' from data source 
#>   `C:\Users\Utente\AppData\Local\Temp\RtmpOAG7Lh\file2e4cfc93b78.osm.pbf' 
#>   using driver `OSM'
#> Simple feature collection with 2424 features and 9 fields
#> Geometry type: LINESTRING
#> Dimension:     XY
#> Bounding box:  xmin: 12.4151 ymin: 41.88835 xmax: 12.48316 ymax: 41.91546
#> Geodetic CRS:  WGS 84
  1. Spatial plot
par(mar = rep(0, 4))
plot(st_geometry(vatican))

  1. Plot histogram of lengths
par(mar = c(5, 4, 4, 2))
hist(st_length(vatican), breaks = 20)

  1. Add one (internal) point every 250m
vatican <- st_geod_segmentize(vatican, units::set_units(250, "m"))
  1. (Sub)divide setting 5 vertices per output
vatican <- st_subdivide(vatican, 5) %>% st_collection_extract("LINESTRING")
  1. Plot (it looks more or less the same as before)
par(mar = rep(0, 4))
plot(st_geometry(vatican))

  1. But
par(mar = c(5, 4, 4, 2))
hist(st_length(vatican), breaks = 20)

Created on 2021-05-02 by the reprex package (v2.0.0)

The main downside is that you lose all fields related to the input data. Moreover, I'm not sure how (or if) it scales with bigger input objects.

Robinlovelace commented 3 years ago

Great reproducible example @agila5, thanks for reviving this issue @OwenTheAnalyst which seems to have become more of a discussion than a clearly defined issue.

There is a highly efficient implementation of this in GRASS which can be accessed via the relatively new (developed after this issue was opened) qgisprocess package https://github.com/paleolimbot/qgisprocess/issues/26

output = qgis_run_algorithm(
  algorithm = "grass7:v.split",
  input = rnet_10,
  length = 500
)

How should we proceed with this issue @agila5 ? What is the actionable output? I imagine that creating a function that wraps that could solve the problem, e.g.:

rnet_breakup_split_distances(rnet, length) {
  # ...
}

Could be good. Regarding re-joining them I think aggregate/dplyr functions have everything needed, no?

Tentative suggestion: move this issue to a discussion and open a new issue on split_distances() when qgisprocess is on CRAN...

agila5 commented 3 years ago

How should we proceed with this issue? What is the actionable output? I imagine that creating a function that wraps that could solve the problem

LGTM!