ropensci / stplanr

Sustainable transport planning with R
https://docs.ropensci.org/stplanr

Test `rnet_join()` on large datasets #509

Closed Robinlovelace closed 11 months ago

Robinlovelace commented 1 year ago

Thinking: rnet_x is the larger OSM network. Values from rnet_y will be joined onto OSM geometries.

Early step: reducing the size of large OSM networks, here:

https://github.com/ropensci/stplanr/blob/5cf154f56d95a4a8e4a0cd70c3793632b882eaef/R/rnet_join.R#L70-L72

temospena commented 1 year ago

I made a test:

rnet_x = transport_network #OSM_id is the first column
nrow(rnet_x) # 2184086
rnet_y = rnet_ferry4_overline_morethan100_clean %>% mutate(id = 1:nrow(rnet_ferry4_overline_morethan100_clean))
nrow(rnet_y) # 10590

time_i = Sys.time()
rnet_join = rnet_join(rnet_x, rnet_y)
Sys.time() - time_i
# Time difference of 3.987354 mins
nrow(rnet_join) # 85442

Robinlovelace commented 1 year ago

The next step, now that you have OSM ids associated with the attributes in rnet_y, is to group by OSM id, drop the buffer geometry, and run an inner join. That should be quicker. Example from the docs:

https://github.com/ropensci/stplanr/blob/5cf154f56d95a4a8e4a0cd70c3793632b882eaef/R/rnet_join.R#L49-L57
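As a rough illustration of that step (not the code behind the link above), here is a minimal sketch on toy data. The table contents and the `highway` column are invented for the example, and base R `split()`/`merge()` stand in for the dplyr verbs used elsewhere in this thread. It assumes the joined table has one row per match, with the OSM id, the attribute to transfer, and a `length_y` weight.

```r
# Toy stand-in for the output of the join: one row per (OSM segment, rnet_y piece).
# All values are made up for illustration.
joined <- data.frame(
  osm_id   = c(1L, 1L, 2L),
  Bike     = c(100, 200, 50),
  length_y = c(30, 10, 20)
)

# Group by OSM id and take the length-weighted mean of the attribute.
by_id <- split(joined, joined$osm_id)
bike_summary <- data.frame(
  osm_id = as.integer(names(by_id)),
  Bike = unname(vapply(
    by_id,
    function(d) weighted.mean(d$Bike, d$length_y, na.rm = TRUE),
    numeric(1)
  ))
)

# Toy OSM table without geometry; segment 3 has no match in the summary.
osm <- data.frame(osm_id = 1:3, highway = c("primary", "residential", "service"))

# Inner join: only OSM segments that received a value are kept.
osm_joined <- merge(osm, bike_summary, by = "osm_id")
osm_joined
```

Passing `all.x = TRUE` to `merge()` would turn this into a left join that keeps unmatched OSM segments with `NA` values.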

Robinlovelace commented 1 year ago

And see results on test network here: https://docs.ropensci.org/stplanr/reference/rnet_join.html#ref-examples

temospena commented 1 year ago

So, I did that process. The result is the OSM network with the flow attributes [Bike] (and some others that I need):

rnetj_summary = rnet_join %>%
  sf::st_drop_geometry() %>%
  group_by(osm_id) %>%
    summarise(
      Bike = weighted.mean(Bike, length_y, na.rm = TRUE),
      Total = weighted.mean(Total, length_y, na.rm = TRUE),
      new_cyc4 = weighted.mean(new_cyc4, length_y, na.rm = TRUE),
      cyc4 = weighted.mean(cyc4, length_y, na.rm = TRUE),
      new_cyc10 = weighted.mean(new_cyc10, length_y, na.rm = TRUE),
      cyc10 = weighted.mean(cyc10, length_y, na.rm = TRUE),
      )
osm_joined_rnet = left_join(rnet_x, rnetj_summary)
nrow(osm_joined_rnet) #2184086

sum(rnet_y$Bike) # 250545
sum(osm_joined_rnet$Bike, na.rm = TRUE) # 2112663

But the sum of the bike flow is not the same. I understand that it comes from splitting the road network in 5m segments (the default).

Robinlovelace commented 1 year ago

How about the total flow on the network, e.g.

sum(rnet_y$Bike * rnet_y$length)

?

Robinlovelace commented 1 year ago

And same for osm_joined_rnet. If those are out it looks like there's an issue.
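To see why the length-weighted total is the more meaningful comparison, here is a small sketch on invented numbers (plain data frames, no geometries). It assumes the ideal case in which every short segment inherits exactly the flow of the longer line it lies on.

```r
# One route-network line: 100 m long, carrying 50 bike trips (made-up values).
toy_y <- data.frame(id = 1, length = 100, Bike = 50)

# The same street in a finer network: four shorter segments, each with the same flow.
toy_osm <- data.frame(osm_id = 1:4, length = c(10, 20, 30, 40), Bike = 50)

# The plain sum of flows grows with the number of segments...
flow_sum_y   <- sum(toy_y$Bike)    # 50
flow_sum_osm <- sum(toy_osm$Bike)  # 200

# ...while flow multiplied by length is unchanged by the splitting.
flow_km_y   <- sum(toy_y$Bike * toy_y$length)      # 5000
flow_km_osm <- sum(toy_osm$Bike * toy_osm$length)  # 5000
```

So a difference in `sum(Bike)` alone is expected when the target network has many more segments, whereas a large difference in `sum(Bike * length)` suggests values are being assigned to segments they should not be.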

Robinlovelace commented 1 year ago

> I understand that it comes from splitting the road network in 5m segments (the default).

The network is not split into 5 m segments; the buffers around start and end points in rnet_x are 5 m.

temospena commented 1 year ago

> And same for osm_joined_rnet. If those are out it looks like there's an issue.

rnet_y$length = as.numeric(st_length(rnet_y))
sum(rnet_y$Bike * rnet_y$length) # 17115920
osm_joined_rnet$length = as.numeric(st_length(osm_joined_rnet))
sum(osm_joined_rnet$Bike * osm_joined_rnet$length, na.rm = TRUE) #47510538

In fact they are not the same.

temospena commented 1 year ago

But there could be an explanation for it, since there are many more segments than ids, right?

temospena commented 1 year ago

# use NEW rnet_join() function from stplanr ---------------------------------------------------

library(tidyverse)
library(sf)
library(stplanr)

rnet_x = readRDS(url("https://github.com/U-Shift/biclar/releases/download/0.0.1/transport_network.Rds")) #OSM_id is the first column
nrow(rnet_x) # 2184086
rnet_y = readRDS(url("https://github.com/U-Shift/biclar/releases/download/0.0.1/rnet_ferry4_overline_morethan100_clean.Rds")) 
rnet_y$id = seq_len(nrow(rnet_y))
nrow(rnet_y) # 10590

time_i = Sys.time()
rnet_join = rnet_join(rnet_x, rnet_y)
Sys.time() - time_i
# Time difference of 3.987354 mins
nrow(rnet_join) # 85442

rnetj_summary = rnet_join %>%
  sf::st_drop_geometry() %>%
  group_by(osm_id) %>%
  summarise(
    Bike = weighted.mean(Bike, length_y, na.rm = TRUE),
    Total = weighted.mean(Total, length_y, na.rm = TRUE),
    new_cyc4 = weighted.mean(new_cyc4, length_y, na.rm = TRUE),
    cyc4 = weighted.mean(cyc4, length_y, na.rm = TRUE),
    new_cyc10 = weighted.mean(new_cyc10, length_y, na.rm = TRUE),
    cyc10 = weighted.mean(cyc10, length_y, na.rm = TRUE),
  )
osm_joined_rnet = left_join(rnet_x, rnetj_summary)
nrow(osm_joined_rnet) #2184086

sum(rnet_y$Bike) # 250545
sum(osm_joined_rnet$Bike, na.rm = TRUE) # 2112663

rnet_y$length = as.numeric(st_length(rnet_y))
sum(rnet_y$Bike * rnet_y$length) # 17115920
osm_joined_rnet$length = as.numeric(st_length(osm_joined_rnet))
sum(osm_joined_rnet$Bike * osm_joined_rnet$length, na.rm = TRUE) #47510538
#are those the same? 

### reverse process - which in theory makes more sense

rnet_a = rnet_y
rnet_y = rnet_x # OSM network (OSM_id is the first column)
nrow(rnet_y) # 2184086
rnet_x = rnet_a # rnet from computed routes
nrow(rnet_x) # 10590
rm(rnet_a)

time_i = Sys.time()
osm_subset = rnet_subset(rnet_y, rnet_x) #reduce the large OSM network
Sys.time() - time_i # Time difference of 2.259305 mins
nrow(osm_subset) # 58876

time_i = Sys.time()
rnet_join2 = rnet_join(rnet_x, osm_subset, key_column = 10) # id - is this parameter working??
Sys.time() - time_i# Time difference of 1.233207 mins
nrow(rnet_join2) # 53298

rnetj_summary2 = rnet_join2 %>% # WHERE IS THE ID  #STOP HERE
  sf::st_drop_geometry() %>%
  group_by(id) %>%
  summarise(
    quietness = weighted.mean(quietness, length_y, na.rm = TRUE),
    carspeed = weighted.mean(car_speed, length_y, na.rm = TRUE),
    carspeed_max = max(car_speed, na.rm = TRUE),
  )
osm_joined_rnet2 = left_join(rnet_x, rnetj_summary2) # after the swap, rnet_x (computed routes) is the table that holds id
nrow(osm_joined_rnet2) #

Robinlovelace commented 11 months ago

Details:

image

Robinlovelace commented 11 months ago

Fixed result with max_angle_diff seemingly working:

image

Robinlovelace commented 11 months ago

A weird artefact I found:

image

Robinlovelace commented 11 months ago

image

Fixed with max_angle_diff = 20.
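For intuition about what an angle threshold of this kind does, here is a minimal base R sketch. It is not stplanr's implementation: `line_bearing()` and `angle_diff()` are hypothetical helpers, and the coordinates are invented. It assumes straight segments and treats bearings as undirected, so a line drawn east-to-west counts as parallel to one drawn west-to-east.

```r
# Bearing of a straight segment in degrees, folded into [0, 180)
# so that the direction of digitisation does not matter.
line_bearing <- function(x1, y1, x2, y2) {
  (atan2(y2 - y1, x2 - x1) * 180 / pi) %% 180
}

# Smallest angle between two undirected bearings, in [0, 90].
angle_diff <- function(a, b) {
  d <- abs(a - b) %% 180
  pmin(d, 180 - d)
}

# Target segment: an east-west street.
target_bearing <- line_bearing(0, 0, 100, 0)

# Candidate segments that all fall near the target.
candidates <- data.frame(
  id = c("parallel", "slightly_off", "side_street"),
  bearing = c(
    line_bearing(0, 3, 100, 3),     # runs alongside the target
    line_bearing(0, 0, 100, 10),    # a few degrees off
    line_bearing(50, -20, 50, 20)   # crosses the target at a right angle
  ),
  stringsAsFactors = FALSE
)
candidates$diff <- angle_diff(candidates$bearing, target_bearing)

# Keep only candidates within the threshold.
max_angle_diff <- 20
kept <- candidates[candidates$diff <= max_angle_diff, ]
kept$id
```

With a threshold of 20 degrees the perpendicular side street is dropped, which is the kind of crossing-street artefact an angle filter is meant to remove.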

Robinlovelace commented 11 months ago

The function has now been tested. Not the fastest. But works!