spiegel-data / 2019-02-tempolimit

Zusammenhang zwischen Tempolimit und Autobahnunfällen

Performance of "Verkehrsdichte" #4

Closed nuntius35 closed 5 years ago

nuntius35 commented 5 years ago

Matching the count locations to the motorways seems to be the computationally most expensive step. On my laptop it takes about 15 minutes, which I find totally unacceptable. Let me suggest more efficient code. The code is slightly longer, but on my laptop it runs in less than 5 s! The trick is to compute the Voronoi tessellation first: each polygon of the Voronoi tessellation around a count location contains all points that are closest to that count location. Hence, if a way lies entirely inside such a polygon, the closest count location is already found. If a way belongs to several Voronoi polygons (this happens for 2100 out of 63347 ways), we simply search for the closest count location, which in such cases will be very near.

Here is the code to replace your lines 107 to 119 (the mask is not needed).

# Compute the Voronoi tessellation of the counting locations
traffic_voronoi = traffic_2017  %>%
  st_geometry(.) %>%
  st_union(.) %>%
  st_voronoi(.) %>%
  st_collection_extract(.)

# Match ways with their corresponding Voronoi region
count_locations = st_intersects(data_OSM_filtered, traffic_voronoi)
voronoi_index = st_intersects(traffic_voronoi, traffic_2017)

# If a way lies in only one region, this is the nearest counting location
unique_count = (lengths(count_locations) == 1)
closest_count_location_1 = data_OSM_filtered[unique_count,] %>% 
    mutate(nearest_counter_id = as.numeric(voronoi_index[unlist(count_locations[unique_count])]))

# Else a way intersects more than one Voronoi region, find the closest
closest_count_location_2 = data_OSM_filtered[!unique_count,] %>% 
    mutate(nearest_counter_id = st_nearest_feature(., traffic_2017))

# Concatenate the results
closest_count_location = rbind(closest_count_location_1, closest_count_location_2)
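To illustrate the idea outside of R/sf, here is a minimal Python/NumPy sketch (with made-up coordinates). It uses scipy's `cKDTree` instead of an explicit Voronoi tessellation; the two are equivalent for this purpose, since the nearest-neighbour regions of the count locations are exactly their Voronoi cells. A way whose vertices all share the same nearest count location lies in a single cell, so the match is immediate; otherwise we fall back to picking the location closest to any vertex.

```python
# Illustrative sketch only (Python/NumPy, not the repo's R code).
# All coordinates and names below are hypothetical.
import numpy as np
from scipy.spatial import cKDTree

# Three count locations and two "ways" (polylines given by their vertices)
count_locations = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 8.0]])
ways = [
    np.array([[1.0, 0.5], [2.0, 1.0]]),  # lies entirely in one Voronoi cell
    np.array([[4.0, 0.0], [6.0, 0.0]]),  # straddles two Voronoi cells
]

tree = cKDTree(count_locations)

def nearest_counter(way):
    # Nearest count location for every vertex of the way
    dists, idx = tree.query(way)
    if len(set(idx)) == 1:
        # All vertices share one nearest location: the way lies in
        # a single Voronoi cell, so this location is the match.
        return idx[0]
    # Otherwise fall back to the location closest to any vertex
    return idx[np.argmin(dists)]

matches = [int(nearest_counter(w)) for w in ways]
print(matches)
```

The fast path (one Voronoi cell) avoids any distance comparison across all count locations, which is where the speed-up in the R code above comes from as well.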
nuntius35 commented 5 years ago

I am sorry, maybe my previous code was stupid. What I want to say is this:

The code

closest_count_location = data_OSM_filtered %>% 
    # keep only road sections in the Bundesländer for which we have accident data
    st_intersection(mask_2017) %>%
    mutate(nearest_counter_id = st_nearest_feature(st_zm(.), traffic_2017))

takes 15 minutes for me, whereas the code

closest_count_location = data_OSM_filtered %>% 
    mutate(nearest_counter_id = st_nearest_feature(st_zm(.), traffic_2017))

takes a few seconds. So for me, only the mask is the problem. As I mentioned in issue #2, I am running this code with the original shapefiles, not with the simplified ones. So maybe there is no problem at all.

PatrickStotz commented 5 years ago

I totally see your point. Sounds like this improves performance a lot. But the main purpose of this repository is transparency and intelligibility. The proposed code is harder to understand, and since I'm not a hundred percent familiar with all the steps you're performing, I can't guarantee its accuracy. I have thus decided to keep it simple and slow instead of more complex and faster. But thanks for the suggestion. Much appreciated!

nuntius35 commented 5 years ago

With the simplified shapefiles, the code in question now runs in about 100 s for me. That's much better than the 15 min I got on the first try.

I agree that good readability of the code is important.