jamespfennell / transiter

Web service for transit data
https://demo.transiter.dev
MIT License

Slow queries when using POST /stops endpoint for large distances #85

Closed. cedarbaum closed this issue 1 year ago.

cedarbaum commented 2 years ago

I have a use case where I want to get a list of all or most stops (plus their service maps), ordered from nearest to farthest. Though probably not exactly what this API was designed for, I've been using the POST /stops endpoint with a large distance parameter to do this. It works, but it becomes slow when many stops are included - I've put some basic measurements below.

I believe I found the main query that becomes slow with many stops (also included below). It's not immediately obvious to me whether this can easily be improved (e.g., with additional indexing or a different query structure) - do you have any ideas, or is this fundamentally difficult to optimize given the DB structure?

On a side note, this is a really cool project and I've enjoyed using it!

Performance data:

# 1000 meters
❯ curl -w "@curl-format.txt" -o /dev/null -s --compressed -X POST $TRANSITER_SERVER'/systems/us-ny-subway/stops/?latitude=$lat&longitude=$lon&distance=1000'
     time_namelookup:  0.000015s
        time_connect:  0.000081s
     time_appconnect:  0.000000s
    time_pretransfer:  0.000120s
       time_redirect:  0.000000s
  time_starttransfer:  0.117868s
                     ----------
          time_total:  0.117938s

# 5000 meters
❯ curl -w "@curl-format.txt" -o /dev/null -s --compressed -X POST $TRANSITER_SERVER'/systems/us-ny-subway/stops/?latitude=$lat&longitude=$lon&distance=5000'
     time_namelookup:  0.000020s
        time_connect:  0.000095s
     time_appconnect:  0.000000s
    time_pretransfer:  0.000121s
       time_redirect:  0.000000s
  time_starttransfer:  2.209732s
                     ----------
          time_total:  2.246882s

Example query

SELECT stop.pk AS stop_pk, service_map_group.id AS service_map_group_id,
       route.pk AS route_pk, route.id AS route_id, route.agency_pk AS route_agency_pk,
       route.source_pk AS route_source_pk, route.system_pk AS route_system_pk,
       route.color AS route_color, route.text_color AS route_text_color,
       route.short_name AS route_short_name, route.long_name AS route_long_name,
       route.description AS route_description, route.url AS route_url,
       route.type AS route_type, route.sort_order AS route_sort_order,
       route.continuous_pickup AS route_continuous_pickup,
       route.continuous_drop_off AS route_continuous_drop_off
FROM stop
JOIN system ON system.pk = stop.system_pk
JOIN service_map_group
    ON service_map_group.system_pk = system.pk
    AND service_map_group.use_for_routes_at_stop
LEFT OUTER JOIN service_map
    ON service_map.group_pk = service_map_group.pk
    AND service_map.pk IN (
        SELECT service_map_vertex.map_pk AS service_map_vertex_map_pk
        FROM service_map_vertex
        WHERE service_map_vertex.stop_pk = stop.pk)
LEFT OUTER JOIN route ON route.pk = service_map.route_pk
WHERE stop.pk IN (1024, 1153, 514, 898, 1156, 517, 901, 1159, 904, 1162, 1165, 1297, 790, 1531, 793, 796, 799, 802, 439, 442, 445, 448, 451, 1348, 454, 1351, 457, 1354, 460, 1357, 463, 1360, 1363, 85, 1240, 88, 91, 1243, 1501, 94, 1246, 97, 610, 1249, 1252, 613, 1255, 616, 1513, 1514, 1515, 1258, 1516, 1523, 1012, 1015, 1144, 1529, 1018, 1147, 1021, 1150, 895)
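
For reference, the plan behind these timings can be inspected directly in psql. The sketch below is just the query above wrapped in EXPLAIN (ANALYZE, BUFFERS), with the column and pk lists shortened for readability; it should show whether the correlated service_map_vertex subquery is being re-evaluated for every joined row.

-- Diagnostic sketch only: the same query as above, shortened for readability.
EXPLAIN (ANALYZE, BUFFERS)
SELECT stop.pk AS stop_pk, service_map_group.id AS service_map_group_id, route.pk AS route_pk
FROM stop
JOIN system ON system.pk = stop.system_pk
JOIN service_map_group
    ON service_map_group.system_pk = system.pk
    AND service_map_group.use_for_routes_at_stop
LEFT OUTER JOIN service_map
    ON service_map.group_pk = service_map_group.pk
    AND service_map.pk IN (
        -- correlated subquery: the likely source of the slowdown
        SELECT service_map_vertex.map_pk
        FROM service_map_vertex
        WHERE service_map_vertex.stop_pk = stop.pk)
LEFT OUTER JOIN route ON route.pk = service_map.route_pk
WHERE stop.pk IN (1024, 1153, 514);  -- shortened pk list for the example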
jamespfennell commented 2 years ago

Thank you for the report and the investigation! 2 seconds for a 5km search is pretty bad!

I played around with the POST /stops endpoint and the query you posted and I think you're exactly right - it's a really bad query and the reason this endpoint is so slow.

Even worse, I changed the last condition in the query so that it would return all data for the NYC subway and this one takes 20 seconds!

SELECT stop.pk AS stop_pk, service_map_group.id AS service_map_group_id,
       route.pk AS route_pk, route.id AS route_id, route.agency_pk AS route_agency_pk,
       route.source_pk AS route_source_pk, route.system_pk AS route_system_pk,
       route.color AS route_color, route.text_color AS route_text_color,
       route.short_name AS route_short_name, route.long_name AS route_long_name,
       route.description AS route_description, route.url AS route_url,
       route.type AS route_type, route.sort_order AS route_sort_order,
       route.continuous_pickup AS route_continuous_pickup,
       route.continuous_drop_off AS route_continuous_drop_off
FROM stop
JOIN system ON system.pk = stop.system_pk
JOIN service_map_group
    ON service_map_group.system_pk = system.pk
    AND service_map_group.use_for_routes_at_stop
LEFT OUTER JOIN service_map
    ON service_map.group_pk = service_map_group.pk
    AND service_map.pk IN (
        SELECT service_map_vertex.map_pk AS service_map_vertex_map_pk
        FROM service_map_vertex
        WHERE service_map_vertex.stop_pk = stop.pk)
LEFT OUTER JOIN route ON route.pk = service_map.route_pk
WHERE system.pk = 1;

However, I think it's possible to do much better. The following query returns more-or-less the same information about all stops in the subway and takes only ~100ms to run:

SELECT stop.pk AS stop_pk, stop.id AS stop_id, service_map_group.id AS service_map_group_id, route.pk AS route_pk, route.id AS route_id
FROM stop
JOIN system 
    ON system.pk = stop.system_pk 
JOIN service_map_group 
    ON service_map_group.system_pk = system.pk 
    AND service_map_group.use_for_routes_at_stop 
LEFT OUTER JOIN service_map 
    ON service_map.group_pk = service_map_group.pk
LEFT OUTER JOIN service_map_vertex
    ON service_map_vertex.map_pk = service_map.pk 
    AND service_map_vertex.stop_pk = stop.pk
LEFT OUTER JOIN route ON route.pk = service_map.route_pk 
WHERE system.pk = 1;
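
A complementary tweak, untested and possibly redundant since these indexes may well already exist in the schema: plain b-tree indexes on the join columns used above would give the planner an index path for both service map lookups.

-- Untested sketch: these indexes may already exist in Transiter's schema.
CREATE INDEX IF NOT EXISTS idx_service_map_vertex_map_pk_stop_pk
    ON service_map_vertex (map_pk, stop_pk);
CREATE INDEX IF NOT EXISTS idx_service_map_group_pk
    ON service_map (group_pk);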

So I think we have a path to making this endpoint not too slow and usable for your use case.

In terms of fixing the code: over the last few months I've actually been working on porting Transiter from Python to Go - this is happening on the Go branch. The reasons were partly performance-related. This migration is nearly done and I was planning to merge it into mainline in a couple of weeks.

I haven't implemented the geographical search endpoint in the Go code yet - so I guess one solution here is just to use better queries when I do that, and to ensure that the endpoint is benchmarked? Updating the Python code is also possible... for example, this is where the slow query is built.
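
Just to make the "better queries" idea concrete, here is a very rough sketch of what the geographic search could look like on top of the faster join above. The stop.latitude / stop.longitude column names, the system.id filter, and the :lat / :lon / :distance parameter style are all placeholders for illustration, not the actual schema or the planned Go implementation.

-- Hypothetical sketch only; column names and parameters are assumptions.
SELECT stop.pk AS stop_pk, stop.id AS stop_id,
       service_map_group.id AS service_map_group_id,
       route.pk AS route_pk, route.id AS route_id
FROM stop
JOIN system ON system.pk = stop.system_pk
JOIN service_map_group
    ON service_map_group.system_pk = system.pk
    AND service_map_group.use_for_routes_at_stop
LEFT OUTER JOIN service_map ON service_map.group_pk = service_map_group.pk
LEFT OUTER JOIN service_map_vertex
    ON service_map_vertex.map_pk = service_map.pk
    AND service_map_vertex.stop_pk = stop.pk
LEFT OUTER JOIN route ON route.pk = service_map.route_pk
WHERE system.id = 'us-ny-subway'
  -- crude bounding box: roughly 111,111 meters per degree of latitude
  AND stop.latitude BETWEEN :lat - :distance / 111111.0
                        AND :lat + :distance / 111111.0
  AND stop.longitude BETWEEN :lon - :distance / (111111.0 * cos(radians(:lat)))
                         AND :lon + :distance / (111111.0 * cos(radians(:lat)));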

cedarbaum commented 2 years ago

Appreciate the follow-up! I think if the Go version is the future, then using the optimized query there would be great!

On a slight tangent, will the Go version still be designed to be deployed on k8s and have separately scalable web and feed update services?

jamespfennell commented 2 years ago

The Go version is still designed to be deployed as a Docker image (the Go branch has a working Dockerfile), which means Docker Compose or Kubernetes depending on your setup. One simplification is that it no longer requires Celery and the 4 Python jobs become a single Go job, so the configuration is a little simpler.

Regarding scaling: yes for the web service, but no for the feed update process. In the Go architecture there is a single binary that schedules and executes feed updates. It's true the Python version can distribute feed update work across multiple nodes, but that comes at a large complexity cost and is sort of overkill in my opinion. Postgres only supports a single write instance, which must run on a single machine, so scaling the feed update process across multiple machines doesn't really unlock much.