jamespfennell / transiter

Web service for transit data
https://demo.transiter.dev
MIT License
55 stars 6 forks source link

Modify stops endpoint to allow filtering by distance from a location #90

Closed cedarbaum closed 1 year ago

cedarbaum commented 1 year ago

This is a proposed change to allow fetching stops by geolocation. It is similar to the API that existed in the Python version, with some differences:

If this proposed API looks good, there are a couple more tasks to complete:

jamespfennell commented 1 year ago

Hi Sam - awesome to see a PR on the new Go code for this feature! I’m currently traveling with no computer access, and so will engage more deeply when I get back home next week if that’s okay?

My high level thoughts are, first, that your idea of having the geographic search integrated in the GET endpoint is the right way to go. I think fewer endpoints = simpler API.

Second, the concern I have with this approach, which is why I originally created a different endpoint, is to do with performance. If the SQL query has a WHERE clause involving the latitude and longitude then a full table scan will be needed to resolve the query. However I think from looking at the PR that you have already factored this in and you use the geographic SQL query only when geographic search is requested. This is awesome.

So basically the endpoint has two modes depending on whether geographic search is requested (using your new FilterByDistance parameter). I’m thinking perhaps we should more explicitly expose this dichotomy in the API? For example, if a user requests a geographic search maybe we should be explicit that the results will be ordered by distance, rather than stop ID? (Or should we not order them by distance?) And perhaps we should disallow pagination, or support a different kind of pagination based on distance?

On Mon, Jan 16, 2023, at 4:43 PM, Sam Cedarbaum wrote:

This is a proposed change to allow fetching stops by geolocation. It is similar to the API that existed in the Python version, with some differences:

  • Does not use a separate POST endpoint

  • Does not return distance with each stop.

    • This was done to simplify the generated type structure. Adding a distance to the return type forces a new type to be reconciled with the model's existing Stop type. If consumers need distance, it can easily be recalculated via the returned Latitude and Longitude fields on Stop. If this proposed API looks good, there are a couple more tasks to complete:
  • Add tests

  • Add documentation

You can view, comment on, or merge this pull request online at:

https://github.com/jamespfennell/transiter/pull/90

Commit Summary

(21 files https://github.com/jamespfennell/transiter/pull/90/files)

— Reply to this email directly, view it on GitHub https://github.com/jamespfennell/transiter/pull/90, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC4T7SST4IL3RVBD2L45PO3WSW6GTANCNFSM6AAAAAAT5FJ42U. You are receiving this because you are subscribed to this thread.Message ID: @.***>

cedarbaum commented 1 year ago

Thanks for looking and for the initial feedback! I've updated the PR based on the suggestions:

  1. Explicitly allow sort_mode to be specified when using the geographical version of list stops. This can be ID (the default) or DISTANCE.
  2. I removed pagination - I think, for geographical searches, limiting the max_distance is sufficient if one needs to limit results. Further, the performance for even large distances isn't too bad (see below).

No rush to review this; hope you enjoy your vacation!

Performance metrics

# 1KM
❯ curltime -X GET "localhost:8080/systems/us-ny-subway/stops?max_distance=1&latitude=$LAT&longitude=$LON&filter_by_distance=true"
    time_namelookup:  0.002776s
        time_connect:  0.002998s
     time_appconnect:  0.000000s
    time_pretransfer:  0.003015s
       time_redirect:  0.000000s
  time_starttransfer:  0.065140s
                     ----------
          time_total:  0.065253s

# 5KM
❯ curltime -X GET "localhost:8080/systems/us-ny-subway/stops?max_distance=5&latitude=$LAT&longitude=$LON&filter_by_distance=true"
    time_namelookup:  0.003572s
        time_connect:  0.003805s
     time_appconnect:  0.000000s
    time_pretransfer:  0.003832s
       time_redirect:  0.000000s
  time_starttransfer:  0.300522s
                     ----------
          time_total:  0.306107s

# 500KM
❯ curltime -X GET "localhost:8080/systems/us-ny-subway/stops?max_distance=500&latitude=$LAT&longitude=$LON&filter_by_distance=true"
    time_namelookup:  0.002956s
        time_connect:  0.003194s
     time_appconnect:  0.000000s
    time_pretransfer:  0.003214s
       time_redirect:  0.000000s
  time_starttransfer:  0.781933s
                     ----------
          time_total:  0.799326s
jamespfennell commented 1 year ago

Just getting around to looking now!

From the PR, it seems the list stops endpoint would have 3 "modes":

  1. Listing stops by ID, ordered by ID (current code).
  2. Listing stops within a specified distance of a center point, ordered by the distance from that center point (in this PR, powered by theListStopsInSystemGeo_ByDistance SQL query).
  3. Listing stops within a specified distance of a center point, ordered by ID (also in this PR, powered by the ListStopsInSystemGeo_ById SQL query).

I wonder should we just not include the last use case (3)? Worst case scenario, a caller could just use (2) and then sort on their end which is easy because there is no pagination for the geo queries. Supporting use case (3) makes the API a little trickier I think, and also requires an additional SQL query and code to maintain.

If we only had (1) and (2), then we could change your SORT_MODE argument to be a SEARCH_MODE argument (and remove the filter_by_distance argument). As in the current PR, there would be two values (ID and DISTANCE). In the documentation we could then clearly specify which fields are relevant to each mode. For example, first_id would only be used in ID mode and max_distance would only be used in DISTANCE mode. What do you think?

Thank you again for working on this feature!

cedarbaum commented 1 year ago

I think this makes sense and I've updated the PR with the proposed changes. Please let me know if this looks OK and also what your thoughts on testing are (I didn't see any existing tests for the /stops endpoint yet in the go version). Thanks!

Also, regarding the CI failures, I believe this is due to me using a different version of sqlc than the CI. I was having trouble matching the version exactly, since homebrew only seems to have v1.16.0. Not sure what the best fix is here, though I am certainly open to trying to match the version directly if that's the easiest way forward.

jamespfennell commented 1 year ago

I think this looks really awesome! Also the implementation looks really clean and simple which is great. Thank you so much for contributing it!

Regarding testing: in the Python version I had a bunch of unit tests for all of the endpoints and all of the SQL queries. But I didn't migrate them to Go, partially out of laziness and partially because with Go's type system you get a lot of coverage for free IMO. (The only exception is that I've been trying to add tests for the really complex SQL queries.) Adding unit tests for this PR would perhaps be nice, but I don't think it's so important?

What would be really nice, though, is an end to end test for the geographic search feature. I'm not sure if you've seen them, but there are E2E tests in the tests/endtoend directory. The list stops endpoint is tested in the test_install_system__stops test in test_installsystem.py. The E2E tests use a transit system with some synthetic GTFS data I created in tests/endtoend/data/gtfsstatic/stops.txt. Right now all the stops have silly lat/lon coordinates, but I think we could put in reasonable coordinates, have a test that searches based on some center point, and then verify we get the stops in the right order back based on some manual calculations? Or even multiple center points. This would provide coverage of the API and the SQL query. If you want to give it a shot you could, but I would also be totally happy to write the E2E test if you don't fancy it?

Lastly regarding setup, I've also encountered similar issues when switching between computers. Perhaps the easiest thing would be to bump the version of sqlc in the go.mod file to 1.16? Then the CI env will match yours.

jamespfennell commented 1 year ago

(I just updated the E2E testing docs in case you're interested in trying to run the tests...the docs may still be confusing though!)

cedarbaum commented 1 year ago

Thanks for the advice with the testing! I was able to add some E2E coverage to the existing test_install_system__stops test.

Also, as suggested, I updated sqlc to match my local setup. I also needed to update buf in the CI Dockerfiles, and now everything seems to match! The CI now fails at the "Login to DockerHub" step, which I believe is expected since contributors cannot access secrets with builds. I think this could be fixed by creating another workflow just for PRs that makes no attempt to publish to Docker Hub (or just conditionally not trying to login on PR workflows).

Please let me know if there is any further feedback.

jamespfennell commented 1 year ago

Awesome, the new tests look great! Thanks again for contributing this feature. I hope the back and forth wasn't too burdensome. When it comes to changes to the API I think it's good to spend some time polishing it because it's harder to change after the fact because of backwards compatibility. I think the result here is really great.

I think I've fixed the CI for PRs on main, but am just going to merge this now.

cedarbaum commented 1 year ago

Awesome, thanks so much for the discussion and feedback! I think this issue can probably be closed now: https://github.com/jamespfennell/transiter/issues/85