r-transit / tidytransit

R package for working with GTFS data
https://r-transit.github.io/tidytransit/
143 stars 21 forks source link

re-write the frequency vignette and improve introduction vignette #108

Closed tbuckl closed 4 years ago

tbuckl commented 5 years ago

see https://github.com/r-transit/tidytransit/issues/106 for more

tbuckl commented 5 years ago

@polettif i created this issue and tagged all the other open issues with frequencies in order to clarify the project.

tbuckl commented 5 years ago

i've said this before, but i think the issue is that GTFS wasn't designed to calculate frequency of service. so when feeds show up for which it doesn't work, i'm not that surprised. its not that they are "non-standard." quite the contrary. the standard is just not built for frequencies. in the package, its calculation is a hack.

polettif commented 5 years ago

Thanks for clearing up the issues.

i've said this before, but i think the issue is that GTFS wasn't designed to calculate frequency of service.

Yeah, I'd say so too. Personally, I haven't come across many feeds using frequencies.txt to begin with, probably because they offer less flexibility. And calculating general frequencies from stop_times is, as you said, pretty difficult...

polettif commented 4 years ago

What's the status on this @tbuckl? Because I started working on generalising sf data and I ran into several problem with it because route frequency, stop frequency and their generating/plotting functions are intertwined quite a bit. This wouldn't be a problem as such but since not all feeds have shapes and frequencies it's hard to make it work for basic feeds.

I'd like to make the following changes which might lead to reduced functionality for some use cases (at least initially):

  1. remove route frequency calculations from read_gtfs since there's, as you said, not a standard way yet to calculated these. It would still be possible via get_route_frequency
  2. plot.gtfs() only plots stops and shapes (if available) in some form, not route frequencies
  3. The only sf data frames stored in a gtfs_obj should be stops_sf and shapes_sf, as they represent the basic geospatial data in a feed. Currently routes_sf uses (calculated) frequencies and is created from shapes via trips. That functionality should be moved to a standalone function.
tbuckl commented 4 years ago

@polettif this is a pretty big issue, so i haven't had time to tackle it. i'm glad you've thought a bit about how to break it up below. i can imagine a number of issues coming out of this discussion.

thoughts on your proposed changes:

1) this is a good point. two thoughts. first, i think we'd need to develop a strong reason to deprecate functionality that is optional and currently used. second, it sounds like the reason is: there is no standard. but i think there are plenty of examples of successful standards that aren't correct but that are frequently used (atomic precision in GeoJSON and EPSG 4326 come to mind). most of geojson coordinates in 4326 are overly precise. and so, strictly speaking, not correct. however, that doesn't keep people from using geojson. on the contrary, its a popular standard because it "just works." in my experience, thats also true of the frequency functions here. i think it might make sense to open an issue in which we more clearly document for users that frequencies are approximations (through warnings in the console). this is similar to how st_transform works on spherical coordinates (4326), where the user is alerted that the function is basically "off" by a lot for the data type they are using. if that sounds ok, then i can open an issue for that.

2) makes sense.

3) this makes sense in general too. but again, as with (1), i'm unclear on how it looks for users and what the compelling reason to remove functionality for users is. many users new to GTFS want to know "whats the geometry of X route." i've actually forgotten: does routes_sf necessitate frequencies in the package? my intent was to have routes_sf just be routes and not require frequencies.

anyway, should have some time to provide more thoughtful comments next weekend.

polettif commented 4 years ago

Thanks for your reply. I didn't consider backwards compatibility enough, good point.

second, it sounds like the reason is: there is no standard.

Sorry, I didn't phrase it right, I wasn't referring to the GTFS standard in any way. Currently get_route_frequency and get_stop_frequency use default parameters to calculate frequencies which are not documented enough, IMO. While those defaults might be reasonable, it's not trivial to extract the service period you want to do calculations on (as discussed in #72, #106 and as you pointed out in #119).

I guess my point is: Any analysis should be done with purposely defined parameters after the feed has been read. As it stands now, using frequency=TRUE is a bit of a black box.

this makes sense in general too. but again, as with (1), i'm unclear on how it looks for users and what the compelling reason to remove functionality for users is. many users new to GTFS want to know "whats the geometry of X route."

The basic geometry elements (besides stops) are shapes which are connected to one or more trips. If we do have a shapes_sf (#115), the shapes for a route can be obtained like this

gtfs_obj$trips %>% 
  filter(route_id == "route_id" & service_id = "service_id") %>% 
  right_join(gtsf_obj$.$shapes_sf, by = "shape_id")

which could very well be part of its own function (e.g. get_route_shape). I don't think we should precalculate join-and-filter operations upon reading the feed. This might take a very long time for large feeds and means we calculate the same things twice, which are not all needed in the end.

does routes_sf necessitate frequencies in the package? my intent was to have routes_sf just be routes and not require frequencies.

No I'm sorry, I mixed up functions. This was about plot.gtfs.

tbuckl commented 4 years ago

@polettif ah, i see. you're right! the process for many (if not all) feeds should be to let the user 1) read the feed 2) determine for themselves the right service id and then 3) calculate frequency or route geometry as they need. i can imagine re-working the introductory and frequencies vignette in such a way. then we can deprecate the frequency=TRUE/FALSE and geometry=TRUE/FALSE flag over time. i'll rework the issues to reflect this plan.