Open wklumpen opened 1 year ago
At the moment, it looks like the "date coverage" check only checks if `TransitMode.TRANSIT` is in the departure modes, but not if any subset of these modes is.
Haven’t yet tested this myself, but this can definitely be the case. I’ll check this out now, should be an easy fix
Not sure if we have a test set up for this (but I suspect it would fail if the user passed a set of all modes individually).
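For what it's worth, the fix being discussed could amount to testing membership against the whole enum rather than only the umbrella `TransitMode.TRANSIT` member. A minimal sketch, using a simplified stand-in enum (not r5py's actual class) and a hypothetical helper name:

```python
from enum import Enum


class TransitMode(Enum):
    # simplified stand-in for r5py's TransitMode enum
    TRANSIT = "TRANSIT"
    BUS = "BUS"
    RAIL = "RAIL"
    SUBWAY = "SUBWAY"


def uses_transit(transport_modes):
    """Return True if ANY requested mode is a transit mode,
    not just if the umbrella TransitMode.TRANSIT member is present."""
    return any(isinstance(mode, TransitMode) for mode in transport_modes)
```

With this, passing individual modes such as `[TransitMode.BUS, TransitMode.RAIL]` would still trigger the date coverage check.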
Hm, this is a bit tricky: supplying multiple GTFS files can serve to cover more than one transport system, but it can also be used to cover a longer period of time, with some times covered by one file and others by another. So there should be different warnings for (1) one of the files does not cover the `departure_time`, and (2) none of the files cover `departure_time`. For case (1), we could maybe only warn when `verbose == True`, or at least combine the individual per-data-set warnings into something like 'some data sets do not cover the departure date'.
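The two-tier warning suggested here could be sketched like this (a hypothetical helper, not r5py code; `feed_ranges` is an assumed representation of each file's calendar span):

```python
import datetime
import warnings


def check_date_coverage(feed_ranges, departure_date, verbose=False):
    """Warn once if only SOME feeds cover the date (verbose only),
    and always warn if NO feed covers it.

    feed_ranges: list of (start_date, end_date) datetime.date tuples,
    one per GTFS file (assumed data layout for this sketch).
    """
    covered = [start <= departure_date <= end for start, end in feed_ranges]
    if not any(covered):
        warnings.warn("None of the GTFS data sets cover the departure date")
    elif not all(covered) and verbose:
        # combined warning instead of one warning per data set
        warnings.warn("Some GTFS data sets do not cover the departure date")
```

This keeps case (1) quiet unless the user asked for verbose feedback, while case (2) always warns.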
Since we are opening this up another time: let's also add (1) a check whether the origin and destination geometry is within the geographic extent of the GTFS dataset, and (2) a warning if a transit mode is selected but no GTFS dataset is loaded (we might have this check already, but I'm typing this on the phone).
Finally, (3), let's coordinate this with a possible super-class that combines `TransitMode`, `StreetMode`, and `LegMode` (see issue #265).
Yes, I figured, and just this fix is literally 2 minutes of work. Going to move the other ideas into a separate issue
Edit: Thinking about it, also the other things are not such a big effort, let’s keep them together
> add (1) a check whether the origin and destination geometry is within the geographic extent of the GTFS dataset

I assume we would use `stops.txt` as our boundary?
Also given that #265 is a bit different (and has a PR open to fix it) I think we should keep this issue focused on date validation only.
Also, since you’re the GTFS expert: does `stops.txt` have geometries from which we can create our own convex hull, or should we simply use the `feed_info.txt` extent?
I’ll link these comments into #265 - definitely a better fit there
From my experience, `feed_info.txt` is fickle and often doesn't contain anything useful (e.g. the Chicago and NYC feeds are all missing complete feed info data even though the file is there and technically compliant). Stops contain `stop_lon` and `stop_lat`, which we could use to create convex hulls. However, suppose a GTFS feed covers a single line service, or just subways, for example: we wouldn't get much of a convex hull to work with.
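For illustration, a stop-based hull could be computed over `stop_lon`/`stop_lat` pairs with a plain-Python monotone chain (a hedged sketch, not r5py or GTFS-Lite code). It also shows the degenerate case raised above: a feed whose stops lie along a single line collapses to a two-point "hull":

```python
def convex_hull(points):
    """Andrew's monotone chain convex hull.

    points: iterable of (lon, lat) tuples, e.g. read from stops.txt.
    Returns hull vertices in counter-clockwise order; collinear
    inputs degenerate to their two extreme points.
    """
    points = sorted(set(points))
    if len(points) <= 2:
        return points

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); <= 0 means no left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in points:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(points):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # drop each chain's last point (it repeats the other chain's first)
    return lower[:-1] + upper[:-1]
```

A square of stops with one interior stop yields the four corners; three stops on one line yield just the two endpoints, i.e. a zero-area hull.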
I'm doing a similar thing for another project and I'm wondering if it's better to just leave that type of check to the user input. In part because GTFS doesn't need to span the analysis area (people can walk/bike/drive long distances to stops).
What is probably more useful is checking the extent of the PBF file, but perhaps R5 does that already (haven't tried it).
> What is probably more useful is checking the extent of the PBF file, but perhaps R5 does that already (haven't tried it).
That’s something worth checking.
> Suppose a GTFS covers a single line service or just does subways for example. We wouldn't get much of a convex hull to work with.
True, I did not consider that cities with just one line even exist. If the different lines are split into several GTFS feeds, then a convex hull of all stops of all feeds would make sense, though
Even so, I can see use cases where the area created by the convex hull does not span the entire analysis area. If it's a reasonably quick check, I think it's fine to do it (are we getting into "verbosity" setting territory?), but it should definitely only be a warning, with a clear explanation that the mismatch might be intentional.
> Even so, I can see use cases where the area created by the convex hull does not span the entire analysis area. If it's a reasonably quick check, I think it's fine to do it (are we getting into "verbosity" setting territory?), but it should definitely only be a warning, with a clear explanation that the mismatch might be intentional.
Absolutely, a warning should suffice, and maybe even only when running in verbose mode.
Opened a new issue #271 concerning the geometry/extent checking
From our call:
How do we feel about this being in release 0.0.6?
Now that I'm working with batches of regional GTFS feeds, I have noticed that the warning is thrown even when my own validation using `gtfs-lite` doesn't seem to find any issues with the time I've supplied (to the extent that I don't know which of the 7 GTFS files is the culprit). However, `gtfs-lite` only checks the date, not the specific time. I can try to see if I can find the GTFS feed that's throwing a warning in r5py but not in gtfs-lite.
But that got me thinking further: it is possible that an analysis using a set of GTFS feeds might fall outside the service hours of one of them (for example, a commuter rail feed that runs only at peak periods, while the analysis time is set to 14:00). This might be totally fine (especially for automated tasks). It is also possible that an analysis window is only half-covered by a GTFS feed.
In any case, I'm not sure this input date coverage check should be baked into the `r5py` workflow. I guess it comes down to a philosophical discussion of how much data checking we want to provide to the user (without them asking), and how much we leave to guides/documentation/the eventual textbook that we inevitably publish :smile:
Suggested behaviour here is that if the mode is set to `verbose` or a `validate_data` flag is supplied, AND at least one transit mode is specified in `transport_modes`:
I would not introduce an additional `validate_data` flag, but with everything else I’m with you. Should the exception be thrown independently of whether or not we’re in `verbose` mode?
I remember you mentioning that the information in `calendar.txt` is often imprecise. However, the premise ‘at least one departure within the `departure_window`’ might not be bullet-proof either: in some of our research, for instance, we look at diurnal differences in accessibility, and it’s perfectly valid that certain places do not have any departures within the time window of (say) 2–3 am. How bad is the `calendar.txt` situation, really? Can we somehow find a middle way (derive `start_date` and `end_date` from actual routes rather than simply reading them from the metadata)?
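A rough sketch of what an 'at least one departure within the `departure_window`' check could look like over `stop_times.txt` rows (column names per the GTFS spec; the helper name and row layout are assumptions for this sketch). Note that GTFS allows `departure_time` hours of 24 and above for after-midnight trips:

```python
def departures_in_window(stop_times_rows, window_start, window_end):
    """Count departures whose departure_time falls in [window_start, window_end).

    stop_times_rows: dicts with a "departure_time" key, as read from
    stop_times.txt. Times are "HH:MM:SS" strings; GTFS permits H >= 24.
    """
    def seconds(timestring):
        hours, minutes, secs = map(int, timestring.split(":"))
        return hours * 3600 + minutes * 60 + secs

    lo, hi = seconds(window_start), seconds(window_end)
    return sum(
        lo <= seconds(row["departure_time"]) < hi for row in stop_times_rows
    )
```

A zero count in a 2–3 am window can be perfectly valid, which is why this alone shouldn't trigger a hard failure.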
> In any case, I'm not sure this input date coverage check should be baked into the r5py workflow. I guess it comes down to a philosophical discussion of how much data checking we want to provide to the user (without them asking), and how much we leave to guides/documentation/the eventual textbook that we inevitably publish
Philosophical, indeed! I think warning if verbose feedback is requested is alright, and failing if no file claims (!) to cover the `departure_window` is fine as well. Not sure whether we should go much further, TBH.
Reminding myself here that a `RuntimeWarning` is still being thrown for GTFS data sets that should be valid; more importantly, one or more of them are valid within a larger set. Perhaps the wording of the warning here should be "One or more of the GTFS data sets is outside the time range covered by currently loaded data sets".
@christophfink do we still want to provide a check for the dates? If so, we need to consider the following (as I do for GTFS-lite):

- Is there a `calendar.txt` service that covers the date?
- Is there a `calendar_dates.txt` entry that adds service on the date (exception type 1), or alternatively removes service on the date (exception type 2)?

Note: this does not check if there are any trips on that date. For example, a `calendar.txt` definition might only cover weekdays, and will appear valid for an analysis run on a Saturday despite not running service on a Saturday.

Also note: I don't know how extensive the checking of this data is on the R5 end.
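The two checks in the list above could be sketched roughly as follows (a simplified stand-in, not GTFS-Lite's implementation; the row dicts mirror the `calendar.txt` and `calendar_dates.txt` fields from the GTFS reference):

```python
import datetime

# calendar.txt weekday flag columns, indexed by datetime.date.weekday()
WEEKDAY_FIELDS = [
    "monday", "tuesday", "wednesday", "thursday",
    "friday", "saturday", "sunday",
]


def date_has_service(date, calendar_rows, calendar_dates_rows):
    """True if at least one service_id is active on `date`.

    calendar_rows: dicts with service_id, start_date/end_date (YYYYMMDD)
    and the seven weekday flags ("0"/"1").
    calendar_dates_rows: dicts with service_id, date, exception_type
    ("1" adds service on the date, "2" removes it).
    """
    def parse(yyyymmdd):
        return datetime.datetime.strptime(yyyymmdd, "%Y%m%d").date()

    active = set()
    for row in calendar_rows:
        if (parse(row["start_date"]) <= date <= parse(row["end_date"])
                and row[WEEKDAY_FIELDS[date.weekday()]] == "1"):
            active.add(row["service_id"])
    for row in calendar_dates_rows:
        if parse(row["date"]) == date:
            if row["exception_type"] == "1":
                active.add(row["service_id"])
            elif row["exception_type"] == "2":
                active.discard(row["service_id"])
    return bool(active)
```

As the note above says, an active service still doesn't guarantee any trips actually run on that date; checking that would mean joining through `trips.txt`.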
This is a pretty serious can of worms that I've tackled over at GTFS-lite with `valid_date()` and `date_trips()`.
As I've expressed before, my opinion is to leave the data wrangling and validation up to the user. We can provide guides and how-tos if we are interested, but ultimately there are so many permutations of what might be going wrong we can't cover them all in R5py without basically recreating things like the MobilityData Validator or GTFS-Lite package.
I see two viable options really:
Loading in a file can be quite slow (5-6 seconds) if the feed is large - but perhaps the tradeoff is worth it. Could also be something we allow the user to specify?
My vote is to put those few lines of code outside of the regional task and have people check their dates ahead of time, much like other data preparation steps.
Happy to make a PR to include a validation example and remove the warning if this is the way we go.
> Happy to make a PR to include a validation example and remove the warning if this is the way we go.

That would be very much appreciated, thanks!
TODO: Remove date checking warning and add date checking code to examples.
> At the moment, it looks like the "date coverage" check only checks if `TransitMode.TRANSIT` is in the departure modes, but not if any subset of these modes is. Not sure if we have a test set up for this (but I suspect it would fail if the user passed a set of all modes individually).
Further, the `RuntimeWarning` that gets thrown if a date is off should ideally specify which feed in the GTFS list is causing the issue (currently trying to debug whether this is throwing a warning unnecessarily).
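A per-feed version of that warning could look something like this (a hypothetical helper and data layout, not r5py's actual code):

```python
import datetime
import warnings


def warn_uncovered_feeds(feed_ranges, departure_date):
    """Emit one RuntimeWarning per GTFS file that does not cover the date.

    feed_ranges: dict mapping filename -> (start_date, end_date),
    an assumed representation of each feed's calendar span.
    """
    for filename, (start, end) in feed_ranges.items():
        if not (start <= departure_date <= end):
            warnings.warn(
                f"GTFS file {filename!r} does not cover the requested "
                f"departure date ({departure_date.isoformat()})",
                RuntimeWarning,
            )
```

Naming the offending file in the message would make debugging a batch of seven regional feeds considerably easier than a single generic warning.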