Open wklumpen opened 1 year ago
At the moment, it looks like the "date coverage" check only checks if `TransitMode.TRANSIT` is in the departure modes, but not if any subset of these modes is.
Haven’t yet tested this myself, but this can definitely be the case. I’ll check this out now, should be an easy fix
Not sure if we have a test set up for this (but I suspect it would fail if the user passed a set of all modes individually).
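For what it's worth, the fix being discussed could amount to testing membership against the whole enum rather than only the umbrella `TransitMode.TRANSIT` member. A minimal sketch, using a simplified stand-in enum (not r5py's actual class) and a hypothetical helper name:

```python
from enum import Enum


class TransitMode(Enum):
    # simplified stand-in for r5py's TransitMode enum
    TRANSIT = "TRANSIT"
    BUS = "BUS"
    RAIL = "RAIL"
    SUBWAY = "SUBWAY"


def uses_transit(transport_modes):
    """Return True if ANY requested mode is a transit mode,
    not just if the umbrella TransitMode.TRANSIT member is present."""
    return any(isinstance(mode, TransitMode) for mode in transport_modes)
```

With this, passing individual modes such as `[TransitMode.BUS, TransitMode.RAIL]` would still trigger the date coverage check.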
Hm, this is a bit tricky: supplying multiple GTFS files can serve to cover more than one transport system, but it can also be used to cover a longer period of time, with some times covered by one file and others by another. So there should be different warnings for (1) one of the files does not cover the `departure_time`, and (2) none of the files cover `departure_time`. For case (1), we could maybe only warn when `verbose == True`, or at least combine the individual per-data-set warnings into something like 'some data sets do not cover the departure date'.
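The two-tier warning suggested here could be sketched like this (a hypothetical helper, not r5py code; `feed_ranges` is an assumed representation of each file's calendar span):

```python
import datetime
import warnings


def check_date_coverage(feed_ranges, departure_date, verbose=False):
    """Warn once if only SOME feeds cover the date (verbose only),
    and always warn if NO feed covers it.

    feed_ranges: list of (start_date, end_date) datetime.date tuples,
    one per GTFS file (assumed data layout for this sketch).
    """
    covered = [start <= departure_date <= end for start, end in feed_ranges]
    if not any(covered):
        warnings.warn("None of the GTFS data sets cover the departure date")
    elif not all(covered) and verbose:
        # combined warning instead of one warning per data set
        warnings.warn("Some GTFS data sets do not cover the departure date")
```

This keeps case (1) quiet unless the user asked for verbose feedback, while case (2) always warns.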
Since we are opening this up another time: let's also add (1) a check whether the origin and destination geometry is within the geographic extent of the GTFS dataset, and (2) a warning if a transit mode is selected but no GTFS dataset is loaded (we might have this check already, but I'm typing this on the phone).
Finally, (3), let's coordinate this with a possible super-class that combines `TransitMode`, `StreetMode`, and `LegMode` (see issue #265).
Yes, I figured, and just this fix is literally 2 minutes of work. Going to move the other ideas into a separate issue
Edit: Thinking about it, also the other things are not such a big effort, let’s keep them together
> add (1) a check whether the origin and destination geometry is within the geographic extent of the GTFS dataset

I assume we would use `stops.txt` as our boundary?
Also given that #265 is a bit different (and has a PR open to fix it) I think we should keep this issue focused on date validation only.
Also, since you’re the GTFS expert: does `stops.txt` have geometries from which we can create our own convex hull, or should we simply use the `feed_info.txt` extent?
I’ll link these comments into #265 - definitely a better fit there
From my experience, `feed_info.txt` is fickle and often doesn't contain anything useful (e.g. the Chicago and NYC feeds are all missing complete feed info data even though the file is there and technically compliant). Stops contain `stop_lon` and `stop_lat`, which we could use to create convex hulls. However, suppose a GTFS feed covers a single line service, or just subways, for example: we wouldn't get much of a convex hull to work with.
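For illustration, a stop-based hull could be computed over `stop_lon`/`stop_lat` pairs with a plain-Python monotone chain (a hedged sketch, not r5py or GTFS-Lite code). It also shows the degenerate case raised above: a feed whose stops lie along a single line collapses to a two-point "hull":

```python
def convex_hull(points):
    """Andrew's monotone chain convex hull.

    points: iterable of (lon, lat) tuples, e.g. read from stops.txt.
    Returns hull vertices in counter-clockwise order; collinear
    inputs degenerate to their two extreme points.
    """
    points = sorted(set(points))
    if len(points) <= 2:
        return points

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); <= 0 means no left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in points:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(points):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # drop each chain's last point (it repeats the other chain's first)
    return lower[:-1] + upper[:-1]
```

A square of stops with one interior stop yields the four corners; three stops on one line yield just the two endpoints, i.e. a zero-area hull.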
I'm doing a similar thing for another project and I'm wondering if it's better to just leave that type of check to the user input. In part because GTFS doesn't need to span the analysis area (people can walk/bike/drive long distances to stops).
What is probably more useful is checking the extent of the PBF file, but perhaps R5 does that already (haven't tried it).
> What is probably more useful is checking the extent of the PBF file, but perhaps R5 does that already (haven't tried it).
That’s something worth checking.
> Suppose a GTFS covers a single line service or just does subways for example. We wouldn't get much of a convex hull to work with.
True, I did not consider that cities with just one line even exist. If the different lines are split into several GTFS feeds, then a convex hull of all stops of all feeds would make sense, though
Even so, I can see use cases where the area created by the convex hull does not span the entire analysis area. If it's a reasonably quick check, I think it's fine to do it (are we getting into "verbosity" setting territory?), but it should definitely only be a warning, with a clear explanation that the mismatch might be intentional.
> Even so, I can see use cases where the area created by the convex hull does not span the entire analysis area. If it's a reasonably quick check, I think it's fine to do it (are we getting into "verbosity" setting territory?), but it should definitely only be a warning, with a clear explanation that the mismatch might be intentional.
Absolutely, a warning should suffice, and maybe even only when running in verbose mode.
Opened a new issue #271 concerning the geometry/extent checking
From our call:
How do we feel about this being in release 0.0.6?
Now that I'm working with batches of regional GTFS feeds, I have noticed that the warning is thrown even when my own validation using `gtfs-lite` doesn't seem to find any issues with the time I've supplied (to the extent that I don't know which of the 7 GTFS files is the culprit). However, `gtfs-lite` only checks the date, not the specific time. I can try to see if I can find the GTFS feed that's throwing a warning in r5py but not in gtfs-lite.
But that got me thinking further: it is possible that an analysis using a set of GTFS feeds might fall outside the service hours of one of them (for example, a commuter rail feed that runs only at peak periods, while the analysis time is set to 14:00). This might be totally fine (especially for automated tasks). It is also possible that an analysis window is only half-covered by a GTFS feed.
In any case, I'm not sure this input date coverage check should be baked into the `r5py` workflow. I guess it comes down to a philosophical discussion of how much data checking we want to provide to the user (without them asking), and how much we leave to guides/documentation/the eventual textbook that we inevitably publish :smile:
Suggested behaviour here is that if the mode is set to `verbose` or a `validate_data` flag is supplied, AND at least one transit mode is specified in `transport_modes`:
I would not introduce an additional `validate_data` flag, but with everything else I’m with you. Should the exception be thrown independently of whether or not we’re in `verbose` mode?
I remember you mentioning that the information in `calendar.txt` is often imprecise. However, the premise ‘at least one departure within the `departure_window`’ might not be bullet-proof either: in some of our research, for instance, we look at diurnal differences in accessibility, and it’s perfectly valid that certain places do not have any departures within the time window of (say) 2–3 am. How bad is the `calendar.txt` situation, really? Can we somehow find a middle way (derive `start_date` and `end_date` from actual routes rather than simply reading them from the metadata)?
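A rough sketch of what an 'at least one departure within the `departure_window`' check could look like over `stop_times.txt` rows (column names per the GTFS spec; the helper name and row layout are assumptions for this sketch). Note that GTFS allows `departure_time` hours of 24 and above for after-midnight trips:

```python
def departures_in_window(stop_times_rows, window_start, window_end):
    """Count departures whose departure_time falls in [window_start, window_end).

    stop_times_rows: dicts with a "departure_time" key, as read from
    stop_times.txt. Times are "HH:MM:SS" strings; GTFS permits H >= 24.
    """
    def seconds(timestring):
        hours, minutes, secs = map(int, timestring.split(":"))
        return hours * 3600 + minutes * 60 + secs

    lo, hi = seconds(window_start), seconds(window_end)
    return sum(
        lo <= seconds(row["departure_time"]) < hi for row in stop_times_rows
    )
```

A zero count in a 2–3 am window can be perfectly valid, which is why this alone shouldn't trigger a hard failure.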
> In any case, I'm not sure this input date coverage check should be baked into the r5py workflow. I guess it comes down to a philosophical discussion of how much data checking we want to provide to the user (without them asking), and how much we leave to guides/documentation/the eventual textbook that we inevitably publish
Philosophical, indeed! I think warning if verbose feedback is requested is alright, and failing if no file claims (!) to cover the `departure_window` is fine as well. Not sure whether we should go much further, TBH.
Reminding myself here that a `RuntimeWarning` is still being thrown for GTFS data sets that should be valid; more importantly, one or more of them are valid within a larger set. Perhaps the wording of the warning here should be "One or more of the GTFS data sets is outside the time range covered by currently loaded data sets".
@christophfink do we still want to provide a check for the dates? If so, we need to consider the following (as I do for GTFS-lite):

- Is there a `calendar.txt` service that covers the date?
- Is there a `calendar_dates.txt` entry that adds service on the date (exception type 1), or alternatively removes service on the date (exception type 2)?

Note: this does not check if there are any trips on that date. For example, a `calendar.txt` definition might only cover weekdays, and will appear valid for an analysis run on a Saturday despite not running service on a Saturday.

Also note: I don't know how extensive the checking of this data is on the R5 end.
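The two checks in the list above could be sketched roughly as follows (a simplified stand-in, not GTFS-Lite's implementation; the row dicts mirror the `calendar.txt` and `calendar_dates.txt` fields from the GTFS reference):

```python
import datetime

# calendar.txt weekday flag columns, indexed by datetime.date.weekday()
WEEKDAY_FIELDS = [
    "monday", "tuesday", "wednesday", "thursday",
    "friday", "saturday", "sunday",
]


def date_has_service(date, calendar_rows, calendar_dates_rows):
    """True if at least one service_id is active on `date`.

    calendar_rows: dicts with service_id, start_date/end_date (YYYYMMDD)
    and the seven weekday flags ("0"/"1").
    calendar_dates_rows: dicts with service_id, date, exception_type
    ("1" adds service on the date, "2" removes it).
    """
    def parse(yyyymmdd):
        return datetime.datetime.strptime(yyyymmdd, "%Y%m%d").date()

    active = set()
    for row in calendar_rows:
        if (parse(row["start_date"]) <= date <= parse(row["end_date"])
                and row[WEEKDAY_FIELDS[date.weekday()]] == "1"):
            active.add(row["service_id"])
    for row in calendar_dates_rows:
        if parse(row["date"]) == date:
            if row["exception_type"] == "1":
                active.add(row["service_id"])
            elif row["exception_type"] == "2":
                active.discard(row["service_id"])
    return bool(active)
```

As the note above says, an active service still doesn't guarantee any trips actually run on that date; checking that would mean joining through `trips.txt`.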
This is a pretty serious can of worms that I've tackled over at GTFS-lite with `valid_date()` and `date_trips()`.
As I've expressed before, my opinion is to leave the data wrangling and validation up to the user. We can provide guides and how-tos if we are interested, but ultimately there are so many permutations of what might be going wrong we can't cover them all in R5py without basically recreating things like the MobilityData Validator or GTFS-Lite package.
I see two viable options really:
Loading in a file can be quite slow (5-6 seconds) if the feed is large - but perhaps the tradeoff is worth it. Could also be something we allow the user to specify?
My vote is to put those few lines of code outside of the regional task and have people check their dates ahead of time, much like other data preparation steps.
Happy to make a PR to include a validation example and remove the warning if this is the way we go.
> Happy to make a PR to include a validation example and remove the warning if this is the way we go.

That would be very much appreciated, thanks!
TODO: Remove date checking warning and add date checking code to examples.
> At the moment, it looks like the "date coverage" check only checks if `TransitMode.TRANSIT` is in the departure modes, but not if any subset of these modes is. Not sure if we have a test set up for this (but I suspect it would fail if the user passed a set of all modes individually).
Further, the `RuntimeWarning` that gets thrown if a date is off should ideally specify which feed in the GTFS list is causing the issue (currently trying to debug whether this is throwing a warning unnecessarily).
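A per-feed version of that warning could look something like this (a hypothetical helper and data layout, not r5py's actual code):

```python
import datetime
import warnings


def warn_uncovered_feeds(feed_ranges, departure_date):
    """Emit one RuntimeWarning per GTFS file that does not cover the date.

    feed_ranges: dict mapping filename -> (start_date, end_date),
    an assumed representation of each feed's calendar span.
    """
    for filename, (start, end) in feed_ranges.items():
        if not (start <= departure_date <= end):
            warnings.warn(
                f"GTFS file {filename!r} does not cover the requested "
                f"departure date ({departure_date.isoformat()})",
                RuntimeWarning,
            )
```

Naming the offending file in the message would make debugging a batch of seven regional feeds considerably easier than a single generic warning.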