r-transit / tidytransit

R package for working with GTFS data
https://r-transit.github.io/tidytransit/
150 stars 22 forks source link

common GTFS time bug "on those days that the DST <-> standard time switch occurs on" also affects this package #175

Open derhuerst opened 3 years ago

derhuerst commented 3 years ago

GTFS Time is not defined relative to midnight, but relative to noon - 12h. While that makes "writing" GTFS feeds easier, it makes processing a lot harder.

Expected functionality

As explained in my note about GTFS Time values, with the Europe/Berlin time zone (+1h standard time to +2 DST shift occurs at 2021-03-28T02:00+01:00), I expect

Describe the bug

I'm very inexperienced with R and not that familiar with this code base, but it seems that tidytransit is affected by this problem on those days that the DST <-> standard time switch occurs on.

I'm not sure how that actually manifests in tidytransit's output, but I assume that wrong delays will be calculated, or that realtime data can't be matched against static data.

I tried to find some places in the code base:


related: https://github.com/google/transit/pull/15

derhuerst commented 3 years ago

If this is not the case, please excuse me for the noise, and close this Issue!

derhuerst commented 3 years ago

same (potential) problem in other libs:

related:

mpadge commented 3 years ago

Thanks @derhuerst, that's all very concerning! I don't have time right now to look further, but will definitely be keeping an eye on here to see where this takes us. It must also affect gtfsrouter, because i don't explicitly take care of that kind of situation. A general solution would be really helpful for all.

polettif commented 3 years ago

Thanks, I have some trouble wrapping my head around it at the moment. However, as an initial reaction I'd say this is more of an issue for feed providers than consumers. tidytransit does not use UNIX datetime values (like 2021-03-28T00:30+01:00) only hms values (00:30:00 without date information). This might be a problem in filter_feed_by_date where we don't include trips running past midnight from previous dates.

Generally, I haven't really run into issues with the feeds I work with (well, we rarely focus on saturday nights). Also, I'd really like to avoid adding timezones in tidytransit...

derhuerst commented 3 years ago

Thanks, I have some trouble wrapping my head around it at the moment.

I can relate. ๐Ÿ˜„ I keep forgetting the details of this weird date/time handling after a while, and have to get familiar with it again.

However, as an initial reaction I'd say this is more of an issue for feed providers than consumers.

Both, right? Providers/authors need to know about this when transferring their timetables into GTFS, and consumers need to know about it when processing GTFS.

tidytransit does not use UNIX datetime values (like 2021-03-28T00:30+01:00) only hms values (00:30:00 without date information).

I'm not sure about frequencies.R, but raptor.R works with absolute times, right? How would you otherwise handle routing across date "boundaries" (a.k.a. routing across trips defined with difference service days).

This might be a problem in filter_feed_by_date where we don't include trips running past midnight from previous dates.

Sounds like another issue.

Generally, I haven't really run into issues with the feeds I work with (well, we rarely focus on saturday nights).

I made the experience, that this kind of bug is very hard to come across in practice, but at least in Germany, there are many trips running across the DST <-> standard time boundaries.

Another problem is that so many public-transport-related software has buggy date/time handling, even widely use ones, so it's hard to identify the intended behaviour.

Also, I'd really like to avoid adding timezones in tidytransit...

In this case, I'd propose to clearly document this shortcoming somewhere, so that users of tidytransit are aware of the consequences.

mpadge commented 3 years ago

Totally agree on your concluding statement @derhuerst!

polettif commented 3 years ago

Both, right? Providers/authors need to know about this when transferring their timetables into GTFS, and consumers need to know about it when processing GTFS.

Yes, of course. I guess my point is that tidytransit should just handle feeds "as is" and not try to fix incorrect data. I'm still not sure if that point really applies though ๐Ÿ˜…

I'm not sure about frequencies.R, but raptor.R works with absolute times, right? How would you otherwise handle routing across date "boundaries" (a.k.a. routing across trips defined with difference service days).

Well, it doesn't handle routing across date boundaries... However, that's not explicitly documented. The date used in filter_feed_by_date isn't the actual date but the date a trip belongs to ("Betriebstag"). It filters stop_times.txt by service_ids running on that day, there's no padding with trips running on service dates before or after. A new function like filter_feed_by_timeframe (with datetimes as limits) could cover that. But as you said, that's another issue.

I made the experience, that this kind of bug is very hard to come across in practice, but at least in Germany, there are many trips running across the DST <-> standard time boundaries.

They exist in Switzerland as well, there are actually more than 6000 trips running between 00:00 and 03:00 on October 31st this year (assuming my calculations in this gist are correct).

In this case, I'd propose to clearly document this shortcoming somewhere, so that users of tidytransit are aware of the consequences.

Definitely. A lot of this issue comes down to properly documenting assumptions and intended behavoir.

derhuerst commented 3 years ago

Both, right? Providers/authors need to know about this when transferring their timetables into GTFS, and consumers need to know about it when processing GTFS.

Yes, of course. I guess my point is that tidytransit should just handle feeds "as is" and not try to fix incorrect data. I'm still not sure if that point really applies though ๐Ÿ˜…

Just to prevent a misunderstanding: A GTFS provider can author perfectly valid GTFS data that follows GTFS's definition of day-relative time, and AFAICT tidytransit wouldn't parse it correctly; This is not about handling the feed "as is", as there is no other reasonable assumption for tidytransit to make than assuming that a GTFS feed follows the GTFS spec.

If you're talking about programmatically "guessing" that the provider/author of the GTFS feed violated the spec by using a more intuitive definition of day-relative time: Let's not do that. ๐Ÿ˜„

tbuckl commented 8 months ago

would be great if we could link to a concise and clear explanation of this bug with demonstrated impact for users.

derhuerst commented 6 months ago

We could add this to https://gist.github.com/derhuerst/574edc94981a21ef0ce90713f1cff7f6. You're very welcome to propose in a comment on what to add to it!