planarnetwork / dtd2mysql

MySQL / MariaDB import for DTD feeds (fares, timetable and routeing)

Missing calendar entries for schedule after applying association #1

Closed mk-fg closed 7 years ago

mk-fg commented 7 years ago

Hey,

I've implemented associations in mk-fg/open-track-dtd2mysql-gtfs, and running it and dtd2mysql on the same data, I'm noticing quite a few differences in schedules with associations, which look like a bug in dtd2mysql.

In particular, this shows up with the schedules for these trains from my data (which have associations with each other): C73290 C74089 C74104 C74105

Trips/stop_times match, and calendars match between implementations for all trips but this one:

EUS[21:15] - WFJ[21:32/21:33] - CRE[23:53/23:56] - WBQ[1+00:22/1+00:24] - PRE[1+00:58/1+01:00] -
CAR[1+02:24] - EDB[1+03:57/1+04:43] - INK[1+05:01] - KDY[1+05:20/1+05:21] - LEU[1+05:48/1+05:49]
- DEE[1+06:11] - CAN[1+06:25/1+06:26] - ARB[1+06:34] - MTS[1+06:50/1+06:52] -
STN[1+07:15/1+07:17] - ABD[1+07:39]

For this trip, the Python implementation (py) gives 142 running days while dtd2mysql (ts) gives only 10:

gtfs_py:
  <TS ....5.. [2017-05-26 2017-12-08] {2017-08-25}>
  <TS .23.... [2017-12-05 2017-12-06] {}>
  <TS 1...... [2017-05-22 2017-12-04] {2017-08-28}>
  <TS .234... [2017-05-23 2017-11-30] {}>

gtfs_ts:
  <TS .23.... [2017-12-05 2017-12-06] {}>
  <TS 1...... [2017-06-05 2017-06-12] {}>
  <TS .234... [2017-06-06 2017-06-15] {}>

(format is: <TS $weekdays [$date_start $date_end] {$exception_dates}>)
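For reference, a minimal TypeScript sketch of this tuple as a data structure (the type, field names and parser are my own, not from either implementation):

  // One <TS ...> debug tuple; field names are illustrative only.
  interface TimeSpan {
    weekdays: boolean[];      // Mon..Sun, true where the mask has a digit
    dateStart: string;        // ISO date, e.g. "2017-05-26"
    dateEnd: string;          // ISO date, e.g. "2017-12-08"
    exceptionDates: string[]; // dates excluded from the range
  }

  // Parse a line like "<TS ....5.. [2017-05-26 2017-12-08] {2017-08-25}>".
  function parseTimeSpan(line: string): TimeSpan {
    const m = line.match(/<TS ([.\d]{7}) \[(\S+) (\S+)\] \{([^}]*)\}>/);
    if (!m) throw new Error(`not a TS tuple: ${line}`);
    return {
      weekdays: [...m[1]].map(c => c !== "."),
      dateStart: m[2],
      dateEnd: m[3],
      exceptionDates: m[4] ? m[4].split(/\s+/) : [],
    };
  }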

Checking, for example, where <TS ....5.. [2017-05-26 2017-12-08] ...> came from in the py output, there are these overlapping entries, not cancelled by anything (the weekdays are mismatched due to the "over next-midnight" association):

C74089   41627 P 2017-05-27 2017-12-09 .....6.
C74104   39484 P 2017-05-26 2017-12-08 ....5..

And the relevant association entry seems to be:

C74104 C74089     370 P VV N EDB 2017-05-26 2017-12-08 ....5..
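To illustrate the weekday mismatch: with an overnight association the associated schedule runs one calendar day after the base schedule, so the two masks only line up after rotating one of them by a day. A quick sketch (my own helper, not code from either project):

  // Running-day masks are Mon..Sun; any non-dot character marks a running day.
  const toDays = (mask: string): boolean[] => [...mask].map(c => c !== ".");

  // Rotate the associated schedule's days back by one, so its Saturday
  // ".....6." lines up with the base schedule's Friday "....5..".
  const shiftDaysBack = (days: boolean[]): boolean[] =>
    [...days.slice(1), days[0]];

  const assoc = toDays(".....6."); // C74089
  const base = toDays("....5..");  // C74104
  console.log(shiftDaysBack(assoc).join() === base.join()); // true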

So there should be "C74104_C74089" association trip(s) for this timespan in the ts gtfs output, right?

What's also interesting about C74089 is that it has this schedule:

C74089   39294 P 2017-05-24 2017-12-01 ..345..

That schedule would overlap with the C74104_C74089 one created via the next-midnight association from ".....6.", as it'd start on the previous day relative to the stops in the C74104 schedule, if I'm not mixing anything up.

I wonder if that might be what's causing the issue here.

Just in case, here's a link to a somewhat unnecessarily long diff with all the db entries (from my CIF data): https://gist.github.com/mk-fg/f4da61e5d753be55870e99949d9752bc

linusnorton commented 7 years ago

Thanks for reporting this in such a detailed way. I've confirmed there is a bug; now I just need to track it down.

linusnorton commented 7 years ago

The issue appears to have been caused by shifting the date of the associated service to match the base service.

Using the association between C74089 and C74104 on <TS ....5.. [2017-05-26 2017-12-08] {2017-08-25}> as an example: the associated service runs <TS .....6. [2017-05-27 2017-12-09] {2017-08-26}>, and when it is split with the base service it adopts the base service's date range. Then another overnight association record, running only on 2017-08-25, tries to find the matching associated schedules on 2017-08-26 (as it's overnight), and the schedule we've just split is returned, because the exclude day was moved forward to 2017-08-26.

The solution is to either not modify the dates of associated schedules until after all associations have been processed, or store them in a different index so they are not returned twice. For now I have moved them to another index, as I cannot see any situation where an associated schedule splits/joins twice. There are situations where the base schedule splits or joins twice. Probably worth looking into.
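The second option could look roughly like this (a sketch with illustrative names and a hypothetical splitWithBase(), not the actual dtd2mysql internals):

  interface Schedule { tuid: string /* calendar, stops, ... */ }
  declare function splitWithBase(baseTuid: string, s: Schedule): Schedule; // hypothetical

  const schedulesByTuid = new Map<string, Schedule[]>();  // original schedules
  const associatedByTuid = new Map<string, Schedule[]>(); // results of splits/joins

  function applyAssociation(baseTuid: string, assocTuid: string): void {
    // Match only against original schedules, so a schedule that was already
    // split (and had its dates shifted to the base's range) cannot be picked
    // up again by a later overnight association record.
    for (const schedule of schedulesByTuid.get(assocTuid) ?? []) {
      const split = splitWithBase(baseTuid, schedule);
      const bucket = associatedByTuid.get(assocTuid) ?? [];
      bucket.push(split);
      associatedByTuid.set(assocTuid, bucket);
    }
  }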

linusnorton commented 7 years ago

I'm not very happy with the quality of the code for the fix, but it seems to work. Please verify when you can.

mk-fg commented 7 years ago

The solution is to either not modify the dates of associated schedules until after all associations have been processed, or store them in a different index

I guess I've managed to side-step this issue by choosing to stick with iterators, where processed schedules are just yielded through and can never be reused that way.
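Roughly this pattern, sketched in TypeScript (not the actual open-track-dtd2mysql-gtfs code; merge() is hypothetical):

  interface Schedule { tuid: string }
  interface Association { baseTuid: string; assocTuid: string }
  declare function merge(s: Schedule, a: Association): Schedule; // hypothetical

  // Schedules flow through the generator and are yielded exactly once, so an
  // already-processed schedule can never be matched by a later association.
  function* applyAssociations(
    schedules: Iterable<Schedule>,
    byAssocTuid: Map<string, Association>,
  ): Generator<Schedule> {
    for (const schedule of schedules) {
      const assoc = byAssocTuid.get(schedule.tuid);
      yield assoc ? merge(schedule, assoc) : schedule;
    }
  }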

There are situations where the base schedule splits or joins twice. Probably worth looking into.

I have only one such case (multiple base_uid's for the same assoc_uid) in my data, and it's not a problem, because all these base_uid's have no schedules and appear to be Eurostar trains or something like that, which I suspect is why they fall through the cracks in whatever generates these records.

To avoid having this kind of "complex graph" worry, I have two sanity checks in my code, raising an error:

- If same train_uid is used as both base_uid and assoc_uid among all association records.
- If there are multiple base_uid's for same assoc_uid's (also among all records).

As long as neither of these gets triggered (and they don't for my data), it should be the simple "some trains are base_uid's, with a bunch of others joining/splitting to/from them" case, with no complex A-to-B-to-C association graphs or anything like that.

Such checks are cheap too, given that there are only ~2K association records anyway, so maybe worth having here as well.
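In TypeScript terms the two checks would be something like this (the record shape is assumed):

  interface AssocRecord { baseUid: string; assocUid: string /* dates, days */ }

  function checkAssociations(records: AssocRecord[]): void {
    const baseUids = new Set(records.map(r => r.baseUid));
    const baseByAssoc = new Map<string, string>();
    for (const r of records) {
      // Check 1: no train_uid appears as both base_uid and assoc_uid.
      if (baseUids.has(r.assocUid))
        throw new Error(`${r.assocUid} used as both base_uid and assoc_uid`);
      // Check 2: no assoc_uid has more than one distinct base_uid.
      const prev = baseByAssoc.get(r.assocUid);
      if (prev !== undefined && prev !== r.baseUid)
        throw new Error(`multiple base_uids for ${r.assocUid}`);
      baseByAssoc.set(r.assocUid, r.baseUid);
    }
  }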

mk-fg commented 7 years ago

If there are multiple base_uid's for same assoc_uid's (also among all records)

Correction: multiple different base_uid's for the same assoc_uid and date.

I don't think there should be a problem in either implementation if a train joins/splits from one base_uid on some days and another base_uid on others.

mk-fg commented 7 years ago

There are situations where the base schedule splits or joins twice. Probably worth looking into.

With the latest timetable data from ATOC (ttis639.zip), I bumped into another case of this (it tripped a sanity check):

cif associations:
  C80455 C81159     356 P VV - NOT 2017-05-27 2017-12-09 .....6.
  ...
  C81159 C81585     506 N JJ - NOT 2017-06-24 2017-10-07 .....6.
  ...

Here the simple "Same train_uid is never used as both base_uid and assoc_uid among all association records [on the same day]" rule from above doesn't hold, but C81585 has no schedules, same as in the "multiple base_uid for same train" case.

So there should be no issues when processing these with the current approach either; I just wanted to note that such cases apparently exist as well.
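For what it's worth, the first check could presumably be relaxed to tolerate such chains when the downstream train has no schedules, along these lines (my own sketch; hasSchedules() is a hypothetical lookup):

  interface AssocRecord { baseUid: string; assocUid: string }
  declare function hasSchedules(trainUid: string): boolean; // hypothetical

  // Relaxed check: an A -> B -> C chain is tolerated as long as every C
  // (the train hanging off the chained base) has no schedules of its own,
  // as with C80455 -> C81159 -> C81585 above.
  function checkChains(records: AssocRecord[]): void {
    const assocsByBase = new Map<string, string[]>();
    for (const r of records)
      assocsByBase.set(r.baseUid, [...(assocsByBase.get(r.baseUid) ?? []), r.assocUid]);
    for (const r of records)
      for (const uid of assocsByBase.get(r.assocUid) ?? [])
        if (hasSchedules(uid))
          throw new Error(`association chain ${r.baseUid} -> ${r.assocUid} -> ${uid}`);
  }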