public-transport / gtfs-via-postgres

Process GTFS Static/Schedule by importing it into a PostgreSQL database.
https://github.com/derhuerst/gtfs-via-postgres#gtfs-via-postgres
Other
88 stars 17 forks source link

with routes.agency_id = null, t_departure is always null #45

Closed langbein-daniel closed 1 year ago

langbein-daniel commented 1 year ago

Sorry for confronting you with yet another issue/pull request message this weekend ^^


I noticed that in my arrivals_departures view, there are no t_departure values. The following statement gives 0 results:

SELECT date, departure_time, t_departure
FROM arrivals_departures
WHERE t_departure IS NOT NULL
LIMIT 100

As written in the README, this field is calculated based on the date and departure_time fields. In the code I saw, that the agency_timezone is used as well (if no stations or stop timezone is given).

My agency.txt has just one agency where agency_timezone is specified as "Europe/Berlin".

And I do have date and departure_time values in the arrivals_departures view:

SELECT date, departure_time, t_departure, route_short_name, stop_name
FROM arrivals_departures
WHERE route_short_name = 'S1'
ORDER BY date, departure_time
LIMIT 100
   1   │         date         | departure_time | t_departure | route_short_name |         stop_name          
   2   │ ---------------------+----------------+-------------+------------------+----------------------------
   3   │  2022-12-11 00:00:00 | 05:08:00       |             | S1               | Bamberg
   4   │  2022-12-11 00:00:00 | 05:13:00       |             | S1               | Strullendorf
   5   │  2022-12-11 00:00:00 | 05:16:00       |             | S1               | Hirschaid
   6   │  2022-12-11 00:00:00 | 05:19:00       |             | S1               | Buttenheim
   7   │  2022-12-11 00:00:00 | 05:22:00       |             | S1               | Eggolsheim

Any ideas what could cause this? I am happy to provide more information. I used the latest Docker image if your program together with the postgis/postgis image.

derhuerst commented 1 year ago

My agency.txt has just one agency where agency_timezone is specified as "Europe/Berlin".

Are there other agencies in agency (which would require you to remove the NOT NULL contraint first)? Or is there only one agency?

Can you provide the feed, so that I can try to reproduce this?

langfingaz commented 1 year ago

There is only one agency. The feed can be downloaded here: https://www.vgn.de/opendata/GTFS.zip

derhuerst commented 1 year ago

This is a bug. Both arrivals_departures and connections assume that routes.agency_id is not NULL, because it expects to be able to JOIN agency on it.

I'm not sure yet what a good fix will look like; I will investigate the performance trade-offs involved with JOINing the first row if agency_id is NULL.

For now, a good workaround is to patch routes.agency_id before importing, e.g. using e.g. qsv/xsv.

derhuerst commented 1 year ago

@langbein-daniel Do you want to mentioned as a contributor? If you do, how? (This metadata will probably be included in various public archives for a long time.)

langbein-daniel commented 1 year ago

I am glad that I could help you find a bug! Many thanks for having a look and fixing it so quickly.

A mention as a contributor would be nice, but please don't make too much effort about it ;)

{
  "name" : "Daniel Langbein",
  "email" : "daniel@systemli.org",
}
langbein-daniel commented 1 year ago

Even if you have added a test case to detect an invalid agency.txt file (with two agencies but no agency_id), a valid agency.txt with just one entry but no agency_id does currently fail. One sees the new error message The GTFS spec allows routes.agency_id to be empty/null only if there is exactly one agency in the feed..


I added console.log('NumAgencies: ' + workingState.nrOfRowsByName.get('agency')) just before the new check https://github.com/public-transport/gtfs-via-postgres/blob/02d307b66cd6fc1768ffe5f8c8a9176f96a3c0c1/lib/routes.js#L299 and it prints:

NumAgencies: undefined

Here is a screenshot of the IntelliJ debugger:

image

So I think the problem is that agency.txt has not yet been read. This is probably easy to correct, but I don't know this project well enough to suggest a fix.

Note: I tested this with the current git version of this project, again with the same GTFS feed of the VGN.

derhuerst commented 1 year ago

[…] a valid agency.txt with just one entry but no agency_id does currently fail. One sees the new error message The GTFS spec allows routes.agency_id to be empty/null only if there is exactly one agency in the feed..

On main (21aad03), I cannot reproduce this problem, except when running with --routes-without-agency-id.

The latter should of course work too, lib/routes.js is missing a !opt.routesWithoutAgencyId check. Will push another fix for this.


I added console.log('NumAgencies: ' + workingState.nrOfRowsByName.get('agency')) just before the new check […] and it prints NumAgencies: undefined. So I think the problem is that agency.txt has not yet been read. […]

This happens because gtfs-via-postgres determines the order of files to be processed based on each file's dependencies on others (e.g. stop_times on trips). But if --routes-without-agency-id is passed (and opt.routesWithoutAgencyId is therefore true), lib/deps.js does not specify this dependency, causing agency to be processed after routes, causing the workingState.nrOfRowsByName Map not to have an agency entry.

https://github.com/public-transport/gtfs-via-postgres/blob/21aad0358a7ed2abe57c07cce5bd6113786f2318/lib/deps.js#L30


I have added the --routes-without-agency-id flag in e97a3fde1052dc452562afecaea1ebd3bb455eae, but I don't remember why, given that both files agency and routes are required by the GTFS spec.

Maybe as a cheap (and half-broken) workaround around the exact use case you've brought up in this Issue? Does this mean that we can remove --routes-without-agency-id, because there is now a proper logic for empty routes.agency_id in place?