public-transport / gtfs-via-postgres

Process GTFS Static/Schedule by importing it into a PostgreSQL database.
https://github.com/derhuerst/gtfs-via-postgres#gtfs-via-postgres
Other
88 stars 17 forks source link

ShapeId is not a unique identifier #20

Closed adrianleh closed 2 years ago

adrianleh commented 2 years ago

When importing a dataset where trips have shapes that are referenced the following generated line the in SQL causes an error

FOREIGN KEY (shape_id) REFERENCES shapes (shape_id),

since shape_id in the table shapes is defined as regular non-unique TEXT

CREATE TABLE shapes (
    id SERIAL PRIMARY KEY,
    shape_id TEXT,
    shape_pt_sequence INT,
    shape_pt_loc geography(POINT),
    shape_dist_traveled REAL
);

which results in the following error upon generation

ERROR:  there is no unique constraint matching given keys for referenced table "shapes"
STATEMENT:  CREATE TABLE trips (
    trip_id TEXT PRIMARY KEY,
    route_id TEXT NOT NULL,
    FOREIGN KEY (route_id) REFERENCES routes,
    service_id TEXT NOT NULL, 
    trip_headsign TEXT,
    trip_short_name TEXT,
    direction_id INT,
    block_id TEXT,
    shape_id TEXT,
    FOREIGN KEY (shape_id) REFERENCES shapes (shape_id),
    wheelchair_accessible wheelchair_accessibility,
    bikes_allowed bikes_allowance
);

When using the --trips-without-shape-id it resolves the error, however does not generate the shape table at all which is not desirable.

derhuerst commented 2 years ago

The problem is that each row in a GTFS shapes.txt file does not have a unique ID; Rather, a shape (including its unique ID) is formed by combining all rows with an equal shape_id.

Therefore, I cannot make shape_id unique as long as I import shapes.txt as-is row-by-row. So I'm not sure how to fix this.

When using the --trips-without-shape-id it resolves the error, however does not generate the shape table at all [...].

I cannot reproduce this.

wget -r --no-parent --no-directories --compression auto -R .csv.gz -P vbb-2022-03-25 -N 'https://vbb-gtfs.jannisr.de/2022-03-25/'
gtfs-to-sql -d --trips-without-shape-id vbb-2022-03-25/*.csv | sponge | psql -b

psql -c 'SELECT * FROM (SELECT DISTINCT ON (shape_id) shape_id, COUNT(shape_pt_sequence) OVER (PARTITION BY shape_id) AS nr_of_points FROM shapes) t ORDER BY nr_of_points DESC LIMIT 10'
#  shape_id | nr_of_points 
# ----------+--------------
#  2172     |         3547
#  2155     |         3539
#  2147     |         3451
#  2162     |         3424
#  2157     |         3423
#  2161     |         3423
#  2169     |         3327
#  11748    |         3283
#  11798    |         3283
#  11516    |         3265
# (10 rows)
adrianleh commented 2 years ago

I am using the following dataset: https://www.transitchicago.com/downloads/sch_data/google_transit.zip Note that you might need to manually fix agency.txt to include an agency_id

derhuerst commented 2 years ago

When using the --trips-without-shape-id it resolves the error, however does not generate the shape table at all [...].

I cannot reproduce this.

I am using the following dataset: https://www.transitchicago.com/downloads/sch_data/google_transit.zip

Still can't reproduce.

Are you directly piping gtfs-to-sql into psql? Because there's a weird (but unrelated) bug, with sponge in between, it works.

adrianleh commented 2 years ago

The exact command I am using is export PGHOST=localhost; export PGPORT=5432; export PGUSER={USER}; export PGPASSWORD={PASSWORD}; export PGDATABASE={DB}; npm exec -- gtfs-to-sql --require-dependencies --routes-without-agency-id -s -- gtfs/*.txt | psql -b (with the dataset above)

derhuerst commented 2 years ago

Can you try if the error is gone with sponge in between?

adrianleh commented 2 years ago

Same issue persists npm exec -- gtfs-to-sql --require-dependencies --routes-without-agency-id -s -- gtfs/*.txt | sponge GTFS.sql (set up psql variables) psql -b -f GTFS.sql

Would it help if I uploaded the resulting file? I'll warn you though it's ~750MB

For reference using latest docker image postgis/postgis for testing which is postgres 14.2 under the hood

derhuerst commented 2 years ago

Would it help if I uploaded the resulting file? I'll warn you though it's ~750MB

Maybe. Can you upload it gzipped, or at least zipped?

adrianleh commented 2 years ago

GTFS.sql.7z.002.gz GTFS.sql.7z.001.gz GTFS.sql.7z.006.gz GTFS.sql.7z.005.gz GTFS.sql.7z.004.gz GTFS.sql.7z.003.gz

Here you go. Since you can't upload files larger than 10MB and are then restricted to file types, I 7zipped into multiple files and then gzipped each of them (7z extensions are not allowed).

So to use it you unzip each file and then extract the part files with 7z

You can use the commands below

7z x GTFS.sql.7z.001.gz
7z x GTFS.sql.7z.002.gz
7z x GTFS.sql.7z.003.gz
7z x GTFS.sql.7z.004.gz
7z x GTFS.sql.7z.005.gz
7z x GTFS.sql.7z.006.gz
7z x GTFS.sql.7z.001
adrianleh commented 2 years ago

Any updates on this?

derhuerst commented 2 years ago

a1304a9 changes the constraint in trips to check if at least one shapes row with that shape_id exists.

derhuerst commented 2 years ago

Fix published as gtfs-via-postgres@4.2.0! 🎉