skinkie / reference

Personal repository where I collect working examples to understand inner workings while building PyNeTExConv
GNU Affero General Public License v3.0
0 stars 1 forks source link

loading of times not correct in duckdb #6

Closed ue71603 closed 2 months ago

ue71603 commented 2 months ago

Using https://transport.data.gouv.fr/datasets/agregat-oura the import stops, because a field is out of range

Traceback (most recent call last):
  File "C:\Users\ue71603\MG_Daten\github\reference\gtfs-netex-test\import.py", line 4, in <module>
    with duckdb.cursor(con) as cur:
  File "C:\Users\ue71603\MG_Daten\github\reference\gtfs-netex-test\import.py", line 25, in <module>
    cur.execute("""CREATE TABLE stop_times AS SELECT * FROM read_csv('gtfs/stop_times.txt', delim=',', header=true, auto_detect=true);""")
duckdb.duckdb.ConversionException: Conversion Error: CSV Error on Line: 18238
Error when converting column "arrival_time".
time field value out of range: "24:10:00", expected format is ([YYYY-MM-DD ]HH:MM:SS[.MS])

However, this is a legal GTFS value https://gist.github.com/derhuerst/574edc94981a21ef0ce90713f1cff7f6

skinkie commented 2 months ago

The cause is different. DuckDB probes the column als SQL TIME, since no explicit table definition is defined. It is not explicitly defined because the order of elements between GTFS files may differ. A discussion on the subject was started to provide the types for column headers and have column headers as must, or have types as optional.

Obviously you could read the first line of the Python file to establish which columns are present in what order, then create the table definition based on the order used in the file, but using the types defined by the standard.

ue71603 commented 2 months ago

When a file like shape.txt is missing then import.py aborts

skinkie commented 2 months ago

Not related to this issue.