observatoire-mobilite / odmkraken

The **kraken** is the orchestration layer responsible for gathering and postprocessing ODM's mobility data.
MIT License

Parallel import of multiple ICTS files fails #4

Closed gilgeorges closed 1 year ago

gilgeorges commented 2 years ago

Launching several ICTS jobs at the same time results in a failure in adjust_date_format, presumably because one job starts modifying the raw_data table while it is already being modified by another. Observed error messages were:

psycopg2.errors.DuplicateColumn: column "zeit" of relation "raw_data" already exists
CONTEXT:  SQL statement "alter table "vehdata".raw_data add column zeit timestamp"
PL/pgSQL function vehdata.adjust_format(character varying,integer) line 3 at SQL statement

And:

psycopg2.errors.UndefinedColumn: column "zeit" does not exist
LINE 6:             min(zeit) as first_seen,
                        ^
QUERY:  with new_vehicles as (
    insert into "vehdata".vehicles (code, plate, first_seen, original_code)
        select
            substring("FAHRZEUG" from '(\d+)(?: *- *(?:[A-Z]+ *\d+|\w+))?')::integer,
            regexp_replace(substring("FAHRZEUG" from '\d+(?: *- *([A-Z]+ *\d+|\w+))?'), '([A-Z]{2}) *(\d+)', '\1\2'),
            min(zeit) as first_seen,
            "FAHRZEUG" as original_code
        from "vehdata".raw_data
        where "FAHRZEUG" is not null --and nextval('"vehdata".prog_meter') != 0
        group by "FAHRZEUG"
    on conflict (code, plate) do nothing
    returning id as veh_id, code as veh_code, plate as veh_plate
    ) select * from new_vehicles
CONTEXT:  PL/pgSQL function vehdata.extract_vehicles() line 3 at RETURN QUERY
gilgeorges commented 2 years ago

Diagnosis: this is most likely the result of using a single hard-coded staging table. If so, there are two possible solutions:

  1. The quick fix: allow only one import at a time
  2. The long-term solution: use a dynamic, per-job staging table name
gilgeorges commented 1 year ago

Just ran two import jobs in parallel. The first finished after about 1h without the second one crashing - it is still running. So parallel imports are definitely possible.

However, import times were much longer than usual. Apart from the parallel imports themselves, this might be due to:

  1. My MacBook I tested this on being low on resources, especially free disk space, and this scarcity worsening over the course of the parallel imports
  2. The database itself growing, and thus import routines potentially slowing down - although this just concerns the insert statements, which did not seem to slow down before, but this is testable.

So as not to forget about this issue, I am opening #36