observatoire-mobilite / odmkraken

The **kraken** is the orchestration layer responsible for gathering and postprocessing ODM's mobility data.
MIT License

Parallel import of multiple ICTS files fails #4

Closed gilgeorges closed 1 year ago

gilgeorges commented 2 years ago

Launching several ICTS jobs at the same time results in a failure in adjust_date_format, presumably because one job starts modifying the raw_data table while it is already being modified by another. Observed error messages were:

psycopg2.errors.DuplicateColumn: column "zeit" of relation "raw_data" already exists
CONTEXT:  SQL statement "alter table "vehdata".raw_data add column zeit timestamp"
PL/pgSQL function vehdata.adjust_format(character varying,integer) line 3 at SQL statement

And:

psycopg2.errors.UndefinedColumn: column "zeit" does not exist
LINE 6:             min(zeit) as first_seen,
                        ^
QUERY:  with new_vehicles as (
    insert into "vehdata".vehicles (code, plate, first_seen, original_code)
        select
            substring("FAHRZEUG" from '(\d+)(?: *- *(?:[A-Z]+ *\d+|\w+))?')::integer,
            regexp_replace(substring("FAHRZEUG" from '\d+(?: *- *([A-Z]+ *\d+|\w+))?'), '([A-Z]{2}) *(\d+)', '\1\2'),
            min(zeit) as first_seen,
            "FAHRZEUG" as original_code
        from "vehdata".raw_data
        where "FAHRZEUG" is not null --and nextval('"vehdata".prog_meter') != 0
        group by "FAHRZEUG"
    on conflict (code, plate) do nothing
    returning id as veh_id, code as veh_code, plate as veh_plate
    ) select * from new_vehicles
CONTEXT:  PL/pgSQL function vehdata.extract_vehicles() line 3 at RETURN QUERY
gilgeorges commented 2 years ago

Diagnosis: this is most likely the result of using a single hard-coded staging table. If so, there are two possible solutions:

  1. The quick fix: allow only one import at a time
  2. The long-term solution: use a dynamic, per-job staging table name
gilgeorges commented 1 year ago

Just ran two import jobs in parallel. The first finished after about 1h without the second one crashing - it is still running. So parallel imports are definitely possible.

However, import times were much longer than usual. Apart from the parallel imports themselves, this might be due to:

  1. My MacBook I tested this on being low on resources, especially free disk space, and this scarcity worsening over the course of the parallel imports
  2. The database itself growing, and thus import routines potentially slowing down - although this just concerns the insert statements, which did not seem to slow down before, but this is testable.

So as not to forget about this issue, I am opening #36