observatoire-mobilite / odmkraken

The **kraken** is the orchestration layer responsible for gathering and postprocessing ODM's mobility data
MIT License

`extract_from_csv` fails with `IndexError` on `infer_format` #5

Closed gilgeorges closed 1 year ago

gilgeorges commented 1 year ago

File 20201216180001_LUXGPS.csv.zip produced the following error message:

The above exception was caused by the following exception:
IndexError: list index out of range
  File "C:\srv\odm-py\0.0.3\lib\site-packages\dagster\_core\execution\plan\utils.py", line 47, in solid_execution_error_boundary
    yield
  File "C:\srv\odm-py\0.0.3\lib\site-packages\dagster\_utils\__init__.py", line 421, in iterate_with_context
    next_output = next(iterator)
  File "C:\srv\odm-py\0.0.3\lib\site-packages\dagster\_core\execution\plan\compute_generator.py", line 73, in _coerce_solid_compute_fn_to_iterator
    result = fn(context, **kwargs) if context_arg_provided else fn(**kwargs)
  File "C:\srv\odm-py\0.0.3\lib\site-packages\odmkraken\busspeeds\extract.py", line 53, in extract_from_csv
    format = infer_format(handle)
  File "C:\srv\odm-py\0.0.3\lib\site-packages\odmkraken\busspeeds\extract.py", line 162, in infer_format
    datum = lines[1].split(format['sep'])[i_datum]
gilgeorges commented 1 year ago

Analysis

Turns out C:\srv\upload\20201216180001_LUXGPS.csv.zip has a size of 99 bytes (reported as 1 kB by Windows Explorer ?!?) and contains only the header line:

TYP;DATUM;SOLLZEIT;ZEIT;FAHRZEUG;LINIE;UMLAUF;FAHRT;HALT;LATITUDE;LONGITUDE;EINSTEIGER;AUSSTEIGER

Diagnosis

This is a logic problem: the code expects a file to either contain data in the right format or be malformed. Thus it reads three lines of text from the presumed CSV file and checks for a header and the proper data format. It turns out that the instruction used to read those three lines of text does not fail even if the file contains fewer than three lines:

lines = [handle.readline().decode('utf-8').strip() for i in range(3)]

The file supplied here contains a proper header line, so the format check passes. When it then goes to access lines[1] to check the date format on the first line of data, it fails with a very unintuitive IndexError.

Observation

Fix

gilgeorges commented 1 year ago

It turns out that calling readline repeatedly on an empty file returns an empty string each time. My diagnosis might thus be wrong, unless the behavior on a ZipFile is somehow different. In that context, I noticed an oddity: the lines read are decoded to str explicitly, which suggests the handle is read in binary mode?!?
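To check the ZipFile question, here is a minimal sketch using the standard `zipfile` module. The archive is built in memory, and the member name `header_only.csv` and the shortened header are assumptions made for illustration, not the actual upload. It suggests that a handle returned by `ZipFile.open` yields bytes (hence the explicit decode) and that `readline` simply returns an empty value once the member is exhausted:

```python
import io
import zipfile

# Shortened header, for illustration only (not the full LUXGPS header).
HEADER = b"TYP;DATUM;SOLLZEIT\n"

# Build an in-memory zip archive holding a header-only CSV.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("header_only.csv", HEADER)  # hypothetical member name

# Read three lines the same way as the line quoted in the diagnosis.
with zipfile.ZipFile(buffer) as archive:
    with archive.open("header_only.csv") as handle:
        first_raw = handle.readline()  # bytes, not str
        rest = [handle.readline() for _ in range(2)]
        lines = [raw.decode("utf-8").strip() for raw in [first_raw, *rest]]

print(type(first_raw).__name__)  # bytes
print(lines)                     # ['TYP;DATUM;SOLLZEIT', '', '']
```

So the list always has three entries, and `lines[1]` exists even for a header-only file; it is just an empty string.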

gilgeorges commented 1 year ago

Seems I am right the second time: the IndexError is rather due to accessing the i_datum-th field of an empty data line.
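A small sketch of that failure, under assumptions: the way `i_datum` is derived here (position of `DATUM` in the header) and the guard function `read_first_datum` are hypothetical and are not taken from `extract.py`. Splitting an empty string gives a one-element list, so any index above 0 is out of range:

```python
HEADER = ("TYP;DATUM;SOLLZEIT;ZEIT;FAHRZEUG;LINIE;UMLAUF;FAHRT;HALT;"
          "LATITUDE;LONGITUDE;EINSTEIGER;AUSSTEIGER")
SEP = ";"

# What three readline() calls yield for a header-only file.
lines = [HEADER, "", ""]

# Assumption: the DATUM column index is looked up from the header.
i_datum = lines[0].split(SEP).index("DATUM")

fields = lines[1].split(SEP)  # splitting "" gives [""], a single empty field
try:
    datum = fields[i_datum]
except IndexError as exc:
    error_message = str(exc)  # "list index out of range", as in the traceback


def read_first_datum(lines, sep, i_datum):
    """Hypothetical guard: fail with a clear message when there is no data row."""
    if len(lines) < 2 or not lines[1]:
        raise ValueError("file contains a header but no data rows")
    return lines[1].split(sep)[i_datum]
```

With such a check, a header-only upload would raise a `ValueError` naming the actual problem instead of an `IndexError` deep inside format inference. Whether to raise or to skip the file silently is a design choice this sketch does not settle.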

gilgeorges commented 1 year ago

No longer fails now, and the live-tests introduced with #24 should ensure that I see if it starts happening again.