globaldothealth / adtl

Another data transformation language
https://adtl.readthedocs.io
MIT License

Add returnUnmatched to spec options #96

Closed: pipliggins closed this pull request 1 week ago

pipliggins commented 2 weeks ago

Add a setting to return values unchanged when they aren't converted or mapped.
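
To illustrate the intent, here is a minimal sketch of the behaviour the option enables. This is not adtl's implementation; the mapping and function names are made up for illustration only:

# Minimal sketch of the returnUnmatched idea, not adtl's actual code.
mapping = {"oui": True, "non": False}

def map_value(value, return_unmatched=False):
    if value in mapping:
        return mapping[value]
    # Default behaviour: an unmatched value is dropped (returned as None).
    # With returnUnmatched set, the raw value is passed through instead.
    return value if return_unmatched else None

assert map_value("oui") is True
assert map_value("peut-etre") is None
assert map_value("peut-etre", return_unmatched=True) == "peut-etre"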

jsbrittain commented 2 weeks ago

There is an issue with this mode and polars/parquet. If I amend parse_adtl() to read/write using CSV it works fine, but using the parquet interface it breaks as follows. I would consider this non-critical in the short term since we can use CSV, but it is worth investigating further or raising an issue, unless you know how to fix it now:

I tried the parser on D1 (you get better error reporting by calling the parser directly after adding the following code for quick testing):

import pandas as pd  # needed if the parser module doesn't already import pandas

if __name__ == "__main__":
    df = pd.read_excel("data.xlsx")
    print(parse(df))

The error is (some filenames removed):

[---] parsing tmpfgbmk_ow.csv: 4217it [00:00, 14305.54it/s]
Traceback (most recent call last):
  File "---", line 14, in <module>
    print(parse(df))
          ^^^^^^^^^
  File "---", line 10, in parse
    return parse_adtl(df, spec_file, ["linelist"])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "---/InsightBoard/src/InsightBoard/parsers/__init__.py", line 40, in parse_adtl
    parsed.write_parquet(table_name, parsed_temp_file.name)
  File "---/python3.12/site-packages/adtl/__init__.py", line 961, in write_parquet
    df = pl.DataFrame(data, infer_schema_length=len(data))
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "---/python3.12/site-packages/polars/dataframe/frame.py", line 374, in __init__
    self._df = sequence_to_pydf(
               ^^^^^^^^^^^^^^^^^
  File "---/python3.12/site-packages/polars/_utils/construction/dataframe.py", line 460, in sequence_to_pydf
    return _sequence_to_pydf_dispatcher(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "---/python3.12/functools.py", line 907, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "---/python3.12/site-packages/polars/_utils/construction/dataframe.py", line 712, in _sequence_of_dict_to_pydf
    pydf = PyDataFrame.from_dicts(
           ^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: could not append value: ["crust"] of type: list[str] to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

it might also be that a value overflows the data-type's capacity

jsbrittain commented 2 weeks ago

@pipliggins It can also sometimes return "No matches found for: 'Non'" (for example) instead of just 'Non', i.e. that is what appears in the data table.

pipliggins commented 1 week ago

> @pipliggins It can also sometimes return "No matches found for: 'Non'" (for example) instead of just 'Non', i.e. that is what appears in the data table.

@jsbrittain I've pushed a commit that changes this behaviour.

pipliggins commented 1 week ago

> There is an issue with this mode and polars/parquet. If I amend parse_adtl() to read/write using CSV it works fine, but using the parquet interface it breaks as follows. I would consider this non-critical in the short term since we can use CSV, but it is worth investigating further or raising an issue, unless you know how to fix it now:
>
> polars.exceptions.ComputeError: could not append value: ["crust"] of type: list[str] to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

Paraphrasing from Teams: this happens because, when unmapped values are returned, they are often the 'wrong' type according to the schema. For example, 'eight' cannot be converted to a float, so when it is returned unconverted a string ends up in the supposedly float-typed age_years column. Parquet requires each column to be of a single type, hence the error. We will therefore use the original CSV format to pass data from adtl into InsightBoard; this will require some re-working of the validation against a JSON schema outside of adtl (and therefore outside the scope of this PR/repo).
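
A minimal reproduction of that failure mode, mirroring the pl.DataFrame(data, infer_schema_length=len(data)) call shown in the traceback above. The column name follows the age_years example; the exact exception class may vary between polars versions:

import polars as pl

# Two parsed rows: one value converted to the schema's float type, one
# returned unmatched as a raw string (what returnUnmatched does).
rows = [
    {"age_years": 42.0},
    {"age_years": "eight"},
]

try:
    # polars needs every value in a column to share one type, so the
    # mixed float/str column is rejected.
    pl.DataFrame(rows, infer_schema_length=len(rows))
except Exception as exc:  # ComputeError/TypeError depending on polars version
    print(f"polars refused the mixed-type column: {exc}")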

pipliggins commented 1 week ago

> Looks good. I haven't extensively tested it, but the 'no matches found' error appears to have been resolved in the data table, and we're aware of the parquet issue now (perhaps add a note that exporting parquet with returnUnmatched set to true is likely to fail?)

Thanks! I've added a note to the specification file about it, but I'm planning on adding a check to the CLI for parquet+returnUnmatched before merging.
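
A rough sketch of what such a check could look like. The function name, parameters, and message are illustrative only, not adtl's actual CLI code:

import sys

def check_output_options(output_format: str, return_unmatched: bool) -> None:
    # Hypothetical guard: parquet needs single-typed columns, which
    # returnUnmatched cannot guarantee, so fail early with a clear message.
    if output_format == "parquet" and return_unmatched:
        sys.exit(
            "Error: the spec sets returnUnmatched, but parquet output requires "
            "consistently typed columns; write CSV output instead."
        )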