Closed — pipliggins closed this 1 week ago
There is an issue with this mode and polars/parquet. If I amend `parse_adtl()` to read/write using CSV then it works fine, but using the parquet interface it breaks as follows. I would consider this non-critical in the short term since we can use CSV, but it is worth investigating further or recording as an issue, unless you know how to fix it now:
I tried the parser on D1. You get better error reporting by calling the parser directly; I added the following code for quick testing:

```python
import pandas as pd

if __name__ == "__main__":
    df = pd.read_excel("data.xlsx")
    print(parse(df))
```
The error is (some filenames removed):
```
[---] parsing tmpfgbmk_ow.csv: 4217it [00:00, 14305.54it/s]
Traceback (most recent call last):
  File "---", line 14, in <module>
    print(parse(df))
          ^^^^^^^^^
  File "---", line 10, in parse
    return parse_adtl(df, spec_file, ["linelist"])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "---/InsightBoard/src/InsightBoard/parsers/__init__.py", line 40, in parse_adtl
    parsed.write_parquet(table_name, parsed_temp_file.name)
  File "---/python3.12/site-packages/adtl/__init__.py", line 961, in write_parquet
    df = pl.DataFrame(data, infer_schema_length=len(data))
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "---/python3.12/site-packages/polars/dataframe/frame.py", line 374, in __init__
    self._df = sequence_to_pydf(
               ^^^^^^^^^^^^^^^^^
  File "---/python3.12/site-packages/polars/_utils/construction/dataframe.py", line 460, in sequence_to_pydf
    return _sequence_to_pydf_dispatcher(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "---/python3.12/functools.py", line 907, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "---/python3.12/site-packages/polars/_utils/construction/dataframe.py", line 712, in _sequence_of_dict_to_pydf
    pydf = PyDataFrame.from_dicts(
           ^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: could not append value: ["crust"] of type: list[str] to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`
it might also be that a value overflows the data-type's capacity
```
@pipliggins It can also sometimes return "No matches found for: 'Non'" (for example) instead of just 'Non', i.e. this is what appears in the data table.
@jsbrittain pushed a commit that changes this behaviour
> polars.exceptions.ComputeError: could not append value: ["crust"] of type: list[str] to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`
Paraphrasing from Teams: this happens because when unmapped values are returned, they are often the 'wrong' type according to the schema. For example, 'eight' cannot be converted to a float value, so when it is returned unconverted a string ends up in the supposedly float-typed age_years column. Parquet requires each column to be of a single type, hence the error. We will therefore use the original CSV format to pass data from adtl into InsightBoard; this will require some re-working of the validation against a JSON schema outside of adtl (and therefore outside the scope of this PR/repo).
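CSV tolerates this because it has no column types: every cell is text, so a mixed column round-trips without error. A minimal sketch using the stdlib `csv` module (column names and values are illustrative, not adtl's actual output):

```python
import csv
import io

# Hypothetical rows as a parser might emit them with returnUnmatched on:
# one row has a numeric age, another carries the unconverted string 'eight'.
rows = [
    {"subject_id": "S1", "age_years": 42.0},
    {"subject_id": "S2", "age_years": "eight"},  # unmapped value, 'wrong' type
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["subject_id", "age_years"])
writer.writeheader()
writer.writerows(rows)

# Everything comes back as text, so the mixed-type column is preserved
# and can be validated against a JSON schema downstream.
buf.seek(0)
back = list(csv.DictReader(buf))
print(back[1]["age_years"])  # -> eight
```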
Looks good. I haven't tested it extensively, but the 'no matches found' error appears to have been resolved in the datatable, and we're aware of the parquet issue now (perhaps add a note that exporting parquet with `returnUnmatched` set to `true` is likely to fail?)
Thanks! I've added a note about it to the specification file, but I'm planning to add a check to the CLI for parquet + returnUnmatched before merging.
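A minimal sketch of what such a pre-flight check could look like (the function name and message are hypothetical, not adtl's actual CLI code):

```python
# Hypothetical guard; names are illustrative, not adtl's real API.
def check_output_options(output_format: str, return_unmatched: bool) -> None:
    """Refuse the parquet + returnUnmatched combination up front."""
    if output_format == "parquet" and return_unmatched:
        raise SystemExit(
            "returnUnmatched can leave mixed-type columns, which parquet "
            "cannot store; use csv output instead"
        )

check_output_options("csv", True)       # fine: csv columns are untyped
check_output_options("parquet", False)  # fine: all values were converted
```

Failing fast at the CLI gives a clear message instead of the opaque `ComputeError` from deep inside polars.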
Add setting to return values when they aren't converted or mapped.