nextstrain / flu_frequencies

Flu clade and mutation frequencies
https://flu-frequencies.vercel.app

Unexpected polars panic exception when creating day count column in `fit_single_frequencies.py` #34

Closed by huddlej 9 months ago

huddlej commented 9 months ago

Current Behavior

When the flu frequencies workflow reaches the `fit_single_frequencies.py` step, recent versions of polars panic with the following error message:

```
$ python scripts/fit_single_frequencies.py \
    --metadata data/vic/combined_na.tsv \
    --geo-categories region \
    --frequency-category clade \
    --min-date 2021-01-01 \
    --days 14 \
    --inclusive-clades flu \
    --output-csv results/vic_na/region-frequencies.csv
thread '<unnamed>' panicked at crates/polars-core/src/series/iterator.rs:74:9:
assertion `left == right` failed: impl error
  left: 4
 right: 1
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
--- PyO3 is resuming a panic after fetching a PanicException from Python. ---
Python stack trace below:
Traceback (most recent call last):
  File "/Users/jlhudd/miniconda3/envs/flu_frequencies/lib/python3.10/site-packages/polars/expr/expr.py", line 3976, in __call__
    result = self.function(*args, **kwargs)
  File "/Users/jlhudd/miniconda3/envs/flu_frequencies/lib/python3.10/site-packages/polars/expr/expr.py", line 4299, in wrap_f
    return x.map_elements(
  File "/Users/jlhudd/miniconda3/envs/flu_frequencies/lib/python3.10/site-packages/polars/series/series.py", line 5270, in map_elements
    self._s.apply_lambda(function, pl_return_dtype, skip_nulls)
pyo3_runtime.PanicException: assertion `left == right` failed: impl error
  left: 4
 right: 1
Traceback (most recent call last):
  File "/Users/jlhudd/projects/nextflu-reports/who-2024-02/flu_frequencies/scripts/fit_single_frequencies.py", line 163, in <module>
    data, totals, counts, time_bins = load_and_aggregate(d, args.geo_categories, freq_cat,
  File "/Users/jlhudd/projects/nextflu-reports/who-2024-02/flu_frequencies/scripts/fit_single_frequencies.py", line 44, in load_and_aggregate
    d = d.with_columns([pl.col('date').map_elements(lambda x: to_day_count(x, start_date)).alias("day_count")])
  File "/Users/jlhudd/miniconda3/envs/flu_frequencies/lib/python3.10/site-packages/polars/dataframe/frame.py", line 8270, in with_columns
    return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
  File "/Users/jlhudd/miniconda3/envs/flu_frequencies/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1730, in collect
    return wrap_df(ldf.collect())
pyo3_runtime.PanicException: assertion `left == right` failed: impl error
  left: 4
 right: 1
```

I don't see any obvious changes to our input data between when this used to work and now. Downgrading polars to 0.20.3 allows the frequencies script to run without an error, suggesting that the issue first appeared in polars 0.20.4 (released Jan 12, 2024). This is all with Python 3.10.13 on an Intel Mac (macOS 12.6).

I confirmed that the error occurs only when the `map_elements` part of the failing expression above is evaluated.

Possible solutions

As a band-aid, we could pin polars to 0.20.3 in the Conda environment.

As a longer-term solution, we might try to replace the officially discouraged map_elements call with a different approach.

Or we could switch to pandas.
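To give a sense of the pandas option, the day count step alone might look roughly like the sketch below. The same assumptions apply: `day_count` is taken to mean days elapsed since the start date, the `date` column is assumed to hold ISO `YYYY-MM-DD` strings, and `add_day_count_pandas` is a hypothetical name. The rest of the script would need porting too, which this does not cover.

```python
import datetime

import pandas as pd


def add_day_count_pandas(df: pd.DataFrame, start_date: datetime.date) -> pd.DataFrame:
    """Return a copy of `df` with a `day_count` column (days since start_date).

    Assumption: `date` holds YYYY-MM-DD strings. Values that do not parse
    are coerced to NaT, so their day count ends up as NaN.
    """
    dates = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")
    # Subtracting a Timestamp gives a timedelta Series; .dt.days extracts whole days.
    day_count = (dates - pd.Timestamp(start_date)).dt.days
    return df.assign(day_count=day_count)


# Hypothetical example data, not taken from the real metadata file.
example = pd.DataFrame({"date": ["2021-01-01", "2021-01-15", "2021-XX-XX"]})
result = add_day_count_pandas(example, datetime.date(2021, 1, 1))
print(result)
```

Note that when any date fails to parse, pandas stores the column as floats so that it can hold NaN.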

rneher commented 9 months ago

The pin works for now. I am open to rethinking the polars experiment :)