pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Improve error message when trying to pass numbers (int, float) as schema column names #19507

Open linamnt opened 1 month ago

linamnt commented 1 month ago

Description

I would like to set the column names of a DataFrame from a list of numbers designating the time bins of the data (e.g. for a histogram). This does not currently seem possible; instead, I have to cast the values to strings for this to work. This is due to a line of code (shown in the traceback below) that assumes non-string column names are indexable. See below for an example. Or perhaps better error messaging...

import polars as pl
import numpy as np

trial_start_time = -0.05
trial_end_time = 0.05
time_bins = np.arange(trial_start_time, trial_end_time, 0.01)
arr = np.zeros([2, len(time_bins)-1])

# does not work: float column names raise an IndexError
df = pl.DataFrame(arr, schema=list(np.round(time_bins[:-1], 3)))
# does work: column names are cast to str first
df = pl.DataFrame(arr, schema=[f"{t}" for t in np.round(time_bins[:-1], 3)])

I get the following error, indicating that when the column name is not a string, it tries to index the column name and take the first element, which of course can't be done with a float.

/usr/local/lib/python3.10/dist-packages/polars/_utils/construction/dataframe.py in _unpack_schema(schema, schema_overrides, n_expected, lookup_names)
    230             col = f"column_{i}" if unnamed else col
    231         else:
--> 232             col = col[0]
    233         column_names.append(col)
    234

IndexError: invalid index to scalar variable.

Thanks for your consideration!
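As a rough illustration of the "better error messaging" idea, the sketch below validates schema entries up front and raises a descriptive `TypeError` instead of letting `col[0]` fail with an `IndexError`. The function name `validate_schema_names` is hypothetical and not part of polars. The two accepted shapes (a plain string, or a pair whose first element is a string) are an assumption inferred from the traceback above, not a description of what polars actually accepts.

```python
def validate_schema_names(schema):
    """Hypothetical helper (not polars API): extract column names or fail clearly."""
    names = []
    for i, entry in enumerate(schema):
        if isinstance(entry, str):
            # Plain column name.
            names.append(entry)
        elif isinstance(entry, (tuple, list)) and entry and isinstance(entry[0], str):
            # Assumed (name, dtype)-style pair: the name is the first element.
            names.append(entry[0])
        else:
            # Anything else (int, float, NumPy scalar, ...) gets an explicit message
            # rather than an IndexError from indexing a scalar.
            raise TypeError(
                f"schema entry {i} must be a str or a (name, dtype) pair, "
                f"got {type(entry).__name__}: {entry!r}"
            )
    return names
```

With a check of this kind, passing `[-0.05, -0.04]` as a schema would report the offending entry and its type rather than "invalid index to scalar variable".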

MarcoGorelli commented 1 month ago

thanks for the request - I don't think there's any chance of non-string column names being allowed, but

> Or perhaps better error messaging...

seems addressable

orlp commented 4 weeks ago

There's 0 chance we'll allow anything except strings as column names. I'll edit the title to state the error message should be improved.

linamnt commented 4 weeks ago

Just to clarify, I was not suggesting that column names be anything other than strings, but that arrays of numbers or other lists of items that can be cast to str could be converted to strings in the background.

But either way, seems the best solution is for the error message to be more informative! Thanks!
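Until the error message is improved, the conversion can be done on the caller's side. The following is a minimal sketch, assuming labels only need `str()` applied to them. The helper name `stringify_labels` is hypothetical, and the duplicate check is an added precaution (rounded floats can collide), not something the thread asks for.

```python
def stringify_labels(labels):
    """Hypothetical helper: turn numeric (or other) labels into string column names."""
    names = [str(label) for label in labels]
    # Rounded bin edges may map to the same text, so fail early on duplicates.
    seen = set()
    for name in names:
        if name in seen:
            raise ValueError(f"duplicate column name after conversion: {name!r}")
        seen.add(name)
    return names
```

The result can then be passed as `schema=stringify_labels(...)`, which corresponds to the "does work" variant in the original report.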