pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Allow `lit` to create complex types #9516

Open stinodego opened 1 year ago

stinodego commented 1 year ago

The behaviour of the lit expression is questionable when trying to create complex types (List, Array, Struct).

Consider the following behaviour:

>>> pl.select(pl.lit([1, 2]))  # Trying to create a List literal from a Python list
shape: (2, 1)
┌─────┐
│     │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
└─────┘
>>> pl.select(pl.lit((1, 2)))  # Trying to create an Array literal from a Python tuple
shape: (2, 1)
┌─────┐
│     │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
└─────┘
>>> pl.select(pl.lit({"a": 1, "b": 2}))  # Trying to create a Struct literal from a Python dict
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/stijn/code/polars/py-polars/polars/functions/lazy.py", line 1277, in lit
    return wrap_expr(plr.lit(item, allow_object))
ValueError: could not convert value "{'a': 1, 'b': 2}" as a Literal

I would instead expect the following:

>>> pl.select(pl.lit([1, 2]))
shape: (1, 1)
┌───────────┐
│ literal   │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [1, 2]    │
└───────────┘
>>> pl.select(pl.lit((1, 2)))
shape: (1, 1)
┌───────────────┐
│ literal       │
│ ---           │
│ array[i64, 2] │
╞═══════════════╡
│ [1, 2]        │
└───────────────┘
>>> pl.select(pl.lit({"a": 1, "b": 2}))
shape: (1, 1)
┌───────────┐
│ literal   │
│ ---       │
│ struct[2] │
╞═══════════╡
│ {1,2}     │
└───────────┘

Is there a reason for the existing list behaviour? It seems very confusing.

If the new behaviour is acceptable, I'll see if I can implement this.

ritchie46 commented 1 year ago

A lit([1, 2, 3]) is not a Series of List<i64>, but a Series of i64.

To create a list literal, you must have the same nesting as we have in a list: lit([[1, 2, 3]]). Similar to how we create lists in our Series constructor.

There is ambiguity, as you can also create a Series literal with lit. This is done a lot with is_in.

I agree that it might be easier if we would consider that a list, but then we need to see what that breaks.

I am not sure a tuple should be an array though. 🤔

alexander-beedie commented 1 year ago

> I am not sure a tuple should be an array though

From the Python perspective, a tuple is merely an immutable list, so however they behave, they should probably behave the same 🤷

stinodego commented 1 year ago

> To create a list literal, you must have the same nesting as we have in a list: lit([[1, 2, 3]]). Similar to how we create lists in our Series constructor.

The input to lit should be a single element of the data in a Series constructor.

  1. lit(1) gives a column filled with the value 1 (int type)
  2. lit([1, 2]) gives a column filled with the value [1, 2] (list of ints type)
  3. lit([[1, 2]]) gives a column filled with the value [[1, 2]] (list of list of ints type)

The first one is already correct, but the second and third are currently off. They are interpreted to be Series and that leads to different (unexpected) results.
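
To make the proposed rule concrete, here is a minimal pure-Python sketch of the dtype each input would get under this proposal. The helper `proposed_literal_dtype` is hypothetical and not part of polars. It assumes non-empty, homogeneous containers, and it includes the tuple and dict cases from the opening post even though those are still under discussion.

```python
# Hypothetical helper, NOT part of polars: it only names the dtype that the
# proposal in this thread would assign to a Python value passed to `lit`.
def proposed_literal_dtype(value):
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "i64"
    if isinstance(value, float):
        return "f64"
    if isinstance(value, str):
        return "str"
    if isinstance(value, list):
        # Assumption: the list is non-empty and homogeneous, so the first
        # element determines the inner dtype.
        return f"list[{proposed_literal_dtype(value[0])}]"
    if isinstance(value, tuple):
        # Assumption from the opening post: a tuple maps to a fixed-size array.
        return f"array[{proposed_literal_dtype(value[0])}, {len(value)}]"
    if isinstance(value, dict):
        # A dict maps to a struct with one field per key.
        return f"struct[{len(value)}]"
    raise TypeError(f"unsupported literal type: {type(value).__name__}")


for example in (1, [1, 2], [[1, 2]], (1, 2), {"a": 1, "b": 2}):
    print(repr(example), "->", proposed_literal_dtype(example))
```

The point of the sketch is that every level of Python nesting adds exactly one level of dtype nesting, so the value passed to lit always describes a single element of the column.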

> There is ambiguity, as you can also create a Series literal with lit. This is done a lot with is_in.

We can keep the behaviour for lit(Series) for now. Changing that would indeed be a bigger breaking change.

Although I am not sure this makes sense either. In my mind, we have three types of expression columns:

  1. A reference to an existing column, i.e. col("a")
  2. A literal column of a single value that dynamically adapts its length based on its context, i.e. lit(1)
  3. A full column of values, i.e. lit(Series([1,2,3]))

Making lit responsible for both 2 and 3 makes things a little confusing. You could argue 3 should create a list-of-ints type literal.
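
The three kinds of expression columns can be modelled in a few lines of plain Python. This is only a sketch of the mental model: `ColumnRef`, `ScalarLiteral`, `FullColumn` and `materialize` are hypothetical names that do not exist in polars, and the length check on a full column is an assumption made for the illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


# Hypothetical classes for illustration only; none of these exist in polars.
@dataclass
class ColumnRef:
    name: str  # kind 1: reference to an existing column


@dataclass
class ScalarLiteral:
    value: Any  # kind 2: a single value that adapts its length to the context


@dataclass
class FullColumn:
    values: List[Any]  # kind 3: a full column of values


def materialize(expr, frame: Dict[str, List[Any]]) -> List[Any]:
    """Turn an expression into concrete values for a non-empty frame."""
    height = len(next(iter(frame.values())))
    if isinstance(expr, ColumnRef):
        return list(frame[expr.name])
    if isinstance(expr, ScalarLiteral):
        # The single value is repeated to match the height of the context.
        return [expr.value] * height
    if isinstance(expr, FullColumn):
        # Assumption: a full column must already match the context height.
        if len(expr.values) != height:
            raise ValueError("full column length does not match frame height")
        return list(expr.values)
    raise TypeError(f"unknown expression type: {type(expr).__name__}")


frame = {"a": [10, 20, 30]}
print(materialize(ColumnRef("a"), frame))
print(materialize(ScalarLiteral([1, 2]), frame))
print(materialize(FullColumn([1, 2, 3]), frame))
```

In this model a Python list passed as a scalar literal is one value repeated per row, while the same list passed as a full column supplies one value per row. That is the distinction lit currently blurs.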

> I am not sure a tuple should be an array though

A tuple is Python's standard fixed size collection, so I think it makes at least some amount of sense. But I'm open to alternatives.

evbo commented 1 year ago

@ritchie46 what is the equivalent recipe for creating a list literal for the Rust API currently?

Update: got it!

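// Collect the values into the inner Series (some_vec is assumed to hold f32 values).
let inner: Series = some_vec.iter().collect();
// Wrap the whole inner Series as a single list value.
let cont = AnyValue::List(inner);
// Build a one-row Series of dtype List(Float32); the final `true` is the `strict` flag.
let list: Series = Series::from_any_values_and_dtype("data", &[cont], &DataType::List(Box::new(DataType::Float32)), true).unwrap();
// Use the one-row list Series as a literal column named "y".
lf.with_column(lit(list).alias("y"))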

kgv commented 2 months ago

@ritchie46 Is there currently a less verbose way to create a list literal in Rust? Or is @evbo's example the most succinct?