ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
4.41k stars 544 forks source link

bug: inlining expressions leads to wrong results for non-pure functions #8921

Open NickCrews opened 2 months ago

NickCrews commented 2 months ago

What happened?

See this:

import ibis
import random

ibis.options.interactive = True

@ibis.udf.scalar.python(side_effects=True)
def my_random(x: float) -> float:
    return random.random()

t = ibis.memtable({"x": [1, 2, 3, 4, 5]})
# this happens with both the builtin random, and UDFs:
# t = t.mutate(y=ibis.random())
t = t.mutate(y=my_random(t.x))
t = t.mutate(size=(t.y > 0.5).ifelse("big", "small"))
t
# ┏━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┓
# ┃ x     ┃ y        ┃ size   ┃
# ┡━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
# │ int64 │ float64  │ string │
# ├───────┼──────────┼────────┤
# │     1 │ 0.208385 │ big    │
# │     2 │ 0.566183 │ big    │
# │     3 │ 0.836448 │ small  │
# │     4 │ 0.530235 │ small  │
# │     5 │ 0.526687 │ small  │
# └───────┴──────────┴────────┘

Notice that the size column is sometimes incorrect. This is because size is getting calculated from a totally uncorrelated RANDOM() in the generated SQL. ibis.to_sql(t) shows:

SELECT
  "t0"."x",
  RANDOM() AS "y",
  CASE WHEN RANDOM() > CAST(0.5 AS DOUBLE) THEN 'big' ELSE 'small' END AS "size"
FROM "ibis_pandas_memtable_m5jukuqv2rcrhkpg2vsdu6k5fe" AS "t0"

For this to work as expected, there can only be a single RANDOM() in the generated SQL. I think that the agressive inlining that we do to merge selects/mutates together is over-aggressive. After a short talk with @cpcloud, this inlining is only correct when the entire expression is pure. Either we need to somehow know when an expression is pure (most are, except for (some) UDFs, random(), maybe now(), and maybe others I'm not thinking of?), or just be conservative and only do this inlining when the subsequent selects/mutates are just reselecting/renaming columns with no other changes.

What version of ibis are you using?

main

What backend(s) are you using, if any?

I'm not sure if this is present on all backends, but I think so.

Relevant log output

No response

Code of Conduct

kszucs commented 2 months ago

We can add these impure types to https://github.com/ibis-project/ibis/blob/main/ibis/backends/sql/rewrites.py#L155 in order to prevent merging those selects.

NickCrews commented 2 months ago

That would work, if we are sure we can get all the types. What about ops.ArrayDistinct though? In duckdb that is an impure function, because the order of the result is not guaranteed. But in other backends it might be a pure function. Do we just have to be conservative and call that a blocking op?

NickCrews commented 2 months ago

ibis.uuid() is another impure expression to worry about.

What do you think is the counter-argument against only merging on re-selects? Is it just the uglier/more verbose SQL? Unless we have evidence that it is slower, I would want to lean towards the less-likely-to-be-incorrect route, even if the generated SQL is uglier for human consumption.

kszucs commented 2 months ago

The problem doesn't just occur in case of merging Select nodes. Consider the following case:

In [1]: import ibis

In [2]: uid = ibis.uuid()

In [3]: t = ibis.table({"a": "int", "b": "string"})

In [4]: t1 = t.select(t, uid)

In [7]: t1 = t.select(t, uid=uid)

In [8]: t2 = t1.select(uid)

In [9]: t2
Out[9]:
r0 := UnboundTable: unbound_table_0
  a int64
  b string

r1 := Project[r0]
  a:   r0.a
  b:   r0.b
  uid: RandomUUID()

Project[r1]
  RandomUUID(): r1.uid

In [10]: t3 = t1.select(ibis.uuid())

In [11]: t3
Out[11]:
r0 := UnboundTable: unbound_table_0
  a int64
  b string

r1 := Project[r0]
  a:   r0.a
  b:   r0.b
  uid: RandomUUID()

Project[r1]
  RandomUUID(): r1.uid

This "rewrite" happens at the API level as well during value dereferencing which we certainly would like to keep. In the first case we use the same variable so we mean the same value whereas in the latter we create a new uuid assuming that the user intention is to create a new one.

I think this issue can be resolved by better modelling of inpure expressions like uuid() and random() and making them actually unique by a virtual field, then we would need to mark those expressions solving both the dereferencing and Select merge cases:

class Unique(Value):
    _counter = itertools.count()
    uid: Optional[int] = None

    def __init__(self, uid, **kwargs):
        if uid is None:
            uid = next(self._counter)
        super().__init__(uid=uid, **kwargs)

class RandomUUID(Unique):
    dtype = dt.uuid

class RandomScalar(Unique):
    dtype = dt.float64
cpcloud commented 2 months ago

What's the issue with disabling the merging? Otherwise we'll be playing whack-a-mole with discovering impure functions.

kszucs commented 2 months ago

Disabling merging would solve the original problem in the issue but wouldn't solve the dereferencing one shown by me. My proposed solution would solve both.

NickCrews commented 1 month ago

Sorry I went on vacation for a few days, therefore radio silence.

I'm hesitant that the linked PR wasn't the right solution:

If I wanted y and z to be correlated, I would have referenced the parent column directly:

    impure = func()
    t = ibis.table({"x": "int64"}, name="t")
    t1 = t.mutate(y=impure)
    t2 = t1.mutate(z=t1.y.cast("string")) 
    expected = ops.Project(
        parent=t1,
        values={"x": t1.x, "y": t1.y, "z": t1.y.cast("string")},
    )

If I don't reference the parent table in subsequent selects, then I expect the columns to be uncorrelated:

    impure = func()
    t = ibis.table({"x": "int64"}, name="t")
    t1 = t.mutate(y=impure)
    t2 = t1.mutate(z=impure.cast("string"))
    expected = ops.Project(
        parent=t1,
        values={"x": t1.x, "y": t1.y, "z": impure.cast("string")},
    )

I think we need to rely on the user doing explicit .selects and .mutates to be the boundary locations where expressions are separated as correlated/uncorrelated.

NickCrews commented 1 month ago

The problem doesn't just occur in case of merging Select nodes. Consider the following case:

I see what you're saying here, that does look like a problem. Just to be sure here, what I would expect is for both cases to be something like the very simple

UnboundTable:
   RandomUUID(): RandomUUID()

(ie there is literally NO reference to the parent tables, because we only selected a single UUID expression, which doesn't depend on any parents).

Is this what you would expect too?

cpcloud commented 1 month ago

More significantly, the added test case isn't actually what I would expect.

Well, yeah :) See my comment here: https://github.com/ibis-project/ibis/pull/8967#issuecomment-2057192320

NickCrews commented 1 month ago

In this comment, @kszucs says:

# semantically this should produce two different columns
t.select(x=ibis.random(), y=ibis.random())
# semantically this should produce identical columns
rand = ibis.random()
t.select(x=rand, y=rand)

I think that both of these should actually produce two different columns. If you want to get two identical columns, I think you should have to explicitly re-select from a parent relation:

t.select(common=ibis.random()).select(x=_.common, y=_.common)

Consider this example (maybe I'm not understanding the implementation, correct me if so):


# @functools.cache()
def generate_noise(amplitude):
    return (ibis.random() - .5) * amplitude

def add_noise(t, columns: list[str]):
    return t.mutate(**{col:t[col] + generate_noise(100) for col in columns})

If there is a @functools.cache() on the helper function, then the id of the returned expression is the same, and so you get correlated noise in each of the columns. If there is no @functools.cache(), then the noise will be uncorrelated. I don't think this behavior should depend on such an implementation detail.

cpcloud commented 1 month ago

@kszucs Perhaps we should consider restricting impurity further, to effectively require a projection to get the same value.

I agree with Nick that t.select(x=rand, y=rand) should produce two random() invocations, and I don't think we should make assumptions about impurity based on whether two variables are the same.

A variable isn't the same as a column in a projection.

kszucs commented 1 month ago

I can accept that as the desired behaviour, though we need to prevent dereferencing in that case. Going to take a look.

NickCrews commented 1 month ago

This comment is maybe getting towards the use cases we need to support?