pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.21k stars 1.95k forks source link

pl.StringCache() is not re-entrant #15073

Open douglas-raillard-arm opened 7 months ago

douglas-raillard-arm commented 7 months ago

Checks

Reproducible example

import polars as pl

@pl.StringCache()
def f(x):
    if x == 0:
        return 'end'
    else:
        return f(x-1)
f(1)

Log output

Traceback (most recent call last):
  File "testpolars3.py", line 10, in main
    f(1)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "venv/lib/python3.8/site-packages/polars/string_cache.py", line 76, in __exit__
    del self._string_cache
AttributeError: _string_cache

Issue description

StringCache is documented as being usable as a decorator. However, it fails to handle recursive functions properly.

The sources of StringCache show the issue: https://github.com/pola-rs/polars/blob/4bc67a0d0f6c9a113fd6b231d0d9638e58407156/py-polars/polars/string_cache.py#L66

Here is a simplified version:

class StringCache:
    def __enter__(self):
        self._string_cache = PyStringCacheHolder()
        return self

    def __del__(self, *args):
        del self._string_cache

When entering the context manager multiple times, such as in this example:

cm = pl.StringCache()
with cm:
    with cm:
        pass

The self._string_cache attribute is overwritten multiple times, but then self._string_cache is deleted by __exit__ in the inner layer and therefore raises in the outer layers. This could be fixed by either keeping a count of each time __enter__ is entered, or by using a stack (list) of caches in self._string_cache, and only popping the last level in __exit__

Expected behavior

It should not raise any exception

Installed versions

``` --------Version info--------- Polars: 0.20.15 Index type: UInt32 Platform: Linux-5.15.0-92-generic-x86_64-with-glibc2.29 Python: 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0] ----Optional dependencies---- adbc_driver_manager: cloudpickle: connectorx: deltalake: fastexcel: fsspec: gevent: hvplot: 0.9.2 matplotlib: 3.7.4 numpy: 1.24.4 openpyxl: pandas: 2.0.3 pyarrow: 15.0.0 pydantic: pyiceberg: pyxlsb: sqlalchemy: xlsx2csv: xlsxwriter: ```
mcrumiller commented 7 months ago

I believe there used to be a counter, and was used with toggle_string_cache() back when that existed, and that was further deprecated since toggling off wouldn't actually toggle, but decrement the counter.

I agree that for context managers a counter should definitely be used here.

deanm0000 commented 7 months ago

I'm not sure if this is in the scope of polars to fix. It seems from a couple google searches that it might be a limitation on decorating recursive functions. A couple workarounds would be to do

@pl.StringCache()
def f(x):
    def _f(x):
        if x == 0:
            return 'end'
        else:
            return _f(x-1)
    return _f(x)

or

def f(x):
    if x == 0:
        return 'end'
    else:
        return f(x-1)

@pl.StringCache()
def f2(x):
    return f(x)

Here's an interesting SO answer that inspired the first workaround.

@stinodego should this be a polars bug or close?

douglas-raillard-arm commented 7 months ago

@deanm0000

It seems from a couple google searches that it might be a limitation on decorating recursive functions

There is no particular problem decorating recursive functions in general. That specific implementation of the decorator has that issue. It's like saying functions in Python are expected to be buggy because you can write a buggy function. On top of that, it's not even the decorator that has an issue, it's the context manager as demonstrated in the original report:

cm = pl.StringCache()
with cm:
    with cm:
        pass

Now you could argue that re-entrance is not to be expected on a context manager, but again, a trivial fix solves both cases.

Also the doc states:

The class can also be used as a function decorator, in which case the string cache is enabled during function execution, and disabled afterwards.

This is not true currently in the general case. The issue could be "fixed" by stating it won't work on recursive functions but actually fixing the code is even easier:

class StringCache:
    def __init__(self):
        self._string_cache = []

    def __enter__(self):
        self._string_cache.append(PyStringCacheHolder())
        return self

    def __del__(self, *args):
        self._string_cache.pop()

That version (with a stack instead of a counter) will still behave properly if PyStringCacheHolder becomes smarter and aware of its own nesting (right now it looks like a single Rust global variable, but in case this changed that Python code would just follow the Rust behavior naturally, whatever it does)

douglas-raillard-arm commented 7 months ago

Also if enabling a re-entrant context manager is not desired (as it would set that API decision in stone (not that it matters since you could always make it work regardless of the implementation but well ...), you can also just fix the decorator:

import functools

class StringCache:
    ...
    def __call__(f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            with self.__class__():
                return f(*args, **kwargs)

        return wrapper

That decorator will work just fine on recursive functions too, since it creates a new StringCache context manager for every call, rather than creating a single one once and for all that is attached to the function itself. Similarly to fixing the cm, it can be made to use a counter and a single top-level instance if necessary.

mcrumiller commented 7 months ago

A simpler solution would probably be to only give the first StringCache control over the globalization. All other string cache contexts inside are no-ops.