hynek / structlog

Simple, powerful, and fast logging for Python.
https://www.structlog.org/

pandas + pandera + structlog + rich + raised exception = infinite loop #679

Open hackermandh opened 1 week ago

hackermandh commented 1 week ago

As silly as the title is, I think it's about the most condensed way to describe my problem.

  1. If I have a Pandas dataframe
  2. And I use a Pandera DataFrameModel class to validate said dataframe
  3. And I try to log the exception

Then the script never stops (or at least runs beyond the 5 minutes I was willing to wait).

For your convenience, I've prepared a minimal example that triggers this behavior, in the form of a uv script.

Save it as script.py and run it with uv run --script script.py. Normally I wouldn't do this, but I know you're a fan of uv, as I am, so this should make both our lives easier 😉.

# /// script
# requires-python = ">=3.11" # python version does not seem to matter (3.7, 3.11 and 3.13 tested)
# dependencies = [
#     "pandas>=2.2.0", # 2.2.0 minimal version
#     "pandera>=0.20.0", # 0.20.0 minimal version
#     "rich>=13.9.4",  # if you disable `rich`, it'll run as expected, or:
#     "structlog>=24.4.0", # version 21.1.0 works as well, anything after it breaks.
# ]
# # run this file with "uv run --script script.py"
# ///
print("1. loading imports")
import pandas as pd
import pandera as pa
from pandera.typing import Series
from pandera.errors import SchemaError
from structlog.stdlib import get_logger

print("2. loading logger")
logger = get_logger(__name__)

print("3. loading schema")
class MySchema(pa.DataFrameModel):
    my_floats: Series[float] = pa.Field(
        alias="my_floats", check_name=True, nullable=False
    )

    class Config:
        coerce = True

print("4. loading dict")
MY_DICT = {
    "my_floats": {
        1: "tHiS iS nOt A fLoAt",
    },
}

print("5. dict to dataframe")
df = pd.DataFrame.from_dict(MY_DICT)

try:
    print("6. validation")
    MySchema.validate(df)
except SchemaError as schema_error:
    print(
        "7. logging the exception (cancel the script after 30 seconds, as it'll run forever)"
    )
    logger.exception("ingestion-validation-unsuccessful")
    print("8. you'll never reach this point")
    raise schema_error
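
Rather than waiting and cancelling by hand, the point where step 7 gets stuck could be captured with the standard library's faulthandler module. The following is only a sketch: the helper name hang_watchdog and the 30-second limit are made up for illustration and are not part of the script above.

```python
import faulthandler
from contextlib import contextmanager


@contextmanager
def hang_watchdog(seconds: float, exit_on_timeout: bool = True):
    """Dump all thread stacks to stderr if the wrapped block exceeds `seconds`.

    Hypothetical helper for illustration; only the faulthandler calls are
    standard-library API.
    """
    # Arm a timer: when it fires, the stacks are written to stderr and,
    # if exit_on_timeout is true, the process is terminated.
    faulthandler.dump_traceback_later(seconds, exit=exit_on_timeout)
    try:
        yield
    finally:
        # Disarm the timer if the block finished in time.
        faulthandler.cancel_dump_traceback_later()


# Stand-in for the suspect call; in the script above this would wrap
# logger.exception("ingestion-validation-unsuccessful").
with hang_watchdog(30):
    total = sum(range(1_000))
```

If the wrapped call really hangs, the dumped stack should show which library frames the interpreter is spinning in, which is the kind of detail a bug report can use.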

hynek commented 1 day ago

Thanks for the nice MRE!

I'm afraid this is 100% a bug in the interaction between Rich and Pandas. Check this out:

# /// script
# requires-python = ">=3.11" # python version does not seem to matter (3.7, 3.11 and 3.13 tested)
# dependencies = [
#     "pandas>=2.2.0", # 2.2.0 minimal version
#     "pandera>=0.20.0", # 0.20.0 minimal version
#     "rich>=13.9.4",  # if you disable `rich`, it'll run as expected, or:
# ]
# # run this file with "uv run --script script.py"
# ///
print("1. loading imports")
import pandas as pd
import pandera as pa
from pandera.typing import Series
from pandera.errors import SchemaError
from rich.traceback import Traceback

print("2. loading logger")

print("3. loading schema")
class MySchema(pa.DataFrameModel):
    my_floats: Series[float] = pa.Field(
        alias="my_floats", check_name=True, nullable=False
    )

    class Config:
        coerce = True

print("4. loading dict")
MY_DICT = {
    "my_floats": {
        1: "tHiS iS nOt A fLoAt",
    },
}

print("5. dict to dataframe")
df = pd.DataFrame.from_dict(MY_DICT)

try:
    print("6. validation")
    MySchema.validate(df)
except SchemaError as schema_error:
    import sys
    print("7. Calling into Rich")
    Traceback.from_exception(*sys.exc_info())
    print("Sad trombone")

I think I'd start by opening a bug over at Rich.