merge_contextvars raises a KeyError

allonhadaya-maven commented 9 months ago

During a recent update of structlog from version 20.2.0 to 21.1.0 in a large Flask + Gunicorn web app, we encountered a new error coming from contextvars.merge_contextvars:

builtins.KeyError: <ContextVar name='structlog_http.x-real-ip' default=Ellipsis at 0x7fe6b387c3b0>
Traceback (most recent call last):
  ...
  File "/api/app.py", line 148, in response_after_request
    log.debug("api_hit", **{"http.status_code": response.status_code})
  File "/usr/local/lib/python3.8/site-packages/structlog/stdlib.py", line 142, in debug
    return self._proxy_to_logger("debug", event, *args, **kw)
  File "/usr/local/lib/python3.8/site-packages/structlog/stdlib.py", line 213, in _proxy_to_logger
    return super()._proxy_to_logger(method_name, event=event, **event_kw)
  File "/usr/local/lib/python3.8/site-packages/structlog/_base.py", line 203, in _proxy_to_logger
    args, kw = self._process_event(method_name, event, event_kw)
  File "/usr/local/lib/python3.8/site-packages/structlog/_base.py", line 160, in _process_event
    event_dict = proc(self._logger, method_name, event_dict)
  File "/usr/local/lib/python3.8/site-packages/structlog/contextvars.py", line 81, in merge_contextvars
    if k.name.startswith(STRUCTLOG_KEY_PREFIX) and ctx[k] is not Ellipsis:
KeyError: <ContextVar name='structlog_http.x-real-ip' default=Ellipsis at 0x7fe6b387c3b0>

The same KeyError can be observed in our application for contextvars with different names beyond structlog_http.x-real-ip.

Updating to the latest version of structlog, v23.3.0 does not seem to have resolved the issue.

Further inspection suggests that https://github.com/hynek/structlog/pull/302 may have introduced a race condition to the merge_contextvars function when the implementation changed:

from...

def merge_contextvars(
    logger: WrappedLogger, method_name: str, event_dict: Dict[str, Any]
) -> Context:
    """
    A processor that merges in a global (context-local) context.

    Use this as your first processor in :func:`structlog.configure` to ensure
    context-local context is included in all log calls.

    .. versionadded:: 20.1.0
    """
    ctx = _get_context().copy()
    ctx.update(event_dict)

    return ctx

to...

def merge_contextvars(
    logger: WrappedLogger, method_name: str, event_dict: EventDict
) -> EventDict:
    """
    A processor that merges in a global (context-local) context.

    Use this as your first processor in :func:`structlog.configure` to ensure
    context-local context is included in all log calls.

    .. versionadded:: 20.1.0
    .. versionchanged:: 21.1.0 See toplevel note.
    """
    ctx = contextvars.copy_context()

    for k in ctx:
        if k.name.startswith(STRUCTLOG_KEY_PREFIX) and ctx[k] is not Ellipsis:
            event_dict.setdefault(k.name[len(STRUCTLOG_KEY_PREFIX) :], ctx[k])

    return event_dict

It appears possible for ctx[k] to raise a KeyError despite the fact that k is defined as for k in ctx:

We have temporarily reverted the structlog update, but would love to help develop a fix for this issue upstream. Please let us know if/how we can help with further investigation.

hynek commented 9 months ago

Could you give me more information on you setup (like is there async or gevent involved)? Does it have some especially high load? It seems weird that a copied context could change while iterating and nobody running into it in 3 years. 🤔

mLupine commented 9 months ago

Chiming in to note that we're experiencing the exact same issue. Weirdly enough, we're only observing the issue in a single AWS Lambda project of ours (although structlog is used in all of them, including FastAPI and Flask-based ones, as well as other Lambdas), and only on a single contextvar that I've added to the code recently. It also doesn't happen every time structlog is initialized, suggesting that it might indeed be a race condition.

The new contextvar is added using a second call to structlog.contextvars.bind_contextvars instead of being an additional kwarg to the first call. I've just modified our code to only do a single call to bind_contextvars and will observe if it fixes the issue. I'll post an update with the results.

hynek commented 9 months ago

@mLupine thanks for contributing to the fun murder mystery! Mind sharing your versions of Python and structlog? TIA!

hynek commented 9 months ago

Which reminds me: @eladnam any chance any of this makes any sense to you?

mLupine commented 9 months ago

@mLupine thanks for contributing to the fun murder mystery! Mind sharing your versions of Python and structlog? TIA!

Sure: Python 3.11.6, structlog 23.2.0 (when the problematic contextvar was added) to 23.3.0 (now)

hynek commented 9 months ago

One more question (to both): is asyncio involved in those apps at all?

We've encountered some weirdness when mixing them (since threads use thread-locals and asyncio does not).

mLupine commented 9 months ago

One more question (to both): is asyncio involved in those apps at all?

Not at all.

I also confirmed that moving all contextvars binds to a single bind_contextvars call does not seem to help, the issue still persists — weirdly, still with just one of all vars, and always the same one (it's just a string with a random UUID inside).

hynek commented 9 months ago

Does the var change if you define them in a different order? 🤔 How many do you have in total?

hynek commented 9 months ago

OK, y'all know what: pls try this instead and report back pls:

def merge_contextvars(
    logger: WrappedLogger, method_name: str, event_dict: EventDict
) -> EventDict:
    for k, v in tuple(contextvars.copy_context().items()):
        if k.name.startswith(STRUCTLOG_KEY_PREFIX) and v is not Ellipsis:
            event_dict.setdefault(k.name[STRUCTLOG_KEY_PREFIX_LEN:], v)

    return event_dict

mLupine commented 9 months ago

Does the var change if you define them in a different order? 🤔 How many do you have in total?

It doesn't — and when the var is removed completely and another one takes its spot, the issue goes away. There's always between 2 and 5 context vars and only that one generates errors.

OK, y'all know what: pls try this instead and report back pls

Deploying the change, will let you know if the issue persists 🫡

mLupine commented 9 months ago

Well, arguably it did indeed fix the issue, but it did so by bringing the entire application down, so I wouldn't exactly count this one as a win 😏

MemoryError: null
  [irrelevant part cut out]
  File "structlog/_native.py", line 134, in meth
    return self._proxy_to_logger(name, event, **kw)
  File "structlog/_base.py", line 216, in _proxy_to_logger
    args, kw = self._process_event(method_name, event, event_kw)
  File "structlog/_base.py", line 167, in _process_event
    event_dict = proc(self._logger, method_name, event_dict)
  File "infrastructure_common/logging/logging.py", line 221, in merge_contextvars
    for k, v in tuple(contextvars.copy_context().items()):

hynek commented 9 months ago

MemoryError: null!? Holy shit? I don't think I've seen that in over 15 years of writing Python. I guess that's the same problem/race condition.

I'm not asking you to retry, but did it come immediately? Like is this error more reproducible?

mLupine commented 9 months ago

Yeah, I was quite surprised to see that too. In terms of reproducibility, if we were to trust Datadog, failure rate was 100%

hynek commented 9 months ago

Can you reproduce it outside of the runtime? And/or share the how you configure structlog / how it's plugged into requests?

hynek commented 9 months ago

Also you mentioned datadog which AFAIK does quite aggressive instrumentation – @allonhadaya-maven are you using datadog too?

hynek commented 9 months ago

Also: technically, MemoryError means that the interpreter ran out of memory:

Raised when an operation runs out of memory but the situation may still be rescued (by deleting some objects). The associated value is a string indicating what kind of (internal) operation ran out of memory. Note that because of the underlying memory management architecture (C’s malloc() function), the interpreter may not always be able to completely recover from this situation; it nevertheless raises an exception so that a stack traceback can be printed, in case a run-away program was the cause.

Could this be some kind of leak?

P403n1x87 commented 9 months ago

👋 @mLupine On the Datadog instrumentation point, we added a structlog integration with version 2.2.0. To check whether our instrumentation is playing a part in this, may I suggest the following steps:

if you are using ddtrace>=2.2.0 and have logs injection enabled, could you try with logs injection disabled
if you are using ddtrace>=2.2.0 and the above didn't help, could you try downgrading to ddtrace~=2.1.0

If you are using ddtrace<2.2.0 it's unlikely that the Datadog instrumentation is at fault here, but we're keen to keep looking into it in case the steps above didn't help 🙏 .

hynek commented 9 months ago

@P403n1x87 thanks for chiming in! Apart from logging instrumentation, aren’t you doing something to contextvars too? I got the impression from a cursory search.

It doesn’t look to me like a logging thing to me, but either contextvars have a serious bug, wrong documentation, or something is interfering.

mLupine commented 9 months ago

Disabling Datadog injection did not, unsurprisingly, help. What I've just found might, though.

It turns out that in the project that shows those weird errors we had some obsolete structlog initialization code, which was executed separately from the unified structlog-based library we recently migrated all of our project to. This resulted in doubled structlog.configure calls with setup that we didn't really expect or want.

I've now cleaned up our code and suspect that this will be enough to make those errors go away. I'll update and close the issue if it turns out to be the case.

hynek commented 9 months ago

Hmmm, I'm curious but multiple configures shouldn't be a problem. It's not like structlog initializes contextvars on configuration. The only changes happen in bind_contextvars and clear_contextvars. 🤔

If that really helps, I'd be curious to learn what those configs exactly did.

mLupine commented 9 months ago

It did not help, I can still see those errors occurring once every few hours, still only on that poor single contextvar 😏

hynek commented 9 months ago

Can you give me:

your runtime (something crazy like uWSGI?? you mentioned lambda, but is here more to it?)
the structlog config
a list of variables with their types
whether you run clear_contextvars regularly

given only two people have reported this problem in almost three years, it must be something small/weird.

I've talked to @1st1 who kinda implemented contextvars and he also said the observed behavior of contextvars simply shouldn't happen.

mLupine commented 8 months ago

Hey,

Getting back to the issue after a little break.

your runtime (something crazy like uWSGI?? you mentioned lambda, but is here more to it?)

Not at all, just regular Python 3.11 Lambdas

the structlog config

Omitted some portions of the code, but here's most of the configuration function

    processors = [
        structlog.contextvars.merge_contextvars,
        structlog.stdlib.add_log_level,
        structlog.processors.StackInfoRenderer(),
        datadog_tracer_injection,
        structlog.dev.set_exc_info,
        structlog.processors.CallsiteParameterAdder(
            [CallsiteParameter.FILENAME, CallsiteParameter.FUNC_NAME, CallsiteParameter.LINENO],
        ),
        structlog.processors.TimeStamper(fmt="iso"),
        SentryProcessor(event_level=logging.ERROR, tag_keys="__all__"),
        structlog.processors.dict_tracebacks,
        structlog.processors.EventRenamer("message", "_event"),
        structlog.processors.JSONRenderer(serializer=_json_serializer),
    ]

    structlog.configure(
        cache_logger_on_first_use=True,
        wrapper_class=structlog.make_filtering_bound_logger(
            logging.INFO
        ),
        processors=processors,  # type: ignore[arg-type]
    )

a list of variables with their types

Most of the time, it's just the following two top-level variables:

request: dict
   lambda_instance_id: str
   request_id: str
   identity: str | None
   query_string: str | None
   path_parameters: str | None
   body: str | None
lambda_instance_id: str

whether you run clear_contextvars regularly

Not in lambdas, no. If a Lambda is executed on a warm instance, we just bind contextvars again before the request is handled, overwriting previous values.

hynek commented 8 months ago

Could you please try using clear_contextvars then? I’m totally fishing in the dark here, but lambda and its potential idiosyncrasies is the only unusual thing about your setup. :(

mLupine commented 7 months ago

I've modified our handler decorator to use the bound_contextvars context manager so that it cleans up our custom contextvars nicely after every Lambda invocation.

And yeah, it didn't help at all ☠️

hynek commented 7 months ago

Thank you for humoring me!

If we can't figure this out, I guess I'll have to undeprecate the thread local module. :(

mLupine commented 7 months ago

Does it exist somewhere in the current codebase so that I can try it out? I've studied the codebase of our project yet another time and once again found nothing that could interfere with contextvars, especially in such a weird way 😏

hynek commented 7 months ago

Just use https://www.structlog.org/en/stable/thread-local.html and ignore the warnings (pls use only the contextvars equivalents).

I don't think it's your code, I suspect it's the environment or some kind of instrumentation. Unfortunately we haven't heard back from @allonhadaya-maven so I don't know if theirs is special in any way.

If it would happen under normal circumstances, the issue tracker would be overflowing.

mLupine commented 5 months ago

After a few weeks after switching to thread local vars I can confirm it "solves" the issue in our Lambdas. However, just a few days ago we started noticing the same errors in one of our Flask-based applications. It doesn't share much code with our Lambdas, although the internal logging library is the same. We're still on structlog 24.1.0, and this leads me to believe that the errors might be related to something we're doing in our code. I'm just out of ideas about what that something could be and it's not exactly easy to debug given that there are like 10 errors a week despite millions of executions 😣

hynek commented 4 months ago

I've talked to @1st1 and @elprans at PyCon and given your MemoryError(None) the main/only suspicion is a faulty memory allocator.

Are you sure you don't run into low-memory situations?
Are you using a weird libc like Alpine's musl?
Do you do anything else that could interfere with memory allocation at a lower level?

MunifTanjim commented 2 months ago

We're also facing this error with Flask. It happens very infrequently. Let me know if I can provide any useful information.

hynek commented 2 months ago

Anything I wrote above! It smells like a runtime / CPython issue, so anything you can share about your environment, please do share. Especially if you're running on Alpine (or some similar musl distro) and double-check your memory consumption.

hynek / structlog

merge_contextvars raises a KeyError #591