allonhadaya-maven opened this issue 9 months ago
Could you give me more information on your setup (e.g., is there async or gevent involved)? Does it have an especially high load? It seems weird that a copied context could change while iterating and nobody has run into it in 3 years. 🤔
Chiming in to note that we're experiencing the exact same issue. Weirdly enough, we're only observing the issue in a single AWS Lambda project of ours (although structlog is used in all of them, including FastAPI and Flask-based ones, as well as other Lambdas), and only on a single contextvar that I've added to the code recently. It also doesn't happen every time structlog is initialized, suggesting that it might indeed be a race condition.
The new contextvar is added via a second call to structlog.contextvars.bind_contextvars instead of being an additional kwarg to the first call. I've just modified our code to make only a single call to bind_contextvars and will observe whether it fixes the issue. I'll post an update with the results.
@mLupine thanks for contributing to the fun murder mystery! Mind sharing your versions of Python and structlog? TIA!
Which reminds me: @eladnam any chance any of this makes any sense to you?
> @mLupine thanks for contributing to the fun murder mystery! Mind sharing your versions of Python and structlog? TIA!
Sure: Python 3.11.6, structlog 23.2.0 (when the problematic contextvar was added) to 23.3.0 (now)
One more question (to both): is asyncio involved in those apps at all?
We've encountered some weirdness when mixing them (since threads use thread-locals and asyncio does not).
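For context, the distinction being asked about here can be seen with plain stdlib `contextvars`, which do follow asyncio tasks (the variable and handler names below are illustrative, not from the reported app):

```python
import asyncio
import contextvars

# Each asyncio task runs in its own copy of the context, so a
# ContextVar set in one task is invisible to concurrent tasks --
# unlike threading.local, which is shared by all tasks on a thread.
request_id = contextvars.ContextVar("request_id", default=None)

async def handler(rid: str) -> str:
    request_id.set(rid)
    await asyncio.sleep(0)  # yield so the other task runs in between
    return request_id.get()  # still sees this task's own value

async def main() -> list:
    return await asyncio.gather(handler("a"), handler("b"))

print(asyncio.run(main()))  # ['a', 'b']
```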
> One more question (to both): is asyncio involved in those apps at all?
Not at all.
I also confirmed that moving all contextvar binds into a single bind_contextvars call does not seem to help; the issue still persists. Weirdly, it's still just one of all the vars, and always the same one (a string containing a random UUID).
Does the var change if you define them in a different order? 🤔 How many do you have in total?
OK, y'all know what: pls try this instead and report back pls:
```python
def merge_contextvars(
    logger: WrappedLogger, method_name: str, event_dict: EventDict
) -> EventDict:
    for k, v in tuple(contextvars.copy_context().items()):
        if k.name.startswith(STRUCTLOG_KEY_PREFIX) and v is not Ellipsis:
            event_dict.setdefault(k.name[STRUCTLOG_KEY_PREFIX_LEN:], v)
    return event_dict
```
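For anyone following along, the defensive `tuple(...)` snapshot in that processor can be exercised with a stdlib-only sketch; `PREFIX` here is an illustrative stand-in for structlog's internal key prefix:

```python
import contextvars

# Hypothetical prefix standing in for structlog's internal
# STRUCTLOG_KEY_PREFIX; everything here is plain stdlib.
PREFIX = "structlog_"
rid = contextvars.ContextVar(PREFIX + "request_id")
rid.set("abc-123")

def merge(event_dict: dict) -> dict:
    # Snapshot the copied context into a tuple *before* iterating,
    # so the loop works on stable (var, value) pairs even if the
    # underlying context were somehow mutated concurrently.
    for k, v in tuple(contextvars.copy_context().items()):
        if k.name.startswith(PREFIX) and v is not Ellipsis:
            event_dict.setdefault(k.name[len(PREFIX):], v)
    return event_dict

print(merge({"event": "hello"}))  # {'event': 'hello', 'request_id': 'abc-123'}
```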
> Does the var change if you define them in a different order? 🤔 How many do you have in total?
It doesn't, and when the var is removed completely and another one takes its spot, the issue goes away. There are always between 2 and 5 context vars, and only that one generates errors.
> OK, y'all know what: pls try this instead and report back pls
Deploying the change, will let you know if the issue persists 🫡
Well, arguably it did indeed fix the issue, but it did so by bringing the entire application down, so I wouldn't exactly count this one as a win.
```
MemoryError: null

[irrelevant part cut out]

  File "structlog/_native.py", line 134, in meth
    return self._proxy_to_logger(name, event, **kw)
  File "structlog/_base.py", line 216, in _proxy_to_logger
    args, kw = self._process_event(method_name, event, event_kw)
  File "structlog/_base.py", line 167, in _process_event
    event_dict = proc(self._logger, method_name, event_dict)
  File "infrastructure_common/logging/logging.py", line 221, in merge_contextvars
    for k, v in tuple(contextvars.copy_context().items()):
MemoryError: null
```
!? Holy shit? I don't think I've seen that in over 15 years of writing Python. I guess that's the same problem/race condition.
I'm not asking you to retry, but did it come immediately? Like is this error more reproducible?
Yeah, I was quite surprised to see that too. In terms of reproducibility, if we were to trust Datadog, failure rate was 100%
Can you reproduce it outside of the runtime? And/or share how you configure structlog and how it's plugged into requests?
Also, you mentioned Datadog, which AFAIK does quite aggressive instrumentation. @allonhadaya-maven, are you using Datadog too?
Also: technically, MemoryError means that the interpreter ran out of memory:
> Raised when an operation runs out of memory but the situation may still be rescued (by deleting some objects). The associated value is a string indicating what kind of (internal) operation ran out of memory. Note that because of the underlying memory management architecture (C's malloc() function), the interpreter may not always be able to completely recover from this situation; it nevertheless raises an exception so that a stack traceback can be printed, in case a run-away program was the cause.
Could this be some kind of leak?
👋 @mLupine On the Datadog instrumentation point, we added a structlog integration with version 2.2.0. To check whether our instrumentation is playing a part in this, may I suggest the following steps:

- If you are using `ddtrace>=2.2.0` and have logs injection enabled, could you try with logs injection disabled?
- If you are using `ddtrace>=2.2.0` and the above didn't help, could you try downgrading to `ddtrace~=2.1.0`?
- If you are using `ddtrace<2.2.0`, it's unlikely that the Datadog instrumentation is at fault here.

Either way, we're keen to keep looking into it in case the steps above didn't help.
@P403n1x87 thanks for chiming in! Apart from logging instrumentation, aren't you doing something to contextvars too? I got that impression from a cursory search.
This doesn't look like a logging problem to me: either contextvars have a serious bug, the documentation is wrong, or something is interfering.
Disabling Datadog injection did not, unsurprisingly, help. What I've just found might, though.
It turns out that the project showing those weird errors had some obsolete structlog initialization code, which was executed separately from the unified structlog-based library we recently migrated all of our projects to. This resulted in doubled structlog.configure calls with a setup we didn't really expect or want.
I've now cleaned up our code and suspect that this will be enough to make those errors go away. I'll update and close the issue if it turns out to be the case.
Hmmm, I'm curious, but multiple configure calls shouldn't be a problem. structlog doesn't initialize contextvars on configuration; the only changes happen in bind_contextvars and clear_contextvars. 🤔
If that really helps, I'd be curious to learn what those configs exactly did.
It did not help, I can still see those errors occurring once every few hours, still only on that poor single contextvar.
Can you give me:

- your runtime
- the structlog config
- a list of variables with their types
- whether you run clear_contextvars regularly

Given only two people have reported this problem in almost three years, it must be something small/weird.
I've talked to @1st1 who kinda implemented contextvars and he also said the observed behavior of contextvars simply shouldn't happen.
Hey,
Getting back to the issue after a little break.
> your runtime (something crazy like uWSGI?? you mentioned Lambda, but is there more to it?)
Not at all, just regular Python 3.11 Lambdas
> the structlog config
Omitted some portions of the code, but here's most of the configuration function
```python
processors = [
    structlog.contextvars.merge_contextvars,
    structlog.stdlib.add_log_level,
    structlog.processors.StackInfoRenderer(),
    datadog_tracer_injection,
    structlog.dev.set_exc_info,
    structlog.processors.CallsiteParameterAdder(
        [CallsiteParameter.FILENAME, CallsiteParameter.FUNC_NAME, CallsiteParameter.LINENO],
    ),
    structlog.processors.TimeStamper(fmt="iso"),
    SentryProcessor(event_level=logging.ERROR, tag_keys="__all__"),
    structlog.processors.dict_tracebacks,
    structlog.processors.EventRenamer("message", "_event"),
    structlog.processors.JSONRenderer(serializer=_json_serializer),
]

structlog.configure(
    cache_logger_on_first_use=True,
    wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
    processors=processors,  # type: ignore[arg-type]
)
```
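For readers unfamiliar with structlog: each entry in a processors list like that is just a callable over the event dict, and the chain threads the dict through them in order. A stdlib-only sketch of that composition, with made-up processor names:

```python
import json

def add_level(logger, method_name, event_dict):
    # A processor receives (logger, method_name, event_dict) and
    # returns the (possibly modified) event dict.
    event_dict["level"] = method_name
    return event_dict

def render_json(logger, method_name, event_dict):
    # The final processor may return something other than a dict,
    # e.g. the rendered log line.
    return json.dumps(event_dict, sort_keys=True)

def process(processors, method_name, event_dict):
    # Each processor receives the output of the previous one,
    # mirroring how structlog chains its processor list.
    for proc in processors:
        event_dict = proc(None, method_name, event_dict)
    return event_dict

print(process([add_level, render_json], "info", {"event": "hello"}))
```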
> a list of variables with their types
Most of the time, it's just the following two top-level variables:

- `request: dict`, containing:
  - `request_id: str`
  - `identity: str | None`
  - `query_string: str | None`
  - `path_parameters: str | None`
  - `body: str | None`
- `lambda_instance_id: str`
> whether you run clear_contextvars regularly
Not in lambdas, no. If a Lambda is executed on a warm instance, we just bind contextvars again before the request is handled, overwriting previous values.
Could you please try using clear_contextvars then? I'm totally fishing in the dark here, but Lambda and its potential idiosyncrasies are the only unusual thing about your setup. :(
I've modified our handler decorator to use the bound_contextvars context manager so that it cleans up our custom contextvars nicely after every Lambda invocation. And yeah, it didn't help at all.
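The clean-up pattern mentioned here can be sketched with stdlib `contextvars` alone; this is an illustrative analogue of a bound-context-manager approach, not structlog's actual implementation:

```python
import contextvars
from contextlib import contextmanager

request_id = contextvars.ContextVar("request_id", default=None)

@contextmanager
def bound_request_id(value):
    token = request_id.set(value)
    try:
        yield
    finally:
        # reset() restores whatever value the var had before entry,
        # so each invocation cleans up after itself.
        request_id.reset(token)

with bound_request_id("req-1"):
    assert request_id.get() == "req-1"
assert request_id.get() is None  # cleaned up after the block
```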
Thank you for humoring me!
If we can't figure this out, I guess I'll have to undeprecate the thread local module. :(
Does it exist somewhere in the current codebase so that I can try it out? I've studied our project's codebase yet another time and once again found nothing that could interfere with contextvars, especially in such a weird way.
Just use https://www.structlog.org/en/stable/thread-local.html and ignore the warnings (pls use only the contextvars equivalents).
I don't think it's your code, I suspect it's the environment or some kind of instrumentation. Unfortunately we haven't heard back from @allonhadaya-maven so I don't know if theirs is special in any way.
If it would happen under normal circumstances, the issue tracker would be overflowing.
A few weeks after switching to thread-local vars, I can confirm it "solves" the issue in our Lambdas. However, just a few days ago we started noticing the same errors in one of our Flask-based applications. It doesn't share much code with our Lambdas, although the internal logging library is the same. We're still on structlog 24.1.0, and this leads me to believe that the errors might be related to something we're doing in our code. I'm just out of ideas about what that something could be, and it's not exactly easy to debug given that there are about 10 errors a week despite millions of executions.
I've talked to @1st1 and @elprans at PyCon, and given your MemoryError(None), the main/only suspicion is a faulty memory allocator.
We're also facing this error with Flask. It happens very infrequently. Let me know if I can provide any useful information.
Anything I wrote above! It smells like a runtime / CPython issue, so anything you can share about your environment, please do share. Especially if you're running on Alpine (or some similar musl distro) and double-check your memory consumption.
During a recent update of `structlog` from version `20.2.0` to `21.1.0` in a large Flask + Gunicorn web app, we encountered a new error coming from `contextvars.merge_contextvars`.

The same `KeyError` can be observed in our application for contextvars with different names beyond `structlog_http.x-real-ip`.

Updating to the latest version of `structlog`, `v23.3.0`, does not seem to have resolved the issue.

Further inspection suggests that https://github.com/hynek/structlog/pull/302 may have introduced a race condition in the `merge_contextvars` function when the implementation changed from ... to ...

It appears possible for `ctx[k]` to raise a `KeyError` despite the fact that `k` comes from iterating `for k in ctx:`.

We have temporarily reverted the `structlog` update, but would love to help develop a fix for this issue upstream. Please let us know if/how we can help with further investigation.
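The fragile lookup pattern described in this report can be contrasted with a snapshot-based one in plain stdlib code; `structlog_key` is an illustrative variable name, not structlog's actual key:

```python
import contextvars

var = contextvars.ContextVar("structlog_key")
var.set("value")
ctx = contextvars.copy_context()  # Context implements the Mapping interface

# Pattern reported as fragile: iterate keys, then do a separate
# ctx[k] lookup per key. If the mapping could change between the
# iteration and the lookup, ctx[k] would raise KeyError.
out_lookup = {k.name: ctx[k] for k in ctx if k.name.startswith("structlog_")}

# Snapshot pattern: materialize the (var, value) pairs once, so no
# second lookup is needed at all.
out_snapshot = {
    k.name: v for k, v in tuple(ctx.items()) if k.name.startswith("structlog_")
}

assert out_lookup == out_snapshot == {"structlog_key": "value"}
```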