pythonspeed / filprofiler

A Python memory profiler for data processing and scientific computing applications
https://pythonspeed.com/products/filmemoryprofiler/
Apache License 2.0

Tracing allocations across thread pools #438

Open itamarst opened 1 year ago

itamarst commented 1 year ago

A common pattern (in BLAS libraries, numexpr, blosc2, Polars) is to have a thread pool that runs tasks on behalf of the Python thread. Allocations can happen in this thread pool.

A Python programmer cares which Python code was responsible for the allocation, especially when (as is the most likely scenario) they're using a third-party library. Unfortunately, since the allocation happens in a different thread, connecting it to the Python code that is causally responsible isn't easy.

At the moment Fil (but not Sciagraph) solves this by setting libraries to single-threaded, so the native code runs in the same thread. This suffers from a number of issues, from distorting runtime outcomes to lack of support for arbitrary libraries, e.g. Polars (and Polars' thread limiting doesn't even seem to work?).
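For illustration, here is a minimal sketch of the "force everything single-threaded" approach using environment variables that these libraries are documented to read. This is not Fil's actual code; the helper name force_single_threaded is made up for this example, the variable list is not exhaustive, and the variables generally only take effect if set before the library is first imported.

```python
import os

# Environment variables that common native libraries read at startup to
# size their thread pools. Illustrative, not exhaustive.
THREAD_LIMIT_VARS = (
    "OMP_NUM_THREADS",       # OpenMP-based libraries
    "OPENBLAS_NUM_THREADS",  # OpenBLAS
    "MKL_NUM_THREADS",       # Intel MKL
    "NUMEXPR_NUM_THREADS",   # numexpr
    "POLARS_MAX_THREADS",    # Polars
)


def force_single_threaded(environ=os.environ):
    """Hypothetical helper: ask each known library for a 1-thread pool.

    Must run before the libraries are imported. Returns the names set.
    """
    for name in THREAD_LIMIT_VARS:
        environ[name] = "1"
    return list(THREAD_LIMIT_VARS)
```

The hard-coded list is exactly the weakness described above: any library missing from it keeps its thread pool, and whether a given library fully honors its variable is up to that library.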

Another option: try to get these libraries to expose the information necessary to track causality across threads, for the benefit of memory profilers (and perhaps performance profilers?). This works specifically because of the thread pool model where there's a specific request being sent to the thread pool and a result sent back.

3rd party library support

A library that wanted to support this would have to do the following:

  1. When a task is submitted to the thread pool, get the current thread ID and send it to the thread pool along with the task.
  2. When a thread receives a task, it sets a thread local to that originating thread id.
  3. When a thread finishes a task, it clears the thread local, just in case.
  4. The library exposes a public API function, mylibrary_get_responsible_thread_for_current_task(), which returns the thread ID by reading the thread local. The memory profiler can then use it to match the allocation to the responsible Python thread, which would presumably be waiting on the thread pool.
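The four steps above can be modeled in a few lines of Python. This is only a sketch of the protocol: a real implementation would live in the library's native code (C, C++, Rust), and the names TracingPool and get_responsible_thread_for_current_task are hypothetical stand-ins for whatever a library would actually expose.

```python
import queue
import threading

# Thread local holding the originating thread ID for the current task.
_responsible = threading.local()


def get_responsible_thread_for_current_task():
    """Step 4: hypothetical public API a profiler would call.

    Returns the submitting thread's ID, or None outside of a task.
    """
    return getattr(_responsible, "thread_id", None)


class TracingPool:
    """Toy thread pool that records which thread submitted each task."""

    def __init__(self, workers=2):
        self._tasks = queue.Queue()
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def submit(self, fn, *args):
        result = queue.Queue(maxsize=1)
        # Step 1: capture the submitting thread's ID and send it with the task.
        self._tasks.put((threading.get_ident(), fn, args, result))
        return result  # caller blocks on result.get()

    def _run(self):
        while True:
            origin, fn, args, result = self._tasks.get()
            # Step 2: remember who is causally responsible for this task.
            _responsible.thread_id = origin
            try:
                outcome = (True, fn(*args))
            except Exception as exc:
                outcome = (False, exc)
            finally:
                # Step 3: clear the thread local once the task is done.
                _responsible.thread_id = None
            result.put(outcome)
```

Calling pool.submit(get_responsible_thread_for_current_task).get() from the main thread yields the main thread's ID, even though the function ran on a worker. That is the lookup a profiler would do when it traps an allocation on a pool thread.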

Ideally all libraries would use a consistent concept of thread ID, and this would have to be cross-platform, but it isn't really a lot of code. So it seems feasible to submit it as patches to all the relevant upstream libraries.

Profiler support

The memory profiler when trapping e.g. malloc() would use that API, and then potentially have to get the callstack for a different thread... which may be tricky, but:

Also I guess it would have to know which library is responsible... So I guess that implies the need for a mapping from thread ID to the library that is running the code. Sciagraph already has hooks to do this. Perhaps there's another way as well via library support? The library knows which threads it is managing, after all.
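To make the "get the callstack for a different thread" step concrete, here is a pure-Python sketch using sys._current_frames(), which maps thread IDs (as returned by threading.get_ident()) to each thread's current frame. The helper name python_stack_for_thread is made up for this example. A real profiler trapping malloc() in native code could not simply call into Python like this and would need its own mechanism; this only shows that the lookup is possible given the originating thread ID.

```python
import sys
import traceback


def python_stack_for_thread(thread_id):
    """Hypothetical helper: return another thread's Python call stack.

    The result is a list of (filename, lineno, function_name) tuples,
    outermost call first, or an empty list if the thread is unknown.
    """
    frame = sys._current_frames().get(thread_id)
    if frame is None:
        return []
    return [
        (entry.filename, entry.lineno, entry.name)
        for entry in traceback.extract_stack(frame)
    ]
```

Because the responsible Python thread is presumably blocked waiting on the thread pool, its stack should be stable while the pool thread allocates, which is what makes attributing the allocation to it reasonable.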

itamarst commented 1 year ago

Instead of having all libraries implement their own API, another option is a shared library they all use, which would reduce the maintenance burden for maintainers of OpenBLAS etc., and make life easier for profilers since it requires integration only once.

This opens a can of worms (Python extensions are dlopen()ed, how do you distribute a 3rd party shared library on multiple OSes, how does it get loaded at all)... The best idea I have at the moment is #ifdef code combined with calls to a shared library that gets preloaded via a Python extension module, so this is only enabled for Python code. But that's very handwavy and dlopen() is not one's friend.

I guess another option is for CPython to grow support for this info, but that seems esoteric...

itamarst commented 1 year ago

I wonder if you could do something involving a .h file that has a function that uses dlsym() to get at Python's thread local API. And profilers could just look up task IDs at a well-known thread local key. That's still not a complete solution, though, since there's still the problem of how you have a single global key shared across all libraries.