pythonspeed / filprofiler

A Python memory profiler for data processing and scientific computing applications
https://pythonspeed.com/products/filmemoryprofiler/
Apache License 2.0
844 stars 26 forks

Sparse mmap()s are counted as fully allocated from the start, which can be very misleading #308

Open itamarst opened 2 years ago

itamarst commented 2 years ago

Discussed in https://github.com/pythonspeed/filprofiler/discussions/297

Originally posted by **fohria** January 26, 2022

hey! thanks for this profiler, it looks very useful, if i can figure out how to use it :) i have a short script that, in short, generates a bunch of data and then plots it. depending on how much i generate, memory use can be many gigabytes. so i'd like to profile it so i can find out when and where i may have some dataframes hanging around from function calls that i can delete when they're not needed anymore, like after i have dumped it to a file.

however, running it with fil, i get this:

![image](https://user-images.githubusercontent.com/8695061/151163298-e3e11e1e-b137-4240-bcb7-75de88761341.png)

the light pink on the left are the plotting calls, but what does it mean that it says `` all over?

tldr version of my code is:

```
data = generate_data(how_much)  # returns a pd.dataframe
figure = plotting_call(data)
```

(i've installed fil to the same conda env i use for the script, if that matters)
itamarst commented 2 years ago

So far I have compiled the code locally and compared it to the last released version. Locally compiled Fil doesn't even show the import as using any memory at all (which... kinda makes sense, mostly it's an mmap of a file, with a few tiny allocations that should be filtered out).

itamarst commented 2 years ago

The reason for difference I was seeing, where sometimes NumPy is included and sometimes it isn't, is because of the threadpool changes. If numexpr is installed, NumPy gets imported as part of thread pool setup (via numexpr import), and so its memory isn't tracked because it's imported before tracking is started. If numexpr is not installed, NumPy is only imported at user code runtime, and therefore its memory usage is tracked.

So maybe we want to check for numexpr existence without importing it.
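One way to check for numexpr's existence without importing it (and thereby triggering the NumPy import) is `importlib.util.find_spec`, which searches the import machinery's finders but does not execute the module. A minimal sketch; `module_available` is a hypothetical helper name, not Fil's actual code:

```python
import importlib.util
import sys

def module_available(name):
    """Return True if top-level module `name` could be imported,
    without actually importing it (it stays out of sys.modules)."""
    return importlib.util.find_spec(name) is not None
```

For a not-yet-imported module, `find_spec` only consults the path finders, so the module's import-time side effects (like numexpr pulling in NumPy) never run.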

Regardless, however, that doesn't explain the variability in reported NumPy-import memory usage, so the next step is figuring out why it sometimes shows a huge percentage when it shouldn't.

itamarst commented 2 years ago

I think I figured it out:

  1. Some BLAS implementations have a threadpool, likely tied to # of CPUs.
  2. Each thread does a large anonymous mmap():

```
ADD MMAP 134217728    0: filpreload::add_allocation
   1: <unknown>
   2: alloc_mmap
   3: blas_memory_alloc
   4: blas_thread_server
   5: start_thread
   6: clone
```

Thus, depending on the detected number of CPUs and the BLAS version, the reported memory usage for importing NumPy can vary quite a bit.

itamarst commented 2 years ago

Of course, in theory there should only be a single thread when using Fil. So something seems to be wrong with the threadpool-controlling code too (there were three of the above tracebacks when running under Conda).

Update: threadpoolctl does not seem to reduce the number of threads in NumPy; unclear why. Filed an issue: https://github.com/joblib/threadpoolctl/issues/121

itamarst commented 2 years ago

It's not clear that the current approach of limiting to one thread is correct (assuming it can be fixed at all). A newly created, zeroed-out mmap() doesn't actually consume physical memory until its pages are written, so should we really be counting all of it? And if the user is using BLAS, the profiling would be ignoring a potentially large chunk of memory, especially on machines with high core counts.
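The "zeroed mmap uses no memory" behavior is easy to demonstrate: an anonymous mapping only consumes physical memory as pages are faulted in by writes. A small sketch using the standard library, assuming Linux (where `ru_maxrss` is reported in kilobytes):

```python
import mmap
import resource

def peak_rss_kb():
    # Peak resident set size; in kilobytes on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

SIZE = 64 * 1024 * 1024  # 64 MiB

before = peak_rss_kb()
m = mmap.mmap(-1, SIZE)       # large anonymous mapping
after_map = peak_rss_kb()     # barely moves: pages are still untouched

for offset in range(0, SIZE, mmap.PAGESIZE):
    m[offset] = 0xFF          # fault every page in
after_touch = peak_rss_kb()   # now roughly SIZE/1024 higher
m.close()
```

This is exactly the gap between what Fil counts at mmap() time and what the process actually uses.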

Alternatives:

itamarst commented 2 years ago

For alternative 3, checking how much of the mmap is actually filled could be done whenever we check for a new peak, which should ... correctly catch peaks, I think.

itamarst commented 2 years ago

For alternative 3, it looks like the info is available on macOS via the vmmap utility. https://github.com/rbspy/proc-maps wraps the underlying API, although not with the info we'd need. It also claims it requires root and won't work with SIP, and yet I was able to use it on my macOS setup... possibly those restrictions apply to inspecting arbitrary processes? That's not a use case Fil has, so it might work fine.

itamarst commented 2 years ago

As a short-term workaround until alternative 3 above is implemented, I'm going to make sure NumPy is always imported before profiling starts. The memory used by NumPy won't get counted, but in many ways that's not under the user's control anyway. So it seems like a reasonable way to at least give consistent results (it's not like Fil guarantees it tracks everything, anyway).

itamarst commented 2 years ago

Retrieving information

On Linux, the data is in /proc/self/smaps. There is a Rust parser (in the procfs crate), but it does a bunch of allocation and can be expected to be pretty slow. https://man7.org/linux/man-pages/man5/proc.5.html documents the format. We would need to parse just:

  1. The address (the offset field is for files) from the first line of each map.
  2. ...
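To make the format concrete, here is a minimal illustration of pulling the address range and Rss out of /proc/self/smaps. Fil's real implementation would be Rust and allocation-free; this Python sketch just shows which lines matter:

```python
import re

def parse_smaps(path="/proc/self/smaps"):
    """Return a list of (start_addr, end_addr, rss_bytes) per mapping.

    Minimal illustration only; a real parser would keep more fields
    (e.g. to distinguish file-backed from anonymous mappings).
    """
    maps = []
    start = end = None
    try:
        with open(path) as f:
            for line in f:
                # Map header, e.g. "7f8a8c000000-7f8a90000000 rw-p ..."
                header = re.match(r"^([0-9a-f]+)-([0-9a-f]+) ", line)
                if header:
                    start, end = (int(x, 16) for x in header.groups())
                elif line.startswith("Rss:") and start is not None:
                    rss_kb = int(line.split()[1])  # value is in kB
                    maps.append((start, end, rss_kb * 1024))
    except FileNotFoundError:
        pass  # not Linux, or /proc unavailable
    return maps
```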
itamarst commented 2 years ago

The data structure representation for this is probably:

  1. When allocating large items, remember the address in a set.
  2. Retrieve (somehow, at some point) a set of "here's how much memory was not actually allocated" per memory range (i.e. per mmap()).
  3. When calculating "should I store new peak memory", the new potential peak memory bytes and per-callstack numbers can be reduced using items 1 and 2.
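The three steps above can be sketched with plain dicts (all names and numbers here are hypothetical, just to show the shape of the bookkeeping):

```python
# Step 1: remember large allocations by address.
large_allocations = {0x7f0000000000: 128 * 1024 * 1024}  # addr -> mapped size

# Step 2: resident-byte info retrieved per mapping (e.g. from smaps).
resident = {0x7f0000000000: 4 * 1024 * 1024}             # addr -> bytes actually resident

def adjusted_total(allocations, resident_bytes):
    """Step 3: when considering a new peak, count only the resident
    portion of each tracked mapping (fall back to full size if we
    have no resident info for it)."""
    return sum(min(size, resident_bytes.get(addr, size))
               for addr, size in allocations.items())
```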

The problem with this is that retrieving 2 is likely to be expensive, so doing it on every free() is ... not ideal. Need to measure, of course, but it's "open a file, read a file potentially as large as 1MB, parse it". For Sciagraph this is a little less problematic since there's sampling, but still.

One possible heuristic: only do the parse if there have been minor page faults since the last check, presuming minor page faults are a good indicator (need to check) and getrusage() is sufficiently cheap (again, need to check). This may again be more viable with Sciagraph, depending on measurements.
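The fault-count side of that heuristic is available from the standard library via `ru_minflt`. A sketch of the gating logic, assuming a Unix platform; the function name and module-level state are illustrative only:

```python
import resource

_last_minflt = None

def should_reparse_smaps():
    """Return True if minor page faults occurred since the last check.

    Assumes minor faults are a reasonable proxy for "sparse pages got
    filled in"; reading ru_minflt via getrusage() is itself cheap.
    """
    global _last_minflt
    minflt = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    changed = _last_minflt is None or minflt != _last_minflt
    _last_minflt = minflt
    return changed
```

The first call always reports True; after that, touching previously unmapped pages (e.g. filling in a sparse mmap) bumps `ru_minflt` and triggers a re-parse on the next check.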

itamarst commented 2 years ago

/proc/self/smaps_rollup is another alternative to getrusage.
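smaps_rollup (Linux 4.14+) pre-aggregates the per-mapping fields, so getting the process-wide Rss is a much smaller read and parse than full smaps. A quick sketch:

```python
def total_rss_kb(path="/proc/self/smaps_rollup"):
    """Read the process-wide Rss total from smaps_rollup, in kB.

    Returns None if the file is unavailable (non-Linux or old kernel).
    """
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("Rss:"):
                    return int(line.split()[1])  # value is in kB
    except FileNotFoundError:
        pass
    return None
```

The trade-off: it only gives totals, not the per-mapping resident counts that alternative 3 needs, so it fits better as a cheap "did anything change" signal.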

itamarst commented 2 years ago

https://github.com/javierhonduco/bookmark reads /proc/self/pagemap