itamarst opened 2 years ago
So far I have compiled the code locally and compared it to the last released version. Locally compiled Fil doesn't even show the import as using any memory at all (which... kinda makes sense; mostly it's an mmap of a file, with a few tiny allocations that should be filtered out).
The reason for the difference I was seeing, where sometimes NumPy is included and sometimes it isn't, is the threadpool changes. If `numexpr` is installed, NumPy gets imported as part of thread pool setup (via the `numexpr` import), and so its memory isn't tracked, because it's imported before tracking starts. If `numexpr` is not installed, NumPy is only imported when user code runs, and therefore its memory usage is tracked.
So maybe we want to check for `numexpr`'s existence without importing it.
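One way to do that (a minimal sketch; the surrounding setup logic is hypothetical) is `importlib.util.find_spec`, which consults the import system without executing the module:

```python
import importlib.util

# Determine whether numexpr is installed without importing it;
# importing it would drag in NumPy before tracking starts.
numexpr_available = importlib.util.find_spec("numexpr") is not None

if numexpr_available:
    # Thread pool setup will import numexpr (and therefore NumPy),
    # so account for that before tracking begins.
    ...
```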
Regardless, that doesn't explain the variability in reported NumPy-import memory usage, so the next step is figuring out why it sometimes has a huge % when it shouldn't.
I think I figured it out:

```
ADD MMAP 134217728
 0: filpreload::add_allocation
 1: <unknown>
 2: alloc_mmap
 3: blas_memory_alloc
 4: blas_thread_server
 5: start_thread
 6: clone
```
Thus, depending on the detected number of CPUs and the BLAS version, the reported memory usage for importing `numpy` can vary quite a bit.
Of course, in theory there should only be a single thread when using Fil. So something is wrong with the threadpool-controlling code too, it seems (there were three of the above tracebacks when running under Conda).
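To make that concrete (the core counts here are hypothetical): at 134217728 bytes (128 MiB) per BLAS thread, as in the traceback above, a 4-core machine's threadpool shows up as 4 × 128 MiB = 512 MiB of mmap()s, while a 32-core machine's shows up as 4 GiB, for the exact same `import numpy`.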
Update: `threadpoolctl` does not seem to reduce the number of threads in NumPy; it's unclear why. Filed an issue: https://github.com/joblib/threadpoolctl/issues/121
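For comparison, the blunt workaround is setting the standard BLAS thread-count environment variables before NumPy is first imported, since OpenBLAS/MKL/OpenMP read them when they initialize their threadpools (a sketch; whether Fil can rely on this needs checking):

```python
import os

# These must be set before the first `import numpy`: the BLAS libraries
# read them once, at threadpool initialization.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import numpy  # noqa: E402 -- BLAS threadpool is now a single thread
```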
It's not clear that the current approach of limiting to one thread is correct (assuming it can be fixed). A newly zeroed-out mmap() doesn't actually use any memory until its pages are written, so should we really be counting all of it? And if the user is actually using BLAS, the profiling will be ignoring a potentially large chunk of memory, especially on machines with high core counts.
Alternatives:

1. Always `import numpy` before tracking starts, so its allocations are consistently excluded.
2. Only count mmap() pages once they're actually written to. The dirtying happens at some arbitrary later point, though, so harder to tie it to e.g. filling in the array contents.
3. Check how much of each mmap() is actually filled in, rather than counting the whole mapping up front.
4. Use `userfaultfd` to track when pages become dirty. This would give accurate callstack attribution. Would only work on Linux, needs ptrace capability. Scary to implement.

For alternative 3, checking how much of the mmap is filled could be done whenever we check for a new peak, which should ... correctly catch peaks, I think.
For alternative 3, it looks like the info is available on macOS via the `vmmap` utility. https://github.com/rbspy/proc-maps wraps the underlying API, although not with the info we'd need. It also claims it requires root and won't work with SIP, and yet I was able to do that on my macOS setup... possibly those restrictions apply to inspecting arbitrary processes? Which is not a use case Fil has. So it might work fine.
As a short-term workaround until alternative 3 above is implemented, I'm going to make sure `numpy` is always imported before profiling starts. The memory used by `numpy` won't get counted, but in many ways that's not under the user's control anyway. So it seems like a reasonable way to at least give consistent results (it's not like Fil guarantees it tracks everything, anyway).
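A minimal sketch of that workaround (where exactly it hooks into Fil's startup is glossed over here):

```python
import importlib.util

def preimport_numpy():
    # Importing numpy before tracking starts means its import-time
    # allocations -- including the BLAS threadpool buffers -- are
    # consistently excluded, instead of sometimes-counted.
    if importlib.util.find_spec("numpy") is not None:
        import numpy  # noqa: F401

preimport_numpy()
# ... only now start tracking allocations ...
```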
On Linux, the data is in /proc/self/smaps. There is a Rust parser (in `procfs`), but it does a bunch of allocation and can be expected to be pretty slow. https://man7.org/linux/man-pages/man5/proc.5.html documents the format. We would need to parse just the mapping header line (for the address range) and the Rss: line (for how much of the mapping is resident).
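Here's what that parse might look like, sketched in Python (the real thing would presumably live in the Rust preload library, and the choice of Rss: as the field to read is my assumption):

```python
def resident_bytes_by_mapping(pid="self"):
    """Map each (start, end) address range in /proc/<pid>/smaps to its RSS.

    Entries look like:
        7f2a4c000000-7f2a54000000 rw-p 00000000 00:00 0
        Size:             131072 kB
        Rss:                 256 kB
        ...
    """
    result = {}
    current = None
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            first, _, rest = line.partition(" ")
            if first.endswith(":"):
                # A field line; we only care about resident size.
                if first == "Rss:" and current is not None:
                    result[current] = int(rest.split()[0]) * 1024
            else:
                # A mapping header line: parse the address range.
                start, _, end = first.partition("-")
                current = (int(start, 16), int(end, 16))
    return result
```

Looking up the range containing an address returned from mmap() then tells us how much of that allocation is actually resident (modulo the kernel splitting or merging mappings, which this sketch ignores).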
The data structure representation for this is probably: (1) the set of address ranges we've seen returned from mmap(), and (2) for each of those ranges, how much is actually resident. The problem with this is that retrieving 2 is likely to be expensive, so doing it on every free() is ... not ideal. Need to measure of course, but it's "open file, read a file potentially as large as 1MB, parse it". For Sciagraph this is a little less problematic since there's sampling, but still.
One possible heuristic: only do the parse if there have been minor page faults since the last check, presuming minor page faults are a good indicator (need to check) and `getrusage` is sufficiently cheap (again, need to check). This may again be more viable with Sciagraph, depending on measurements.
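A sketch of that heuristic (in Python for brevity; the real check would presumably be a direct getrusage(2) call from the Rust side):

```python
import resource

_last_minflt = None

def pages_faulted_since_last_check():
    """Cheap gate in front of the expensive smaps parse."""
    global _last_minflt
    # ru_minflt counts minor page faults; the first touch of a zeroed
    # mmap() page is a minor fault, i.e. exactly the event that changes
    # how much of a mapping is filled in.
    minflt = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    changed = minflt != _last_minflt
    _last_minflt = minflt
    return changed
```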
/proc/self/smaps_rollup is another alternative to `getrusage`.
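It's a single pre-summed entry rather than one entry per mapping, so parsing stays cheap (a sketch, again assuming Rss: is the field we want):

```python
def total_rss_bytes():
    # smaps_rollup has the same fields as smaps, but summed over all
    # mappings, so there's only one small entry to parse.
    with open("/proc/self/smaps_rollup") as f:
        for line in f:
            if line.startswith("Rss:"):
                return int(line.split()[1]) * 1024
    raise RuntimeError("no Rss: line in smaps_rollup")
```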
https://github.com/javierhonduco/bookmark reads /proc/self/pagemap.
Discussed in https://github.com/pythonspeed/filprofiler/discussions/297