Further overhead reductions

Theelx commented 1 year ago

This PR changes the data structures used to lower overhead. Specifically, a preshmap (pre-hashed map) is used for checking whether a thread has been seen before, which is significantly faster than a simple threading.get_ident() call every time. Additionally, a significant source of overhead is reduced by eliminating one of the hpTimer() calls. Each call can take up to 50 cycles for nanosecond resolution (on Linux), and overall the two calls together summed to half of the overhead. By removing one of the calls, the accuracy hasn't been measurably reduced (since the callback is so much faster than when hpTimer() was originall added). The last major change is that the STL unordered_map container is no longer used, as it was causing significant overhead when hashing the keys, and a parallelized version can take advantage of multiple cores better. Preshed and cymem are both submodules because trying to include them from a pypi installation breaks on a pyenv virtual environment (on my system).

Another fairly minor change is that with the release of Cython 3.0.0b1, the default for cdef functions has been changed to propagate python exceptions. This imposes a 3x speed penalty, raising it to unacceptable levels, so I added noexcept to function signatures to avoid this issue, since our cdef functions don't raise python exceptions.

Here's a table showing the current state of overhead. Here's a comment explaining jsonpickle.	Kernprof Version	jsonpickle Slowdown
3.5.1	3x	60x
4.0.2	~1.9x	30x
This PR	~1.5x	10x

Note: Please test this on async code! I haven't had a chance to, and I don't know if the threading handling is entirely accurate with async code.

Theelx commented 1 year ago

It seems as if there are a decent number of test failures. Please consider this a draft pull request for now, until I can get them all fixed (hopefully in the next week or so).

ta946 commented 1 year ago

hey guys, any updates on this? Just wrote some recursive functions with heavily nested but light for loops eg: dict update stuff (i know, worst case scenario. Also loops in python 😭) Saw a 10x slowdown compared to manual timing

might create a new branch in the autoprofiler fork and merge the current state of this pr to reduce overhead until you have some free time to spare for this

pyutils / line_profiler

Further overhead reductions #203