Open Theelx opened 1 year ago
It seems as if there are a decent number of test failures. Please consider this a draft pull request for now, until I can get them all fixed (hopefully in the next week or so).
Hey guys, any updates on this? I just wrote some recursive functions with heavily nested but light for loops, e.g. dict update stuff (I know, worst-case scenario. Also, loops in Python 😭), and saw a 10x slowdown compared to manual timing.
I might create a new branch in the autoprofiler fork and merge the current state of this PR to reduce overhead until you have some free time to spare for this.
This PR changes the data structures used in order to lower overhead. Specifically, a preshmap (pre-hashed map) is used for checking whether a thread has been seen before, which is significantly faster than a simple `threading.get_ident()` call every time. Additionally, a significant source of overhead is reduced by eliminating one of the `hpTimer()` calls. Each call can take up to 50 cycles for nanosecond resolution (on Linux), and together the two calls accounted for half of the overhead. Removing one of the calls hasn't measurably reduced accuracy (since the callback is now so much faster than when `hpTimer()` was originally added). The last major change is that the STL unordered_map container is no longer used, as it was causing significant overhead when hashing the keys, and a parallelized version can take advantage of multiple cores better. Preshed and cymem are both submodules because trying to include them from a PyPI installation breaks on a pyenv virtual environment (on my system).

Another fairly minor change is that with the release of Cython 3.0.0b1, the default for cdef functions has been changed to propagate Python exceptions. This imposes a 3x speed penalty, raising overhead to unacceptable levels, so I added `noexcept` to function signatures to avoid this issue, since our cdef functions don't raise Python exceptions.

Note: Please test this on async code! I haven't had a chance to, and I don't know if the threading handling is entirely accurate with async code.
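
For anyone reviewing the `hpTimer()` change, here is a rough pure-Python sketch of the general idea of reading the clock once per line callback instead of twice. It is not the Cython code in this PR: the `OneReadLineTimer` class, its `on_line`/`finish` methods, and the injectable `clock` parameter are all made up for illustration, and `time.perf_counter_ns` stands in for `hpTimer()`. The single timestamp both closes the previous line's interval and opens the next one.

```python
import time
from collections import defaultdict


class OneReadLineTimer:
    """Toy line timer (hypothetical, not the PR's implementation)."""

    def __init__(self, clock=time.perf_counter_ns):
        self.clock = clock               # stand-in for hpTimer()
        self.totals = defaultdict(int)   # line number -> accumulated ns
        self.hits = defaultdict(int)     # line number -> times executed
        self._last_line = None
        self._last_time = 0

    def on_line(self, lineno):
        now = self.clock()               # the only clock read per callback
        if self._last_line is not None:
            # Close the interval for the line that just finished.
            self.totals[self._last_line] += now - self._last_time
            self.hits[self._last_line] += 1
        # Reuse the same timestamp as the start of the new line.
        self._last_line = lineno
        self._last_time = now

    def finish(self):
        # Flush the interval of the last line seen.
        now = self.clock()
        if self._last_line is not None:
            self.totals[self._last_line] += now - self._last_time
            self.hits[self._last_line] += 1
            self._last_line = None
```

The trade-off in this sketch is that the time spent inside the callback itself gets charged to the profiled line, since there is no second read to subtract it out. That only stays negligible while the callback is fast, which is the argument made in the description above.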