Cool. Can you add an example of a benchmark (you can put it in bench/) where the new feature is exercised and where we can see the advantage of the new method? It would be nice if you could add the results from your machine as a docstring at the beginning of the script; alternatively, you can use a Jupyter notebook too. Thanks!
I am sorry if my comment is dumb, and I am not the maintainer of numexpr, so my opinion is not worth much, but I have the impression your benchmark compares apples to oranges: numpy with 2 threads vs numexpr with 32 (?) threads (2 explicit threads × 16 builtin threads). And how does the builtin numexpr threading interact with manual threading anyway? Also, I would be interested in a benchmark against "normal"/builtin numexpr threading, which I think is more interesting than one against numpy. Unless there is something I don't understand (very likely), I don't expect much difference.
There are many benchmark cases, both against numpy and across different numbers of threads, under the bench/ folder. The whole point of this PR is to avoid reimplementing numexpr's chunking/threading mechanism on top of numpy when multithreading: numexpr is not thread safe due to a global dict, which is stated clearly in the PR description.

I'm not quite sure I understand the comment "how does the builtin numexpr threading interact with manual threading". Well, better that they don't? If they do, that usually implies a race condition; the change here guarantees that they don't. Oversubscription (I have only 16 cores) is a common technique when CPU utilisation is low due to IO, because in reality the other thread(s) might be loading data and barely using the CPUs.

As commented in the benchmark file, it uses 2 threads because the presumption is that memory is only enough for 2 chunks of computation. Then there are two choices: one is to thread over smaller chunks with numpy, the other is to hand threading over to numexpr. To me the latter is clearly the easier option, as sketched below. With the chunks being smaller and the threads oversubscribed, numexpr does not do as well as under other conditions (again, see the bench/ folder). However, it is still much faster than single-threaded numpy, and MUCH less work.
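To make the comparison concrete, here is a minimal sketch of the shape of the second option (not the actual script under bench/; the sizes and the expression are made up): two outer threads, one chunk each, with numexpr's builtin pool doing the inner parallelism.

```python
# Hypothetical sketch: 2 outer threads (the "memory fits 2 chunks" budget),
# each handing its whole chunk to numexpr's builtin thread pool.
import threading
import numpy as np
import numexpr as ne

N = 10_000_000                              # one chunk that fits the memory budget
chunks = [np.random.rand(N) for _ in range(2)]
out = [None, None]

def task(i):
    x = chunks[i]                           # ne.evaluate resolves `x` from this frame
    out[i] = ne.evaluate("2*x**2 + 3*x + 1")

threads = [threading.Thread(target=task, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The numpy alternative would split each chunk further and run the arithmetic per sub-chunk in a hand-rolled pool, i.e. reimplementing exactly what numexpr's builtin threading already does.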
Thanks @emmaai for your example. It was more for me (and others!) to understand the way you wanted to use your feature. Now it is much clearer, and sounds good to me. The only thing that I'd ask is to add a new test exercising this new feature; tests are actually the way to ensure that we are not introducing regressions in the future.
Test added to verify thread safety by deterministically manifesting the race condition.
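For reference, this is roughly the shape of race such a test has to provoke (a sketch, not necessarily the actual test in this PR): two threads interleave `evaluate` and `re_evaluate`, so with a single global `_numexpr_last` one thread's `re_evaluate` can pick up the other thread's cached expression.

```python
# Sketch of the race (expressions made up): with a global `_numexpr_last`,
# thread B's evaluate() can clobber the cached state that thread A's
# re_evaluate() depends on.
import threading
import numpy as np
import numexpr as ne

a = np.arange(1_000_000, dtype=np.float64)
barrier = threading.Barrier(2)
ok = {}

def worker(name, expr, expected):
    ne.evaluate(expr)              # caches the compiled expression
    barrier.wait()                 # force the two threads to interleave here
    res = ne.re_evaluate()         # racy if the cache is a global dict
    ok[name] = np.allclose(res, expected)

t1 = threading.Thread(target=worker, args=("add", "a + a", a + a))
t2 = threading.Thread(target=worker, args=("mul", "a * a", a * a))
t1.start(); t2.start(); t1.join(); t2.join()
assert all(ok.values()), ok        # fails if re_evaluate is not thread safe
```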
Thanks @emmaai for the added explanation.
I'm following up on this PR, wondering if there is any concern preventing it from being merged?
I've just activated the tests in CI, and Mac OSX is reporting a failure. Can you address it?
Sorry, I forgot to commit the change in another file when pushing the test. It should pass now.
Thanks @emmaai !
As the title says (ref: https://github.com/pydata/numexpr/issues/494), the change replaces `_numexpr_last` with a dictionary-like object that is aware of its calling context.

Reasoning: I `re_evaluate` and want it to stay safe as well; async `re_evaluate` further caters to my specific use case.

Benchmark case: it is based on my reality that, most of the time, I have a large amount of data of which only chunks fit into memory, so most of my CPUs idle while IO (especially input) happens. If the data fits into memory and the CPUs are fully utilized, I guess it doesn't make much difference whether you thread with `numpy` by chunks or with `numexpr`. Certainly I could implement what `numexpr` achieves by further chunking each chunk, but why? Specifically, I use `dask` threading to schedule the tasks; all I have to do is pack `ne.evaluate` nicely into a "task", provided thread safety is taken care of.
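A minimal sketch of that usage pattern (the chunk size, expression, and worker count are made up for illustration):

```python
# Hypothetical example: each chunk becomes a plain dask task on the threaded
# scheduler; numexpr's builtin pool parallelises the work inside each task.
import numpy as np
import numexpr as ne
import dask

def process_chunk(chunk):
    # `chunk` is resolved from this frame's locals by ne.evaluate.
    return ne.evaluate("sin(chunk)**2 + cos(chunk)**2")

chunks = [np.random.rand(5_000_000) for _ in range(8)]  # stand-in for chunks read from disk
tasks = [dask.delayed(process_chunk)(c) for c in chunks]
results = dask.compute(*tasks, scheduler="threads", num_workers=2)
```

With a thread-safe `ne.evaluate`, nothing else is needed: dask provides the outer 2-thread schedule (the memory budget) and numexpr the inner parallelism.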