Closed oak-tree closed 8 months ago
Unfortunately the NumExpr 2.6 reductions are single-threaded, so you will not see a performance improvement from using sum inside NumExpr. I would suggest you try:
dist_2 = ne.evaluate('(nodes - node) ** 2').sum(axis=1)
I'm not clear on why there's a difference between NumPy and NumExpr 2.6 for float32, but if it changes when doing the summations with NumPy after the return, as above, then likely the problem is within the NE reduction machinery. Also try pre-allocating your output to stop NumExpr from using output buffering:
temp = np.empty_like(nodes)
ne.evaluate('(nodes - node) ** 2', out=temp)
dist_2 = temp.sum(axis=1)
That will be harder on your RAM and memory bandwidth, however. Also try transposing your matrix and averaging over axis 0 instead, since I assume you have C-ordered arrays?
How much RAM do you have? Creating additional permanent arrays for nodes_sqr in memory is of course simple (for both NumPy and NumExpr), but your array is already 4 GB. It may help to pre-compute this, but on the other hand it may simply further restrict your memory bandwidth. It would depend a lot on your computer's configuration.
How many threads are you using for NumExpr? Don't use hyperthreading with it; use the number of physical cores.
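For reference, a sketch of how the thread count is usually pinned for NumExpr, assuming the documented NUMEXPR_NUM_THREADS environment variable (numexpr.set_num_threads() can also be called at runtime); the value 4 is only an example:

```shell
# Cap NumExpr at the number of physical cores (example value: 4).
# Must be set before the Python process imports numexpr.
export NUMEXPR_NUM_THREADS=4
```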
This page explains why the answer isn't exactly the same:
https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
In short, doing a+(b+c) doesn't give the same result as (a+b)+c in floating point. Doing the reduction in parallel changes the order of addition, so it changes the result. But in this case, it should give a result that is more precise! For the same inputs, compare the reduction in NumPy in float32 vs float64 vs NE in float32 and float64. The NE result in float32 should be closer to the float64 results than NumPy in float32.
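The non-associativity point can be seen with a tiny NumPy check (the values below are chosen purely for illustration):

```python
import numpy as np

# Floating-point addition is not associative: in float32, adding 1 to 1e8
# is lost to rounding, so the grouping of the additions changes the result.
a = np.float32(1e8)
b = np.float32(-1e8)
c = np.float32(1.0)

left = (a + b) + c   # 0 + 1 -> 1.0
right = a + (b + c)  # -1e8 + 1 rounds back to -1e8, so 1e8 - 1e8 -> 0.0

print(left, right)  # 1.0 0.0
```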
Note, while this is true in general, I'm pretty sure there are corner cases where it is false, so to be really sure, it should be tried in multiple cases.
In Theano, we do the reduction with a float64 accumulator when the inputs are float32. This gives more precise results without slowing down our computation, since the bottleneck is reading from memory, not the computation. I think NumPy uses a float32 accumulator; at least, that is what it did in the past. I would suggest that you investigate using a float64 accumulator by default in NE3, so that by default NE3 would be faster and more precise than NE2 and NumPy.
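A small sketch of the accumulator effect. Note that NumPy's sum uses pairwise summation, which already hides much of the error, so to imitate a strictly sequential float32 accumulator the example below abuses cumsum (which is inherently sequential); dtype=np.float64 is NumPy's way of requesting a float64 accumulator over float32 inputs:

```python
import numpy as np

# One million copies of float32(0.1); the exact sum of these float32
# values is 1_000_000 * float(np.float32(0.1)).
x = np.full(1_000_000, 0.1, dtype=np.float32)

naive32 = x.cumsum()[-1]          # running float32 accumulator (sequential)
acc64 = x.sum(dtype=np.float64)   # float32 inputs, float64 accumulator

exact = 1_000_000 * float(np.float32(0.1))
print(abs(float(naive32) - exact), abs(acc64 - exact))
```

The sequential float32 accumulation drifts badly once the running total dwarfs each addend, while the float64 accumulator stays essentially exact.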
Fred
Fred,
Ooph, I was writing a response and then Chrome crashed on me. Thanks for the article on machine precision. I've read a similar (yet less detailed) manuscript long ago but the refresher was needed on my part. Unfortunately I think reductions are fairly far into the future for the NE3 roadmap. Certainly they will have to be done better than before to make implementing them worthwhile.
I do want to support explicit casting in NE3, which should make it possible for float32 calculations to just use float64 temporaries but return float32 to system RAM.
Hey @robbmcleod @nouiz,
Thanks a lot for your quick replies and for throwing some light onto the issue with the differences between the numpy result and the numexpr result.
Fred, thanks for the interesting information. The FPU difference makes sense.
An off-topic question: is it possible to do this calculation faster in theano? Basically, calculating <x,y> (and then doing -2<x,y> + <y,y>) for each row is 400,000 different calculations, so it should be a good situation for the GPU to do in parallel.
@robbmcleod I have more than 8 GB, so no issue so far, although I'm looking for a real-time solution. What about the idea of splitting it into per-row <x,y> calculations in numexpr, is that possible? I'll check later on the idea of buffering space for the output in advance.
Thanks,
Alon
Hey,
Another question:
Looking at the numpy docs, it seems I do have C-ordered arrays. What difference does it make to performance when summing over the transposed matrix?
Thanks!
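For context on the question above, a minimal NumPy sketch of what transposing means here: for a C-ordered (row-major) array, each row is contiguous in memory, so a reduction along axis 1 streams through memory in order, and summing the transpose over axis 0 touches exactly the same elements:

```python
import numpy as np

# NumPy arrays are C-ordered (row-major) by default, so the elements
# of each row sit next to each other in memory.
nodes = np.arange(12, dtype=np.float32).reshape(3, 4)
assert nodes.flags['C_CONTIGUOUS']

row_sums = nodes.sum(axis=1)

# nodes.T is a zero-copy view with swapped strides; summing it over
# axis 0 reduces the same contiguous rows and gives identical results.
same = nodes.T.sum(axis=0)
assert np.array_equal(row_sums, same)
```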
What is the <> operator? Theano fuses consecutive elemwise operations. On the GPU, in rare cases for now, we fuse reductions and elemwise operations. Intel has a branch where, on the CPU, they fuse some or all reductions and elemwise operations; I didn't look at it in detail.
So Theano will do some optimization for your use case and, depending on the operator, could do more. Give it a try. We aren't great at CPU parallelism, but the Intel torch fork helps with that.
<x,y> is the inner product between x and y, where x and y are vectors. A good way I saw online to calculate the distance between these vectors quickly is dist(x,y)^2 = <x,x> - 2<x,y> + <y,y>. This helps to reduce the number of calculations, since <x,x> can be calculated once and <y,y> can be calculated in advance for each row.
The data I have is a matrix M and a vector x. I want to calculate the distance between the vector x and each row vector of the matrix M.
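The identity above can be checked row-wise with a quick NumPy sketch (M and x are small hypothetical stand-ins for the real data):

```python
import numpy as np

# Toy check of the row-wise identity ||y - x||^2 = <x,x> - 2<x,y> + <y,y>.
rng = np.random.default_rng(0)
M = rng.standard_normal((1000, 64))
x = rng.standard_normal(64)

# Direct squared distances from x to every row of M.
direct = ((M - x) ** 2).sum(axis=1)

# Expanded form: <y,y> per row can be precomputed while M is fixed.
yy = (M ** 2).sum(axis=1)
expanded = x @ x - 2.0 * (M @ x) + yy

print(np.allclose(direct, expanded))  # True
```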
This page makes me think the following would work:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.inner.html
import theano
x = theano.tensor.matrix()
y = theano.tensor.vector()
inner_xx = (x ** 2).sum(axis=1)
inner_xy = theano.tensor.tensordot(x, y, axes=(-1, -1))
inner_yy = theano.tensor.tensordot(y, y, axes=(-1, -1))
then (note the per-row <x,x> term, which the squared-distance formula needs):
out = inner_xx - 2 * inner_xy + inner_yy
f = theano.function([x, y], out)
f(your_matrix, your_vector)
Hello, I'm calculating the distance between all rows of a matrix m and some vector v. m is a large matrix, about 500,000 rows and 2048 columns; v is of size (1, 2048).
Calculation phase:
numpy:
dist_2 = np.sum((nodes - node) ** 2, axis=1)
numexpr:
dist_2 = ne.evaluate('sum((nodes - node) ** 2, axis=1)')
Finding-min-k phase:
numpy, numexpr:
indx = np.argpartition(dist_2, k)[:k]
distance = dist_2[indx]
First, I'm experiencing some different results between the two ways of calculating: the indices of both calculations are the same, but the distances can differ starting from the 4th digit after the dot, for example:
Note that the above is when using float32, but when I used float64 the distances seemed to be equal in a debugger, yet the logical == returned false.
Questions:
Edit:
A 3rd option is to use sklearn:
dist_2 = euclidean_distances(node, nodes, squared=True).flatten()
In their source they do dist(x,y)^2 = <x,x> - 2<x,y> + <y,y>. It results in 0.92 seconds on average!
Additional questions: can numexpr do the same? m is fixed and only the vector v keeps changing. Is there any option to use the idea of sklearn with numexpr, but doing the <y,y> for every row in advance? Basically, to keep one long vector of <y,y> for every row of m, and to keep m as well, then just recalculate the <x,y> once for each row and do the above? What do you think?
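The precomputation idea in the last question can be sketched in plain NumPy (hypothetical sizes; the real m is ~500,000 x 2048, and whether NumExpr helps here would need benchmarking):

```python
import numpy as np

# Hypothetical small stand-in for the fixed matrix m.
rng = np.random.default_rng(1)
m = rng.standard_normal((10_000, 128)).astype(np.float32)

# Precompute <y,y> for every row once, while m stays fixed.
yy = np.einsum('ij,ij->i', m, m)

def dist_2(v):
    # dist(v, y)^2 = <v,v> - 2<v,y> + <y,y> for every row y of m;
    # per query only the matrix-vector product m @ v is recomputed.
    return v @ v - 2.0 * (m @ v) + yy

v = rng.standard_normal(128).astype(np.float32)
d = dist_2(v)  # squared distances, matches the direct computation
               # up to float32 rounding
```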