pydata / numexpr

Fast numerical array expression evaluator for Python, NumPy, Pandas, PyTables and more
https://numexpr.readthedocs.io/en/latest/user_guide.html
MIT License

Calculating the distance between a matrix and a vector - some difference with numpy #250

Closed: oak-tree closed this issue 8 months ago

oak-tree commented 7 years ago

Hello, I'm calculating the distance between all rows of a matrix m and a vector v. m is a large matrix, about 500,000 rows and 2048 columns; v has shape (1, 2048).

Calculation phase:

numpy

dist_2 = np.sum((nodes - node) ** 2, axis=1)

numexpr

dist_2 = ne.evaluate('sum((nodes - node) ** 2, axis=1)')

Finding min k phase:

numpy, numexpr

indx = np.argpartition(dist_2, k)[:k]
distance = dist_2[indx]
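The steps above can be sketched end to end on a small example (array sizes are shrunk so it runs quickly; variable names follow the snippets above, and `idx` stands in for `indx`):

```python
import numpy as np
import numexpr as ne

rng = np.random.default_rng(0)
nodes = rng.random((1000, 64)).astype('float32')  # shrunk stand-in for the 500,000 x 2048 matrix
node = rng.random((1, 64)).astype('float32')      # the query vector

# Squared Euclidean distance of every row of `nodes` to `node`, both ways
dist_np = np.sum((nodes - node) ** 2, axis=1)
dist_ne = ne.evaluate('sum((nodes - node) ** 2, axis=1)')

# Top-k phase: argpartition places the k smallest distances first (in arbitrary order)
k = 5
idx = np.argpartition(dist_np, k)[:k]
distance = dist_np[idx]
```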

First, I'm experiencing some differences between the results of the two calculations.

vector_size = 2000
rows_size = 500000
v = np.random.rand(1, vector_size).astype('float32')
mv = np.random.rand(rows_size, vector_size).astype('float32')
m = np.asmatrix(mv)

The indices from both calculations are the same, but the distances can differ starting from the fourth digit after the decimal point.

for example:

indx_numpy__ = [157516 4720]
indx_numexpr = [157516 4720]
distance_k_numpy__ = [ 302.80163574 307.67181396]
distance_k_numexpr = [ 302.80172729 307.67221069]

Note that the above is with float32; with float64 the distances looked equal in a debugger, but a logical == still returned false.

Questions:

  1. Any idea why there is a difference between the distances?
  2. Any idea how to speed this up? On matrices of this size it takes 3.5 seconds on average with numpy and 2.27 seconds with numexpr. My goal is to get it under a second.

Edit:

A third option is to use sklearn:

dist_2 = euclidean_distances(node, nodes, squared=True).flatten()

In their source they use the expansion dist(x, y)^2 = <x,x> - 2<x,y> + <y,y>. This runs in 0.92 seconds on average!
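The expansion ||x - y||^2 = <x,x> - 2<x,y> + <y,y> can be written directly in NumPy and checked against the direct computation (a minimal sketch with hypothetical sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
nodes = rng.random((1000, 64))   # matrix whose rows are the y vectors
node = rng.random(64)            # the query vector x

# Direct squared distances
direct = ((nodes - node) ** 2).sum(axis=1)

# Expanded form: <x,x> - 2<x,y> + <y,y> per row
xx = node @ node                 # scalar, computed once per query
xy = nodes @ node                # one dot product per row (a single GEMV)
yy = (nodes * nodes).sum(axis=1) # per-row squared norms; can be precomputed
expanded = xx - 2.0 * xy + yy
```

The speedup comes from `nodes @ node` being a single BLAS matrix-vector product instead of a full elementwise subtract, square, and reduce over the whole matrix.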

Additional Questions:

  1. I wonder if numexpr can do the same.
  2. Moreover, since the matrix m is fixed and only the vector v keeps changing, is there any way to use sklearn's idea with numexpr but compute <y,y> for every row in advance? Basically, keep one long vector holding <y,y> for every row of m (along with m itself), then per query just recompute <x,y> once per row and apply the expansion above. What do you think?
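One way that idea could look (a sketch, not code from the thread; `query` and all names are hypothetical): precompute the per-row norms once while m is fixed, then each query needs only one matrix-vector product plus a cheap numexpr elementwise combine.

```python
import numpy as np
import numexpr as ne

rng = np.random.default_rng(1)
m = rng.random((1000, 64)).astype('float32')  # fixed matrix (shrunk for the example)

# Done once, while m stays fixed:
yy = ne.evaluate('sum(m * m, axis=1)')        # <y,y> for every row of m

def query(v, k=5):
    """Squared distances from every row of m to v, and indices of the k nearest."""
    xy = m @ v                                # <x,y> per row, a single GEMV
    xx = float(v @ v)                         # <x,x>, a scalar per query
    dist_2 = ne.evaluate('xx - 2 * xy + yy')  # elementwise combine, multithreaded
    idx = np.argpartition(dist_2, k)[:k]
    return dist_2, idx

v = rng.random(64).astype('float32')
dist_2, idx = query(v)
```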
robbmcleod commented 7 years ago

Unfortunately the NumExpr 2.6 reductions are single threaded so you will not see a performance improvement from using sum inside NumExpr. I would suggest you try,

dist_2 = ne.evaluate('(nodes - node) ** 2').sum(axis=1)

I'm not clear on why there's a float32 difference between NumPy and NumExpr 2.6, but if it changes when you do the summation in NumPy afterwards as above, then the problem is likely in the NE reduction machinery. Also try pre-allocating your output to stop NumExpr from using output buffering:

temp = np.empty_like(nodes)
ne.evaluate('(nodes-node)**2', out=temp)
dist_2 = temp.sum(axis=1)

That will be harder on your RAM and memory bandwidth, however. Also try transposing your matrix and averaging over axis 0 instead, since I assume you have C-ordered arrays?

How much RAM do you have? Creating an additional permanent array for nodes_sqr in memory is of course simple (for both NumPy and NumExpr), but your array is already 4 GB. Pre-computing it may help, but on the other hand it may simply further restrict your memory bandwidth. It would depend a lot on your computer's configuration.

How many threads are you using for NumExpr? Don't use hyperthreading with it; set the thread count to the number of physical cores.

nouiz commented 7 years ago

This page explains why the answers aren't exactly the same:

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

In short, a+(b+c) doesn't give the same result as (a+b)+c in floating point. Doing the reduction in parallel changes the order of addition, so it changes the result. But in this case, it should give a result that is more precise! For the same inputs, compare the reduction in numpy in float32 vs float64 vs NE in float32 and float64. The NE result in float32 should be closer to the float64 results than numpy's float32 result.

Note, while this is true in general, I'm pretty sure there are corner cases where it is false, so to be really sure, it should be tried on multiple cases.

In Theano, we do the reduction with a float64 accumulator when the inputs are float32. This gives more precise results without slowing down the computation, as the bottleneck is reading from memory, not the arithmetic. I think numpy uses a float32 accumulator; at least, that is what it did in the past. I would suggest investigating a float64 accumulator by default for NE3, so that by default NE3 would be faster and more precise than NE2 and NumPy.
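For what it's worth, NumPy does let you request a wider accumulator explicitly via the `dtype` argument of `sum`; a small sketch of the effect (the exact size of the discrepancy depends on the data):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(10_000_000).astype('float32')

s32 = a.sum()                  # accumulated in float32 (NumPy uses pairwise summation)
s64 = a.sum(dtype=np.float64)  # same float32 input, float64 accumulator
err = abs(float(s32) - s64)    # rounding discrepancy between the two accumulators
```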

Fred


robbmcleod commented 7 years ago

Fred,

Ooph, I was writing a response and then Chrome crashed on me. Thanks for the article on machine precision. I've read a similar (yet less detailed) manuscript long ago but the refresher was needed on my part. Unfortunately I think reductions are fairly far into the future for the NE3 roadmap. Certainly they will have to be done better than before to make implementing them worthwhile.

I do want to support explicit casting in NE3, which should make it possible for float32 calculations to use float64 temporaries but return float32 to system RAM.

oak-tree commented 7 years ago

Hey @robbmcleod @nouiz, thanks a lot for your quick replies and for throwing some light on the differences between the numpy and numexpr results.

Fred, thanks for the interesting information; the FPU difference makes sense. An off-topic question: is it possible to do this calculation faster in Theano? Calculating <x,y> and then -2<x,y> + <y,y> for each row is 400,000 separate calculations, so it should be a good situation for a GPU to do it in parallel.

@robbmcleod I have more than 8 GB, so no issue so far, although I'm looking for a real-time solution. What about the idea of splitting it into per-row <x,y> calculations in numexpr; is that possible? I'll check the idea of pre-allocating space for the output later on.

Thanks, Alon


oak-tree commented 7 years ago

Hey, another question: looking at the numpy docs, it seems I do have C-ordered arrays. What difference does that make for performance when summing on the transposed matrix?

Thanks!
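Whether the transpose trick helps depends on memory layout; at minimum one can verify the layout flags and that the two reductions are equivalent (a sketch):

```python
import numpy as np

m = np.arange(12, dtype='float32').reshape(3, 4)  # small stand-in matrix
assert m.flags['C_CONTIGUOUS']    # rows are contiguous in memory

# For a C-ordered array, m.T is an F-ordered view (no data is copied),
# so summing it over axis 0 walks memory in exactly the same order:
s1 = m.sum(axis=1)
s2 = m.T.sum(axis=0)
```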

nouiz commented 7 years ago

What is the <> operator? Theano fuses consecutive elemwise operations. On the GPU, in rare cases for now, we also fuse reductions with elemwise ops. Intel has a branch where, on CPU, they fuse some or all reductions and elemwise ops; I didn't look at it in detail.

So Theano will do some optimization for your use case and, depending on the operator, could do more. Give it a try. We aren't great at CPU parallelism, but the Intel torch fork helps with that.


oak-tree commented 7 years ago

<x,y> is the inner product between x and y, where x and y are vectors. A good way I saw online to calculate the distance between such vectors quickly is dist(x, y)^2 = <x,x> - 2<x,y> + <y,y>. This reduces the number of calculations: <x,y> needs to be calculated only once per row, and <y,y> can be calculated in advance per row. The data I have is a matrix M and a vector x, and I want to calculate the distance between the vector x and each row vector of matrix M.

nouiz commented 7 years ago

This page makes me think the following would work:

https://docs.scipy.org/doc/numpy/reference/generated/numpy.inner.html

import theano
x = theano.tensor.matrix()
y = theano.tensor.vector()

inner_xx = (x ** 2).sum(axis=-1)
inner_xy = theano.tensor.tensordot(x, y, axes=(-1, -1))
inner_yy = theano.tensor.tensordot(y, y, axes=(-1, -1))

then:

out = inner_xx - 2 * inner_xy + inner_yy

f = theano.function([x, y], out)
f(your_matrix, your_vector)

