rougier / from-python-to-numpy

An open-access book on numpy vectorization techniques, Nicolas P. Rougier, 2017
http://www.labri.fr/perso/nrougier/from-python-to-numpy

anatomy of an array: multidimensional array start and stop from view #108

Open empeirikos opened 1 year ago

empeirikos commented 1 year ago

At the end of the section "Anatomy of an array" you give an example of how, given a view of a multidimensional array, we can infer the start:stop:step structure of each index. I have run the code and it works fine, but there is one line I have trouble following:

offset_stop = (np.byte_bounds(view)[-1] - np.byte_bounds(base)[-1]-1)//itemsize

Following the one-dimensional example, we are trying to find the offset between the ends of the two arrays; for this reason we take the address of the last byte of the view minus the address of the last byte of the base. This difference is then converted into a number of memory blocks by dividing by itemsize. However, there is a '-1' in the expression; initially I thought it was a mistake, but if I remove it the code no longer works. Still, I cannot make sense of it, since with it the difference no longer corresponds to a whole number of memory blocks. I am sure I am missing something trivial that would be immediately recognized by someone with more experience. Would you be able to help?

rougier commented 1 year ago

Good question, and I don't remember exactly :) I think the trailing -1 is there so that the stop offset lands on the start of the last byte rather than on the byte just after it. This would need to be tested with a very simple array, checking the values of the two bounds.
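
Something along these lines should show what the two bounds look like; this is only a quick sketch with a made-up 1-D array, not code from the book. np.byte_bounds returns the address of the first byte and the address just past the last byte, which is why the extra -1 is needed to land on the last element that actually belongs to the view:

import numpy as np

base = np.arange(8, dtype=np.int16)       # 8 elements, itemsize 2, 16 bytes
view = base[1:5]                          # covers indices 1, 2, 3, 4

lo_base, hi_base = np.byte_bounds(base)   # hi_* is one byte PAST the last byte
lo_view, hi_view = np.byte_bounds(view)

itemsize = base.itemsize
offset_start = (lo_view - lo_base)//itemsize        # 2//2  -> 1
offset_stop  = (hi_view - hi_base - 1)//itemsize    # -7//2 -> -4

print(offset_start, offset_stop)          # 1 -4
print(base.size + offset_stop)            # 4, the index of the last element of the view

Without the -1 the stop offset would be -3, and base.size - 3 = 5 would point one element past the end of the view instead of at its last element.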

empeirikos commented 1 year ago

Many thanks for your reply. Your online book has been a great help; I realise that I don't even know 10% of numpy!

I am trying to demonstrate that when iterating over the elements of a reshaped array it is faster to respect index precedence, i.e. to let the last index vary fastest. Unfortunately I am not having much luck and the two timings come out pretty much equal (I have copied the code below). Is there some other way to show that index precedence matters for performance?

import numpy as np
import timeit

# Small C-contiguous test array: 36 int16 values reshaped to (3, 2, 2, 3).
a = np.arange(1, 37).astype(np.int16).reshape((3, 2, 2, 3))

def without_precedence(a):
    # First index varies fastest: successive reads jump by the largest stride.
    total = 0
    n = 0
    for l in range(a.shape[3]):
        for k in range(a.shape[2]):
            for j in range(a.shape[1]):
                for i in range(a.shape[0]):
                    total += a[i, j, k, l]
                    n = n + 1
    return total, n

def with_precedence(a):
    # Last index varies fastest: successive reads are contiguous in memory.
    total = 0
    n = 0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for k in range(a.shape[2]):
                for l in range(a.shape[3]):
                    total += a[i, j, k, l]
                    n = n + 1
    return total, n

if __name__ == "__main__":
    print(timeit.timeit("with_precedence(a)", "from __main__ import with_precedence, a", number=100000))
    print(timeit.timeit("without_precedence(a)", "from __main__ import without_precedence, a", number=100000))


rougier commented 1 year ago

Why do you think it would make a difference?

empeirikos commented 1 year ago

I am thinking in terms of latency and the way the CPU cache operates. Since the array is stored as a one-dimensional block of memory, moving along the axis=0 direction means jumping by a stride of many bytes, so each read starts a new memory access, which typically costs extra clock cycles. The nice thing about the axis=1 direction is that once the program starts reading values, every next memory block is consecutive, so there is no extra latency.

Then there is the cache size (this effect is more debatable): with a level-2 cache and an array larger than it, the system caches successive blocks of the array, and if I move along axis=0 with a stride of a few thousand bytes I cannot reuse the values already sitting in the faster level-2 cache. This generates cache misses and the system has to fetch values from slower memory. By the way, I realise that my theory might be missing additional gains from SIMD that summing along axis=0 might generate (I have no idea whether numpy actually does this).
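
To make the stride argument concrete, here are the actual strides of the small array from my earlier snippet (purely illustrative):

import numpy as np

a = np.arange(1, 37).astype(np.int16).reshape((3, 2, 2, 3))
print(a.itemsize)   # 2
print(a.strides)    # (24, 12, 6, 2): a unit step along axis 0 jumps 24 bytes,
                    # a unit step along the last axis jumps only 2 (= itemsize)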


rougier commented 1 year ago

Oh right, cache! I'm not sure the difference is measurable here, since the Python loops would eat most of the CPU time. You might have a better chance with numba, which removes most of the Python overhead. Or you can try to use nditer.
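
A rough sketch of the numba route (my own illustration, assuming numba is installed; the array size is arbitrary, just chosen to be much larger than any CPU cache): jit-compiling the two loop orders removes the interpreter cost, so the cache effect has a chance to show up in the timings.

import timeit
import numpy as np
from numba import njit

# Large C-contiguous array (~128 MB of float64), far bigger than any CPU cache.
a = np.random.rand(64, 64, 64, 64)

@njit
def with_precedence(a):
    # Last index varies fastest: consecutive reads are contiguous in memory.
    total = 0.0
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for k in range(a.shape[2]):
                for l in range(a.shape[3]):
                    total += a[i, j, k, l]
    return total

@njit
def without_precedence(a):
    # First index varies fastest: consecutive reads jump by the largest stride.
    total = 0.0
    for l in range(a.shape[3]):
        for k in range(a.shape[2]):
            for j in range(a.shape[1]):
                for i in range(a.shape[0]):
                    total += a[i, j, k, l]
    return total

with_precedence(a)       # warm-up calls so compilation is not included in the timing
without_precedence(a)
print(timeit.timeit(lambda: with_precedence(a), number=10))
print(timeit.timeit(lambda: without_precedence(a), number=10))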