lcompilers / lpython

Python compiler
https://lpython.org/
Other
1.51k stars 164 forks source link

Vectorization #640

Open certik opened 2 years ago

certik commented 2 years ago

We need to implement ASR->ASR passes that implement vectorization. It seems we write a pragma (comment) which will direct the optimizer to apply a given transformation and specify exactly which one and which parameters for the given loop. Much later we can think about creating some heuristic optimizer (possibly profile driven). In this issue we simply have to implement the mechanism and implement all options. So the design space is relatively limited. The hard part is to figure out how to best represent the vector operations in ASR.

Examples of loops:

Array copy

a: A[n] = empty(n)
b: B[n] = empty(n)
B[:] = 5
# Equivalently: A[:] = B[:]
for i in range(n):
    A[i] = B[i]

->

a: A[n] = empty(n)
b: B[n] = empty(n)
B[:] = 5
for i in range(0,n,1024):
    A[i*1024:(i+1)*1024] = B[i*1024:(i+1)*1024]
# plus a remainder loop

Possibly this can be transformed into:

a: A[n] = empty(n)
b: B[n] = empty(n)
B[:] = 5
for i in range(0,n,1024):
    vector_copy(1024, A[i*1024:(i+1)*1024], B[i*1024:(i+1)*1024])
# plus a remainder loop

This all happens at the ASR level. Then the backend very straighforwardly transforms vector_copy(n, A, B) with n=1024 (or other well defined vector lengths) into a loop over AVX-512 or ARM vector instructions. Or generates LLVM code that LLVM itself knows how to efficiently vectorize.

Matrix-vector multiply

for i in range(n):
    y[i] = 0
    for j in range(n):
        y[i] += A[i,j] * x[j]

We can vectorize the inner loop:

for i in range(n):
    y[i] = 0
    for j in range(0,n,1024):
        y[i] += A[i,j*1024:(j+1)*1024] * x[j*1024:(j+1)*1024]

Or the outer loop:

for i in range(0,n,1024):
    y[i*1024:(i+1)*1024] = 0
    for j in range(n):
        y[i*1024:(i+1)*1024] += A[i*1024:(i+1)*1024,j] * x[j]

We can also permute the loop order, such as:

y[:] = 0
for j in range(n):
    for i in range(n):
        y[i] += A[i,j] * x[j]

And again we can vectorize the inner and outer loop.

So there are at least 4 possibilities and we have to implement all 4 and allow the user to select using a pragma. There might be more options (I can't remember, but one might perhaps unroll the outer loop but vectorize the inner loop, or something like that).

For the remainder loop there are again at least two options: either iterate exactly over the correct length and not vectorize, or pad it to the nearest vector length and also vectorize.

czgdp1807 commented 2 years ago

Will start working on this after completing #651.

tanay-man commented 9 months ago

I would like to implement this. Can anyone provide resources to get started on? Is this similar to loop_vectorise in libasr/pass/ ?