We need to implement ASR->ASR passes that perform vectorization. The idea is to write a pragma (comment) that directs the optimizer to apply a given transformation, specifying exactly which one and with which parameters for the given loop. Much later we can think about creating a heuristic optimizer (possibly profile driven). In this issue we simply have to implement the mechanism and all the options, so the design space is relatively limited. The hard part is to figure out how to best represent the vector operations in ASR.
Examples of loops:
Array copy
```python
a: A[n] = empty(n)
b: B[n] = empty(n)
B[:] = 5
# Equivalently: A[:] = B[:]
for i in range(n):
    A[i] = B[i]
```
->
```python
a: A[n] = empty(n)
b: B[n] = empty(n)
B[:] = 5
for i in range(0, n, 1024):
    # i already advances by 1024, so each chunk is A[i:i+1024]
    A[i:i+1024] = B[i:i+1024]
# plus a remainder loop
```
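For concreteness, the blocking transform above, including the remainder loop, can be sketched in plain Python (a stand-in for the ASR-level rewrite; `VL = 1024` is the assumed vector length and the function name is illustrative):

```python
VL = 1024  # assumed vector length; any well-defined width works

def blocked_copy(A, B, n):
    # Main loop: whole vector-length chunks, expressed as slice assignments
    # that a backend can map onto vector instructions.
    main = n - n % VL
    for i in range(0, main, VL):
        A[i:i+VL] = B[i:i+VL]
    # Remainder loop: the final n % VL elements, kept as a scalar loop here.
    for i in range(main, n):
        A[i] = B[i]
```

Usage: `blocked_copy(A, B, n)` copies the first `n` elements of `B` into `A` regardless of whether `n` is a multiple of `VL`.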
Possibly this can be transformed into:
```python
a: A[n] = empty(n)
b: B[n] = empty(n)
B[:] = 5
for i in range(0, n, 1024):
    vector_copy(1024, A[i:i+1024], B[i:i+1024])
# plus a remainder loop
```
This all happens at the ASR level. Then the backend very straightforwardly transforms vector_copy(n, A, B) with n=1024 (or another well-defined vector length) into a loop over AVX-512 or ARM vector instructions, or generates LLVM code that LLVM itself knows how to vectorize efficiently.
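The `vector_copy` intrinsic itself is nothing more than a fixed-trip-count copy loop; because `n` is a compile-time constant such as 1024, the backend can unroll it into vector instructions or hand it to LLVM as-is. A Python stand-in (the signature follows the `vector_copy(n, A, B)` form used above; the body is an assumption about its semantics):

```python
def vector_copy(n, dst, src):
    # n is a well-defined vector length (e.g. 1024) known at compile time,
    # so this loop maps directly onto AVX-512/ARM vector loads and stores.
    for k in range(n):
        dst[k] = src[k]
```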
Matrix-vector multiply
```python
for i in range(n):
    y[i] = 0
    for j in range(n):
        y[i] += A[i,j] * x[j]
```
We can vectorize the inner loop:
```python
for i in range(n):
    y[i] = 0
    for j in range(0, n, 1024):
        # elementwise product of the two chunks, reduced into the scalar y[i]
        y[i] += sum(A[i, j:j+1024] * x[j:j+1024])
```
Or the outer loop:
```python
for i in range(0, n, 1024):
    y[i:i+1024] = 0
    for j in range(n):
        y[i:i+1024] += A[i:i+1024, j] * x[j]
```
We can also permute the loop order, such as:
```python
y[:] = 0
for j in range(n):
    for i in range(n):
        y[i] += A[i,j] * x[j]
```
And again we can vectorize either the inner or the outer loop.
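To make the variants concrete, here is a small runnable sketch (Python stand-in, with a deliberately tiny `VL` and `n` a multiple of `VL` so no remainder loop is needed; function names are illustrative). All orderings compute the same matrix-vector product:

```python
VL = 4  # tiny assumed vector length so the example stays readable

def matvec_inner_vectorized(A, x, n):
    # i outer, vectorized inner j loop: each VL-chunk is reduced into y[i].
    y = [0.0] * n
    for i in range(n):
        for j in range(0, n, VL):
            y[i] += sum(A[i][j+k] * x[j+k] for k in range(VL))
    return y

def matvec_outer_vectorized(A, x, n):
    # Vectorized outer i loop: y[i:i+VL] is updated as a vector per (j, x[j]).
    y = [0.0] * n
    for i in range(0, n, VL):
        for j in range(n):
            for k in range(VL):
                y[i+k] += A[i+k][j] * x[j]
    return y

def matvec_permuted(A, x, n):
    # Permuted loop order from the text: j outer, i inner, scalar form.
    y = [0.0] * n
    for j in range(n):
        for i in range(n):
            y[i] += A[i][j] * x[j]
    return y
```

A pass only has to prove that the reduction order change is acceptable (e.g. under fast-math-style assumptions for floats) before picking any of these forms.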
So there are at least 4 possibilities and we have to implement all 4 and allow the user to select one via a pragma. There might be more options; for example, one might unroll the outer loop but vectorize the inner loop, or something along those lines.
For the remainder loop there are again at least two options: either iterate exactly over the remaining length without vectorizing, or pad it to the nearest vector length and vectorize it as well.
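The two remainder strategies can be sketched like this (Python stand-in; `VL` and the padding scheme are assumptions — option 2 requires the arrays to actually be allocated up to the padded length):

```python
VL = 1024  # assumed vector length

def copy_scalar_tail(A, B, n):
    # Option 1: vectorize the main part, then iterate the exact tail
    # length without vectorizing.
    main = n - n % VL
    for i in range(0, main, VL):
        A[i:i+VL] = B[i:i+VL]
    for i in range(main, n):  # exact-length scalar tail
        A[i] = B[i]

def copy_padded_tail(A, B, n):
    # Option 2: round n up to the nearest multiple of VL and vectorize
    # the tail too; A and B must be allocated out to the padded length.
    padded = n + (-n) % VL
    for i in range(0, padded, VL):
        A[i:i+VL] = B[i:i+VL]
```

Option 2 trades a little extra memory and copied garbage past `n` for a branch-free, fully vectorized loop.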