cylinbao closed this issue 7 years ago
Just save A[i] into a local variable and use that in the innermost for loop.
for (i = 0; i < 2; i++) {
    int a = A[i];
    for (j = 0; j < 4; j++)
        C[j] = a + B[j];
}
This works because the local variable a is represented by a register, which can be read concurrently by as many operations as needed.
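To make the effect concrete, here is a rough sketch of what the inner loop looks like once it is unrolled by 4. The unrolling is written out by hand purely for illustration (in ALADDIN you would normally request it through the configuration rather than in the source), and the function name add_broadcast and the constants N_OUTER and N_INNER are made up for this example. All four additions read the same local a, so A itself is only read once per outer iteration.

```c
#include <stddef.h>

/* Hypothetical sizes matching the loop bounds in the snippet above. */
enum { N_OUTER = 2, N_INNER = 4 };

/* Illustrative only: inner loop unrolled by hand with a factor of 4. */
void add_broadcast(const int A[N_OUTER], const int B[N_INNER], int C[N_INNER]) {
    for (size_t i = 0; i < N_OUTER; i++) {
        /* Single read of A[i]; the value then lives in a local (a register). */
        int a = A[i];

        /* The four adds below all read a; none of them touches A again. */
        C[0] = a + B[0];
        C[1] = a + B[1];
        C[2] = a + B[2];
        C[3] = a + B[3];
    }
}
```

As in the original loop, each outer iteration overwrites C, so after the call C holds A[1] + B[j]. B and C are still accessed four times per iteration, so this sketch says nothing about whether those two arrays need partitioning.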
OK, I see. But I have another question: is there a limit on the number of concurrent reads of a register in ALADDIN?
Nope - see my last sentence :)
Ohoh, I see. Thanks a lot!
Suppose an accelerator has multiple PEs that share a local buffer. Can the buffer (memory) in ALADDIN support a broadcast mechanism, so that I don't need to partition it to get correct behavior?
What I want to do is like this
I want to unroll the second for-loop by a factor of 4, but the data read from array A is the same for every iteration of that loop. In my tests, I still need to partition all three arrays (A, B and C) by a factor of 4 to get the correct performance behavior. So I want to confirm whether ALADDIN supports data broadcasting or not. Thanks.