mn416 / QPULib

Language and compiler for the Raspberry Pi GPU
Other
429 stars 64 forks source link

Trouble extending HeatMap to kernels with more than 3 rows #11

Closed cbayley108 closed 6 years ago

cbayley108 commented 6 years ago

Hi @mn416 ,

I am trying to use QPULib for a convolutional neural network and adapted the HeatMap example code to perform a convolution. I ran into some interesting problems and tried iteratively changing the HeatMap code to see where the problem was. It seems that creating 5 cursor objects rather than 3, and initializing them to copies of the same area as the original 3 to avoid out of bounds issues, results in a different result than simply using 3. These rows are still primed, advanced and finished but their values are simply never used. Running on the emulator returns the correct result but running on a pi gives an incorrect result. The results are incorrect running on any number of QPUs and additionally the result is not the same every time. Interestingly, 4 rows did not work as well until I changed the order of gathers and receives in the Cursor advance function. Please let me know if this is an error on my end or something wrong with the library. Thanks!

cbayley108 commented 6 years ago

Update:

Every value involved in the sum is correct (that is the left, row, and right values) yet the sum will vary based on the number of rows which are created. When writing partial sums to memory all the values match. For example, writing the sum of all of the left and right values will achieve the same result, writing the sum of all of the row values will achieve the same result, writing the sum of all values will diverge.

mn416 commented 6 years ago

Hi @cbayley108,

Thanks for the detailed reports, that does sound strange.

Based only on what you've said, this could be related to the number of outstanding gathers, which obviously rises with the number of cursors. Reading pages 39 and 40 of the VideoCore IV manual, the hardware limit on the number of outstanding gathers looks like it is 8 but there is also some suggestion it is only 4.

Supposing it is only 4, then I can see why you would need to modify the advance function to support 4 cursors because the way the code is written at the moment will result in n+1 outstanding gathers for n cursors. By reordering the gather and receive in advance (as you did), you can make this only n outstanding gathers for n cursors. And if the outstanding gather limit is only 4, then 5 cursors is too many and the code will need more serious adaptation.

If you think the hardware limit is actually 8, so there is a bug in the compiler, then please do let me know and I'll try to look at it when I get some time. But at first sight, a hardware limit of 4 does seem to explain the issues you are seeing. You could try writing a tiny QPULib program to determine the hardware limit experimentally.

Thanks for the excellent report!

Edit: I hope you don't mind, but I'll rename the issue as it's not yet clear to me that this is an error with HeatMap, the compiler, or just the documentation (i.e. stating the hardware limit as 8 when it might be 4).

mn416 commented 6 years ago

As an aside, you might take a look at the 1D convolution algorithm I proposed in issue #9. Generalising this to 2D might be a better approach for larger kernel sizes. In that approach, you rotate the kernel rather than the data.

cbayley108 commented 6 years ago

Thanks for the quick reply! After some investigating using a simple adding kernel the limit on gathers does seem to be 8. By this I mean having 8 outstanding gathers is an issue, and only 7 should be allowed at once. Another thought I had was that there may be an issue with the store instruction, but after swapping this out for the blocking write counter part the issue remained the same. Also, the algorithm you described in issue #9 doesn't seem correct; each value in acc will end up being the sum of one value in chunk multiplied by every weight value. This will result in each value in the vector being a sum of one input multiplied by each weight rather than each value in the vector being the sum of every value of the input multiplied by the corresponding weight.

mn416 commented 6 years ago

By this I mean having 8 outstanding gathers is an issue, and only 7 should be allowed at once.

Ok thanks, I'll try to take a look at this one then.

Also, the algorithm you described in issue #9 doesn't seem correct; each value in acc will end up being the sum of one value in chunk multiplied by every weight value

Ah, of course, sorry about that!

mn416 commented 6 years ago

Hi @cbayley108,

I've run my own experiment which suggests that the hardware limit is 4 outstanding gathers:

void test(Ptr<Int> p)
{
  Int x, y, z;
  gather(p+index());
  gather(p+16+index());
  gather(p+32+index());
  gather(p+48+index());
  //gather(p+64+index());

  Int i = 0;
  While (i < 1024)
    i++;
  End

  receive(x);
  receive(y);
  receive(z);
  receive(z);
  //receive(z);
  *p = x+y;
}

This program produces the expected result, but if you uncomment the two lines (to introduce 5 outstanding gathers rather than 4) then it yields the incorrect result.

cbayley108 commented 6 years ago

I apologize, you're right. The kernel I used to test it also seems to show errors after 4 gathers.

mn416 commented 6 years ago

I think we've resolved this issue, so I'll close it for now. Please reopen if you think otherwise.