mn416 / QPULib

Language and compiler for the Raspberry Pi GPU
Other
431 stars 63 forks source link

Possible problems with register values #12

Open cbayley108 opened 6 years ago

cbayley108 commented 6 years ago

Hi @mn416 , I've had a couple of issues which might stem from some problems with register values. The first problem is that there are segfaults or invalid frees from just calling compile on my kernel function. I believe this was due to excessive loop unrolling and changing for(int ... loops into For(Int... loops seems to fix the problem. Another problem I've seen a couple times is that int values do not seem to keep their values between iterations of a For loop. I'm not sure if that is intended behavior and I shouldn't use int values in a For loop over an Int value, so please let me know what the correct usage would be. As an example the following code would write a value of 1 for every output value:

int val = 0; For(Int i = 0, i < len, i = i +16) val++; out[i] = val; End

cbayley108 commented 6 years ago

Hi, on a similar note it seems that calls to rotate with the second value as an Int vector will work on the emulator but not on the actual QPU. It seems like something that wouldn't be possible on the QPU so I'm guessing this is a mistake on the emulator side. However I have an issue where I'd like to rotate a vector by a constant amount, but that constant amount is in an Int vector not an int. I'm forced to use an Int here because when using an int in a normal for loop there is a segfault at compilation of the kernel as I outlined above. Is there a way to extract the int value from the Int? I tried using Int.expr->intLit but this value seems to always be 0. Please let me know if there is a way I can get an int value which matches the value of an Int in a For loop, thanks!

mn416 commented 6 years ago

Hi @cbayley108,

The first problem is that there are segfaults or invalid frees from just calling compile on my kernel function

So that could be a bug in QPULib. Even if the size of the kernel code (after unrolling) exceeds some limit, the compile function should at least fail gracefully. The best thing to do here is to fork the QPULib repo, and then push the program that yields the segfault. That way, I can easily reproduce it and, hopefully, fix it.

As an example the following code would write a value of 1 for every output value:

int val = 0;
For(Int i = 0, i < len, i = i +16)
  val++;
  out[i] = val;
End

That is the expected behaviour, because int code is executed at compile time, i.e. when you call the compile function, not at runtime, whereas the For loop is a run-time construct, executed on the QPUs.
So when you call compile(f), f is actually executed on the ARM core with constructs such as For, and Int assignments, being used to generate code for the QPUs. So think of f as a code generator: constructs like for, and computations on int, do no exist by the time the code is running on the QPUs -- they've already been evaluated away when the compile function was called.

mn416 commented 6 years ago

on a similar note it seems that calls to rotate with the second value as an Int vector will work on the emulator but not on the actual QPU.

I think this should work on the actual QPU too (as long as the value is in the range 0..15). I'll take a look (might be a few days before I can do that).

Is there a way to extract the int value from the Int?

Remember the Int exists at run-time (i.e. on the QPUs), whereas the int only exists at compile-time (i.e. on the ARM core), so a function to convert from Int->int doesn't really make sense within a kernel.

The only way to do it is to write the Int to memory, then return from the kernel and the ARM core can read it as an int.

Thanks for the reports.

cbayley108 commented 6 years ago

Thanks for the response! The QPU doesn't seem to like receiving an Int as an argument for rotate. Or possibly it isn't an issue with the actual QPU but when running without the emulator an assert fails in Encode.cpp.

cbayley108 commented 6 years ago

Also because this project is for a school assignment I am not able to make the code publicly available according to our course policy. However, the issue occurs when there are (highly) nested for loops, or even nested loops of either For or for. In general it also seems to occur when there are more nested loops added in terms of breadth, not depth. In my code because I am performing multiple 2D convolutions over multiple images the total depth of the nested for loops gets pretty high, up to 6 deep. I understand this is probably more than was intended but I'm now at the point where I cannot use local data structures as I cannot create more for loops to iterate through them, and they cannot be indexed by an Int in a For loop.

For example, I'd like to precompute some data and store it in a 2d array as follows: Float data [HEIGHT][WIDTH]; for(int i = 0; i < HEIGHT; i++){ for(int j = 0; j < WIDTH; j++){ data[i][j] = // some value } } However I am not able to create more for loops without causing segfaults in emulation mode, or just errors and silent failures on the actual QPU. Because of this I cannot create a local data structure like above. Do you have any suggestions for a temporary work around in the mean time? Additionally I cannot collapse this into one for loop as this value is dependent on j, which is not a power of 2 so I cannot do any sort of division to get the value, and using other values to keep track of it puts me over the register limit.

Overall the library has worked pretty well and I've managed to get a working 2D convolution of an image which is not 16 aligned working slightly faster than the CPU version. If I can use local values as mentioned above it seems it may be possible to double that performance as well. Also I'd like to note that some of your scalar times for the example code seems a bit off from the performance that I see. For HeatMapScalar the listed run time is about 400s, but in practice I seem to see about 100s, however the parallel performance matches very well.

mn416 commented 6 years ago

The QPU doesn't seem to like receiving an Int as an argument for rotate. Or possibly it isn't an issue with the actual QPU but when running without the emulator an assert fails in Encode.cpp.

Turns out this was due to a single character typo in the assertion. I've pushed a fix: c0f798b. Looks like it's working now. Thanks for the report.

mn416 commented 6 years ago

Also because this project is for a school assignment I am not able to make the code publicly available according to our course policy. However, the issue occurs when there are (highly) nested for loops, or even nested loops of either For or for

It's a pity you're not allowed to provide a code sample, especially as you're seeing undesirable behaviour in emulation mode where it should be easy for me to diagnose. It's difficult for me to fix if I can't easily reproduce the problem, but thanks anyway for the report.

Overall the library has worked pretty well and I've managed to get a working 2D convolution of an image which is not 16 aligned working slightly faster than the CPU version.

Nice! Glad to hear you at least get some speed up using the library. I guess there is huge scope to improve the performance of the library, but I don't have much time to work on it these days.

For HeatMapScalar the listed run time is about 400s, but in practice I seem to see about 100s, however the parallel performance matches very well.

My performance measurements where done on the Pi 1 Model B. Perhaps you're using a Pi 2 or a Pi 3? In your assignment, there are a few reasons you might like to also include results for the Pi 1 Model B (which should be more favourable):

  1. I think the Pi 1 Model B has the same chip as the much smaller Pi Zero, so for space constrained systems this is an important design point.

  2. I think the Pi 1 Model B has lower power consumption than the Pi 2 or Pi 3.

cbayley108 commented 6 years ago

Thanks for making that fix! I can talk with my professor about it, he'll likely allow me to send you the code for debugging purposes but it would have to be through email or something less public than Github. And that makes some sense in terms of the scalar run times; I am running on a Pi 3 model B which has a 1.2 Ghz processor compared to the Pi 1's 700 Mhz. I'm not sure where the factor of 4 difference is coming from as the clock is only about 2x faster, but I suppose this could be due to a number of differences between the Pi's.

mn416 commented 6 years ago

I can talk with my professor about it, he'll likely allow me to send you the code for debugging purposes

That would be great, if you're allowed.

I'm not sure where the factor of 4 difference is coming from as the clock is only about 2x faster, but I suppose this could be due to a number of differences between the Pi's.

A couple of other factors, beyond just clock frequency: the cache sizes are different, and probably the CPI (cycles per instruction) too.