wimrijnders closed this 6 years ago.
I really really couldn't resist. Never mind my actual work or other important tasks!
I'm actually quite impressed with the running time of the Pi 2: it's only about 14 times slower than my 3GHz i7. I find that amazing for such a dinky processor.
The goal, as far as I'm concerned, is to have the VideoCore beat this i7 value. I want to see a $40 computer make mincemeat of my Intel laptop.
Added first version of a QPU kernel. This works with the emulator, not tested yet with hardware.
I must honestly say that the conversion from scalar to QPU was straightforward, congrats on that. Sincere feedback on my first kernel attempt is appreciated.
Also some code cleanup.
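For context (this is my own sketch, not the committed code), the scalar per-point loop that the kernel ports looks roughly like this; the name `mandelbrot_count` is hypothetical:

```cpp
// Scalar Mandelbrot loop for one point c = (reC, imC).
// Returns the iteration count at which |z|^2 escapes 4,
// capped at numIterations (points in the set never escape).
int mandelbrot_count(float reC, float imC, int numIterations) {
  float re = 0.0f, im = 0.0f;
  int count = 0;
  while (re * re + im * im < 4.0f && count < numIterations) {
    float reNew = re * re - im * im + reC;  // z^2 + c, real part
    im = 2.0f * re * im + imC;              // z^2 + c, imaginary part
    re = reNew;
    ++count;
  }
  return count;
}
```

The QPU version runs this same recurrence on 16 points at once, one per vector lane.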
I used the following construct on a hunch:
BoolExpr condition = (radius < 4 && count < numiterations);
While (any(condition))
Where (condition)
...
I'm a bit surprised actually that it worked; it appears to be recalculated on every iteration as well. So BoolExpr works as a kind of lambda, apparently.
Is this correct? Would you expect it to work properly like this? Or am I stretching the definition here? Keep in mind that I've only run it on an emulator. Perhaps there are some devious differences with the QPU.
Working on QPU! Execution time of 0.215066s with 1 QPU, 192x192 points.
Pardon my language, but I'm fucking impressed. WORKING! This DSL thing you crafted actually delivers.
Drinks are on me if ever we meet.
I had to reduce the resolution to 192x192 (was 1024x1024), because otherwise you get a heap alloc error.
Time comparison:
Platform | Kernel | Time (s) |
---|---|---|
i7 | scalar | 0.032004 |
i7 emulator | 1 | 1.303475 |
p2 | scalar | 0.470066 |
p2 emulator | 1 | 26.990710 |
p2 | 1 | 0.215066 |
i7 scalar is still 7 times faster than p2 kernel 1. However, this is 1 QPU and completely unoptimized code.
Woohoo! Really nice!!! I haven't looked at the code yet, but will do and offer suggestions if any come to mind. We should put this as one of the introductory examples in the README :)
BoolExpr condition = (radius < 4 && count < numiterations);
Yes, BoolExpr is an expression not a value -- it isn't actually evaluated at this point.
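The deferred-evaluation behaviour can be illustrated outside QPULib with a tiny expression-template sketch. These are my own toy types, not QPULib's internals, but they show the same principle: the expression captures how to compute the condition, not its current value:

```cpp
#include <functional>

// Toy stand-ins for QPULib's Int variable and BoolExpr.
struct Var { int value; };

struct BoolExprToy {
  std::function<bool()> eval;  // deferred: evaluated on demand
};

// Building the expression stores a closure; nothing is evaluated yet.
BoolExprToy operator<(Var &a, int limit) {
  return { [&a, limit] { return a.value < limit; } };
}

// Re-evaluating the same expression object sees updated variables,
// just like re-checking the condition on every loop iteration.
int run_until(Var &count, const BoolExprToy &cond) {
  int turns = 0;
  while (cond.eval()) {
    ++count.value;
    ++turns;
  }
  return turns;
}
```

In QPULib the "closure" is really the generated code being inlined at each use site, but the observable effect is the same.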
I'm looking forward to the 12 QPU version :)
First suggestion: instead of
result[i] = count;
try
store(count, result+i);
The latter is non-blocking -- it doesn't wait until the store is complete before continuing execution.
Second suggestion: try using gather
and receive
instead of array lookups for the loads. However, I'd probably try multiple QPUs first.
By the way, does Mandelbrot require the loads? Can the fractal be produced without reading any input arrays?
Despite these suggestions, I'm glad you implemented the non-optimised version first. It looks very neat.
Second suggestion: try using gather and receive instead of array lookups for the loads. However, I'd probably try multiple QPUs first.
Okay, but that's for kernel 2. Right now, I'm more interested in getting the current code optimized. Deluge me with suggestions!
By the way, does Mandelbrot require the loads? Can the fractal be produced without reading any input arrays?
Well, no to the first and yes to the second question. It's entirely possible to initialize everything with the given parameters, but I haven't figured out how to do that yet. All I've got is what I understand from GCD and Rot3D. Suggestions are welcome; otherwise, have patience.
EDIT: Well, ....
Here's the second iteration of the Mandelbrot kernel. It does away with the input arrays. The trick here was to understand the usage of index().
Also some code 'optimization' - in quotes because it didn't make one bit of difference.
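The idea of dropping the input arrays can be mimicked in plain C++: each of the 16 vector lanes derives its own pixel coordinate from its lane index plus the step parameters, instead of loading it from memory. This is a sketch of the principle only; `NUM_LANES`, `Coord`, and `lane_coord` are my own names, not QPULib API:

```cpp
// A QPU vector has 16 lanes; QPULib's index() yields 0..15 per lane.
constexpr int NUM_LANES = 16;

struct Coord { float re; float im; };

// pixelBase is the x-position of lane 0 within the row; the lane
// index replaces what was previously a load from an input array.
Coord lane_coord(float topLeftRe, float topLeftIm,
                 float offsetX, float offsetY,
                 int row, int pixelBase, int lane) {
  int x = pixelBase + lane;
  return { topLeftRe + offsetX * float(x),
           topLeftIm - offsetY * float(row) };
}
```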
Calculation with 192x192 points; 1.0 previous kernel, 1.1 new kernel
Platform | Kernel | Time (s) |
---|---|---|
i7 | scalar | 0.032004 |
Pi2 | scalar | 0.470066 |
Pi2 | 1.0 | 0.215066 |
Pi2 | 1.1 | 0.209823 |
Well, maybe a tiny bit. I think it's fair to say that this kernel is computation-bound, because removing the data transport does not make one bit of difference (do you agree?).
The nice thing about not having the input arrays is that the points can be scaled up again to 1024x1024:
Calculation with 1024x1024 points
Platform | Kernel | Time (s) |
---|---|---|
i7 | scalar | 0.869301 |
Pi2 | scalar | 13.712409 |
Pi2 | 1.0 | couldn't run |
Pi2 | 1.1 | 5.005170 |
Yes, BoolExpr is an expression not a value -- it isn't actually evaluated at this point.
I understand now. The generated code is inlined and therefore it's as if it's called as a lambda. I actually really like this serendipitous capability, you should use it as a selling point and formalize it in the documentation.
I used this construct for a further optimization. This double use of condition bothered me:
BoolExpr condition = (radius < 4 && count < numiterations);
While (any(condition))
Where (condition)
...
... because with my new insight it's obvious that the condition got executed twice - overhead. So I used the same principle to tweak the condition to something that can be stored in a variable so that only that variable needs to be checked:
FloatExpr condition = (4.0f - (reSquare + imSquare))*toFloat(numiterations - count);
Float checkvar = condition;
While (any(checkvar > 0))
Where (checkvar > 0)
And there is a slight improvement:
Calculation with 1024x1024 points
Platform | Kernel | Time (s) |
---|---|---|
i7 | scalar | 0.869301 |
Pi2 | scalar | 13.712409 |
Pi2 | 1.1 | 5.005170 |
Pi2 | 1.1 tweaked | 4.370737 |
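The tweak relies on the two formulations agreeing. A quick scalar check (my own harness, hypothetical names) suggests they match whenever `count` never exceeds `numiterations`, which the masked loop should guarantee since a lane stops incrementing `count` once its condition fails:

```cpp
// Original condition:  radius < 4 && count < numiterations
// Tweaked condition:   (4 - radius) * (numiterations - count) > 0
// These agree as long as count <= numiterations, so that the two
// factors are never both negative at the same time.
bool original_cond(float radius, int count, int numIterations) {
  return radius < 4.0f && count < numIterations;
}

bool tweaked_cond(float radius, int count, int numIterations) {
  return (4.0f - radius) * float(numIterations - count) > 0.0f;
}

// Exhaustive check over a grid of plausible values.
bool conditions_agree(int numIterations) {
  for (int count = 0; count <= numIterations; ++count)
    for (float radius = 0.0f; radius <= 8.0f; radius += 0.25f)
      if (original_cond(radius, count, numIterations) !=
          tweaked_cond(radius, count, numIterations))
        return false;
  return true;
}
```

Note the caveat: if `count` could ever exceed `numiterations`, both factors would go negative and the product would turn positive again, so the equivalence depends on the loop stopping the increment.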
Given:
Int a;
Float b;
Float c;
... the following don't work:
c = a+b; // No operator for Int, Float combination
c = a*b; // idem
c += b; // operator doesn't exist
a = (b < 0); // Can't assign result BoolExpr to Int
There are alternatives to the first three of course:
c = toFloat(a)+b;
c = toFloat(a)*b;
c = c + b;
But I personally would truly appreciate it if the initial versions worked. I can sort of understand if you want to have explicit casts, but still.
I hereby place a feature request for the given operators, and also for the conversion of a BoolExpr result to Int.
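For what it's worth, the first three requests could probably be met with thin overloads that insert the cast. This is a sketch with stand-in types (`IntToy`, `FloatToy`, `toFloatToy` are mine), not QPULib's actual classes:

```cpp
// Stand-in types showing how mixed Int/Float operators could simply
// forward to an explicit conversion, so `a + b` means `toFloat(a) + b`.
struct IntToy   { int value; };
struct FloatToy { float value; };

FloatToy toFloatToy(IntToy a) { return { float(a.value) }; }

FloatToy operator+(IntToy a, FloatToy b) {
  return { toFloatToy(a).value + b.value };  // cast hidden in the overload
}

FloatToy operator*(IntToy a, FloatToy b) {
  return { toFloatToy(a).value * b.value };
}

// c += b desugars to c = c + b.
FloatToy &operator+=(FloatToy &c, FloatToy b) {
  c.value = c.value + b.value;
  return c;
}
```

The explicit-cast design stays intact internally; the overloads just write the cast for you.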
Also, a minor point, following does not work:
store(count, result[index]);
I had to do it like this instead:
store(count, result + index);
But TBH this is a small thing I can live with.
I hope it goes without saying that any optimizations you can think of are appreciated. I want to embarrass the i7, but we're not close yet!
And I'll repeat, I'm impressed with your efforts at making this work. I just starred your project, great work! Hope I can help to make it even better.
I hope it goes without saying that any optimizations you can think of are appreciated.
I see you got rid of the loads, excellent. I agree, the kernel is now compute bound so should scale up to 12 QPUs without much hassle. I'm not saying we're getting optimal performance from a single QPU, but that is surely more the compiler's fault than the program's, in this case.
Update, initial 12 QPU version:
Calculation with 1024x1024 points, 2 is multi-QPU kernel
Platform | Kernel | Time (s) |
---|---|---|
i7 | scalar | 0.869301 |
Pi2 | scalar | 13.712409 |
Pi2 | 1.1 | 4.370737 |
Pi2 | 2 | 2.187408 |
:-( I'm just so intensely disappointed right now. I'll see if I can tweak it further, then I'll commit for your insights.
Calculation with 1024x1024 points, Pi2, kernel 2 - multi-QPU
Num QPU's | Time (s) |
---|---|
1 | 4.427383 |
2 | 3.272469 |
3 | 2.755362 |
6 | 2.492874 |
12 | 2.192445 |
Not linear with num QPU's as I was expecting.....
I can only imagine that there is a bottleneck created by the store function. What if you remove all calls to store, how well does the scaling work then?
I'm starting to think that the VideoCore doesn't like the way I am using the DMA unit -- lots of single-vector DMA requests. If so, this is a good thing to learn because it is probably also the bottleneck in other QPULib examples, and it is fixable.
Is there any difference between If() and Where()? They appear to do the same thing.
Would a Continue and/or Break statement be possible? E.g.
While (condition) // Also For(); any loops
  If (condition2) Continue; End // Do next iteration of the loop
  If (condition3) Break; End // Exit the loop
End
The same goes for Return. Also, to break out of a sub-generator, e.g.:
void func_1() {
  ...
  Return; // Get out of current generator
  ...
}
void func_2() {
  ...
  func_1();
  ...
}
An Exit would also be nice, but I think that's what kernelFinish() does.
What if you remove all calls to store, how well does the scaling work then?
No difference.
Is there any difference between If() and Where()? They appear to do the same thing
Where allows assignment to a subset of elements of a vector, where that subset is determined by the condition.
If executes different code depending on the condition.
I committed the last changes: mandelbrotCore().
Right now I'm hoping for a duh-moment where you point out some obvious error to me.
Where allows assignment to a subset of elements of a vector, where that subset is determined by the condition.
Sorry, don't get it. Example to point out difference?
I'm stopping now, wasted[1] too much time on this already. I should be working right now!
[1] 'wasted' being a relative term. I'm having loads of fun doing this.
Sorry, don't get it. Example to point out difference?
Where (x > 10) x++; End
The above increments the elements of vector x that are larger than 10.
If (any(x > 10)) x++; End
The above increments every element of x, but only when at least one of its elements is larger than 10.
OK so far. What would this do?
If (x > 10) x++; End
If (x > 10) x++; End
This increments all elements of x if the first element is larger than 10.
EDIT:
Looking at the source code, it is actually just a shorthand for If (any (x > 10)) x++; End.
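The distinction can be simulated on a plain 16-element array. This is my own mock of the semantics described above, not QPULib code:

```cpp
#include <array>

constexpr int LANES = 16;
using Vec = std::array<int, LANES>;

// Where-style: the condition masks individual lanes; only lanes
// satisfying it are updated.
void where_increment_gt(Vec &x, int threshold) {
  for (int i = 0; i < LANES; ++i)
    if (x[i] > threshold)
      ++x[i];
}

// If(any(...))-style: one branch decision for the whole vector;
// if ANY lane satisfies the condition, EVERY lane is updated.
void if_any_increment_gt(Vec &x, int threshold) {
  bool any = false;
  for (int i = 0; i < LANES; ++i)
    if (x[i] > threshold) any = true;
  if (any)
    for (int i = 0; i < LANES; ++i)
      ++x[i];
}
```

So Where is per-lane masking, while If is a single control-flow decision shared by all 16 lanes.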
Ah, right. Moment of insight here.
So let's see if I got this right:
v[16]
Correct?
This probably means that kernel 2 is wrong, since I'm using If. Can't resist checking....
Correct?
Yes, but note my "EDIT". Instead of
"if for the processing of v[0] the condition (x > 0) occurs"
I would say
"if for the processing of v the condition (x > 0) holds for any element of v"
Right now I'm hoping for a duh-moment
A tentative theory which is probably wrong but worth considering:
Could there be a load balancing issue where one core ends up doing a lot more work than the others, due to the way the problem is partitioned?
How well are the counts distributed? Do some lines have far higher counts than others?
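That distribution is easy to measure on the scalar side. A sketch (my own helper names, same recurrence as the kernel) that sums iteration counts per row:

```cpp
#include <vector>

// Scalar per-point loop, same recurrence as the kernel.
static int point_count(float reC, float imC, int numIterations) {
  float re = 0.0f, im = 0.0f;
  int count = 0;
  while (re * re + im * im < 4.0f && count < numIterations) {
    float reNew = re * re - im * im + reC;
    im = 2.0f * re * im + imC;
    re = reNew;
    ++count;
  }
  return count;
}

// Total iteration count per row. Rows crossing the set body are far
// more expensive than rows wholly outside it, which is where a
// per-row partitioning could become unbalanced.
std::vector<long> row_workload(float topLeftRe, float topLeftIm,
                               float step, int width, int height,
                               int numIterations) {
  std::vector<long> totals(height, 0);
  for (int y = 0; y < height; ++y)
    for (int x = 0; x < width; ++x)
      totals[y] += point_count(topLeftRe + step * float(x),
                               topLeftIm - step * float(y),
                               numIterations);
  return totals;
}
```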
Adapted from last commit:
Where (resultIndex < (numStepsWidth*numStepsHeight))
mandelbrotCore(
(topLeftReal + offsetX*toFloat(xIndex)),
(topLeftIm - toFloat(yIndex)*offsetY),
resultIndex,
numiterations,
result);
End
End
Gives runtime error:
QPULib: only assignments and nested 'where' statements can occur in a 'where' statement
Mandelbrot: Lib/Source/Translate.cpp:873: void QPULib::whereStmt(QPULib::Seq<QPULib::Instr>*, QPULib::Stmt*, QPULib::Var, QPULib::AssignCond, bool): Assertion `false' failed.
Aborted
That's why I used If here. Any chance of a resolution?
Looking at the source code, it is actually just a shorthand for If (any (x > 10)) x++; End.
OK, thanks. Good to know. That means my usage of If is still wrong in the last commit.
Could there be a load balancing issue where one core ends up doing a lot more work than the others, due to the way the problem is partitioned?
AFAIK, no, all cores do exactly the same amount of work. But then again, I'm a newbie, I can't guarantee this.
That's why I used If here. Any chance of a resolution?
So Where is basically conditional assignment: it will assign to vector elements that satisfy a condition. That's why I only allow assignment statements to occur inside a Where block. I'm not really sure what the semantics of a Where with a While loop inside it would be.
I'm not really sure what the semantics of a Where with a While loop inside it would be.
Heh :smile: I'm pushing the boundaries here. Have a think about it; perhaps it will improve the code later.
I thought up a workaround:
For (Int dummy = 0, dummy < 1 && (resultIndex < (numStepsWidth*numStepsHeight)), dummy++)
//Where (resultIndex < (numStepsWidth*numStepsHeight))
...
End
Still, no difference in execution time.
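The workaround works because a single-iteration loop with a compound condition behaves like a conditional guard. In scalar terms (my own illustrative function):

```cpp
// A for-loop that runs at most once, and only while the real
// condition holds -- i.e. an 'if' expressed with loop syntax.
int guarded_run(bool condition) {
  int runs = 0;
  for (int dummy = 0; dummy < 1 && condition; ++dummy)
    ++runs;  // stands in for the guarded body
  return runs;
}
```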
@mn416 Well, I can explain part of the timing. The generation of the bitmap is within the timing, and it takes quite a long time (about 1.5s).
Moving it past the timersub() call gives a profile time of 0.218960s.
Calculation with 1024x1024 points, Pi2, kernel 2 - multi-QPU
Num QPU's | Time (s) |
---|---|
1 | 2.402123 |
2 | 1.201396 |
3 | 0.805896 |
6 | 0.414084 |
9 | 0.283850 |
12 | 0.218916 |
Better. I'd like to see a linear graph though.
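For what it's worth, the scaling from the table above can be checked quickly. A small sketch (my own helper names; the numbers in the test are copied from the table):

```cpp
// Speedup relative to the 1-QPU run; ideal equals the QPU count.
float speedup(float time1, float timeN) {
  return time1 / timeN;
}

// Parallel efficiency = speedup / numQPUs; 1.0 is perfect linear scaling.
float efficiency(float time1, float timeN, int numQpus) {
  return speedup(time1, timeN) / float(numQpus);
}
```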
And note that this beats the i7 score of 0.869301s. Yay! :tada:
EDIT: Apologies for being an idiot. So obvious....
Ahh, that makes more sense. Cool. I am guessing that store is the bottleneck now; can you verify?
12 QPU's:
store(count, result + resultIndex): 0.218935s
result[resultIndex] = count: 0.220662s
Not using result at all: 0.199739s
Not really much difference.
The results you are seeing are linear if you plot num QPUs versus speedup factor.
The results you are seeing are linear if you plot num QPUs versus speedup factor.
And this is what that looks like:
Something to be satisfied about I think.
Something to be satisfied about I think.
Definitely. The first QPULib example to show strong scaling :)
void mandelbrotCore(
Float reC, Float imC,
Int resultIndex,
Int numiterations,
Ptr<Int> &result)
You might try taking resultIndex and numIterations as references, to avoid unnecessary copying. As I've said before, QPULib doesn't do many optimisations.
That was the last commit; some minor cleanup. Right now I don't have any more bright ideas on how to make it better. A final review, please?
Ideally mandelbrot_2 would look as close as possible to the scalar version, and then the two could be placed side by side in the tutorial. One way to do this would be to inline mandelbrotCore and try to get rid of the strange-looking dummy loop.
These are only suggestions, happy to accept the PR as it is too.
The mandelbrotCore function keeps the code DRY.... I would hesitate to remove it again. A case can be made to put it as such in the tutorial; you can describe it separately.
As for the If (dummy..., this is the best I could think of. Open to suggestions; can this be done better?
I do agree that the dummy if is stupid. Ideally, a Where should be there, but the semantics forbid it.
I found a solution for If(dummy.... Since a line in the Mandelbrot is done completely by a single QPU, it's sufficient to test yIndex only. See the code diff.
Further changes: count in output_pgm(). I tested this with both kernel 1 and 2, and with different numbers of QPU's (especially odd numbers); the output bitmap is now always the same.
@mn416 Heh. There is a competitor.
I wonder how our implementation compares to that one. I'll check in a spare moment.
Tested on Pi 2, kernel 2, 12 QPU's, same parameters for mandelbrot generation as link above.
Competitor's time: 33.781s
Run | Time(s) | comment |
---|---|---|
1 | 9.997932 | Message: 'Failed to invoke kernel on QPUs' |
2 | 31.794590 | this is good! |
3 | 0.000137 | This can't be right at all. Something went wrong with the scheduling? |
4 | 48.802330 | :-( |
5 | 0.000137 | |
6 | 51.746103 | |
7 | 0.000139 | |
8 | 36.522121 |
There's a pattern here. The first call fails in some way. The second call succeeds, but the times are highly variable.
Do you have any idea what can cause this?
Also, see the output bitmap:
The output is 1920x1080, 5MB. I couldn't load it into GIMP so I made a screenshot and scaled it down.
Not sure where this comes from. I'm not really expecting the calculation itself to be in error (however, see error message above). It's probably more likely to do with the pgm generation.
This adds the example program Mandelbrot, at least an initial version.
Rot3d
Running times:
The output PGM bitmap looks like this: