wimrijnders closed this 6 years ago.
I really really couldn't resist. Never mind my actual work or other important tasks!
I'm actually quite impressed with the running time of the Pi 2: it's only about 14 times slower than my 3GHz i7. I find that amazing for such a dinky processor.
The goal, as far as I'm concerned, is to have the VideoCore beat this i7 value. I want to see a $40 computer make mincemeat of my Intel laptop.
Added first version of a QPU kernel. This works with the emulator, not tested yet with hardware.
I must honestly say that the conversion from scalar to QPU was straightforward, congrats on that. Sincere feedback on my first kernel attempt is appreciated.
Also some code cleanup.
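For context (this is my own sketch, not the committed code), the scalar per-point loop that the kernel ports looks roughly like this; the name `mandelbrot_count` is hypothetical:

```cpp
// Scalar Mandelbrot loop for one point c = (reC, imC).
// Returns the iteration count at which |z|^2 escapes 4,
// capped at numIterations (points in the set never escape).
int mandelbrot_count(float reC, float imC, int numIterations) {
  float re = 0.0f, im = 0.0f;
  int count = 0;
  while (re * re + im * im < 4.0f && count < numIterations) {
    float reNew = re * re - im * im + reC;  // z^2 + c, real part
    im = 2.0f * re * im + imC;              // z^2 + c, imaginary part
    re = reNew;
    ++count;
  }
  return count;
}
```

The QPU version runs this same recurrence on 16 points at once, one per vector lane.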
I used the following construct on a hunch:
BoolExpr condition = (radius < 4 && count < numiterations);
While (any(condition))
Where (condition)
...
I'm a bit surprised actually that it worked; it appears to be recalculated on every iteration as well. So BoolExpr works as a kind of lambda, apparently.
Is this correct? Would you expect it to work properly like this? Or am I stretching the definition here? Keep in mind that I've only run it on an emulator. Perhaps there are some devious differences with the QPU.
Working on QPU! Execution time of 0.215066s with 1 QPU, 192x192 points.
Pardon my language, but I'm fucking impressed. WORKING! This DSL thing you crafted actually delivers.
Drinks are on me if ever we meet.
I had to reduce the resolution to 192x192 (was 1024x1024), because otherwise you get a heap alloc error.
Time comparison:
Platform | Kernel | Time (s) |
---|---|---|
i7 | scalar | 0.032004 |
i7 emulator | 1 | 1.303475 |
p2 | scalar | 0.470066 |
p2 emulator | 1 | 26.990710 |
p2 | 1 | 0.215066 |
i7 scalar is still 7 times faster than p2 kernel 1. However, this is 1 QPU and completely unoptimized code.
Woohoo! Really nice!!! I haven't looked at the code yet, but will do and offer suggestions if any come to mind. We should put this as one of the introductory examples in the README :)
BoolExpr condition = (radius < 4 && count < numiterations);
Yes, BoolExpr is an expression not a value -- it isn't actually evaluated at this point.
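The deferred-evaluation behaviour can be illustrated outside QPULib with a tiny expression-template sketch. These are my own toy types, not QPULib's internals, but they show the same principle: the expression captures how to compute the condition, not its current value:

```cpp
#include <functional>

// Toy stand-ins for QPULib's Int variable and BoolExpr.
struct Var { int value; };

struct BoolExprToy {
  std::function<bool()> eval;  // deferred: evaluated on demand
};

// Building the expression stores a closure; nothing is evaluated yet.
BoolExprToy operator<(Var &a, int limit) {
  return { [&a, limit] { return a.value < limit; } };
}

// Re-evaluating the same expression object sees updated variables,
// just like re-checking the condition on every loop iteration.
int run_until(Var &count, const BoolExprToy &cond) {
  int turns = 0;
  while (cond.eval()) {
    ++count.value;
    ++turns;
  }
  return turns;
}
```

In QPULib the "closure" is really the generated code being inlined at each use site, but the observable effect is the same.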
I'm looking forward to the 12 QPU version :)
First suggestion: instead of
result[i] = count;
try
store(count, result+i);
The latter is non-blocking -- it doesn't wait until the store is complete before continuing execution.
Second suggestion: try using gather
and receive
instead of array lookups for the loads. However, I'd probably try multiple QPUs first.
By the way, does Mandelbrot require the loads? Can the fractal be produced without reading any input arrays?
Despite these suggestions, I'm glad you implemented the non-optimised version first. It looks very neat.
Second suggestion: try using gather and receive instead of array lookups for the loads. However, I'd probably try multiple QPUs first.
Okay, but that's for kernel 2. Right now, I'm more interested in getting the current code optimized. Deluge me with suggestions!
By the way, does Mandelbrot require the loads? Can the fractal be produced without reading any input arrays?
Well, no to the first and yes to the second question. It's entirely possible to initialize everything with the given parameters, but I haven't figured out how to do that yet. All I've got is what I understand from GCD and Rot3D. Suggestions are welcome; otherwise, have patience.
EDIT: Well, ....
Here's the second iteration of the Mandelbrot kernel. It does away with the input arrays. The trick here was to understand the usage of index().
Also some code 'optimization' - in quotes because it didn't make one bit of difference.
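The idea of dropping the input arrays can be mimicked in plain C++: each of the 16 vector lanes derives its own pixel coordinate from its lane index plus the step parameters, instead of loading it from memory. This is a sketch of the principle only; `NUM_LANES`, `Coord`, and `lane_coord` are my own names, not QPULib API:

```cpp
// A QPU vector has 16 lanes; QPULib's index() yields 0..15 per lane.
constexpr int NUM_LANES = 16;

struct Coord { float re; float im; };

// pixelBase is the x-position of lane 0 within the row; the lane
// index replaces what was previously a load from an input array.
Coord lane_coord(float topLeftRe, float topLeftIm,
                 float offsetX, float offsetY,
                 int row, int pixelBase, int lane) {
  int x = pixelBase + lane;
  return { topLeftRe + offsetX * float(x),
           topLeftIm - offsetY * float(row) };
}
```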
Calculation with 192x192 points; 1.0 previous kernel, 1.1 new kernel
Platform | Kernel | Time (s) |
---|---|---|
i7 | scalar | 0.032004 |
Pi2 | scalar | 0.470066 |
Pi2 | 1.0 | 0.215066 |
Pi2 | 1.1 | 0.209823 |
Well, maybe a tiny bit. I think it's fair to say that this kernel is computation-bound, because removing the data transport does not make one bit of difference (do you agree?).
The nice thing about not having the input arrays is that the points can be scaled up again to 1024x1024:
Calculation with 1024x1024 points
Platform | Kernel | Time (s) |
---|---|---|
i7 | scalar | 0.869301 |
Pi2 | scalar | 13.712409 |
Pi2 | 1.0 | couldn't run |
Pi2 | 1.1 | 5.005170 |
Yes, BoolExpr is an expression not a value -- it isn't actually evaluated at this point.
I understand now. The generated code is inlined and therefore it's as if it's called as a lambda. I actually really like this serendipitous capability, you should use it as a selling point and formalize it in the documentation.
I used this construct for a further optimization. This double use of condition bothered me:
BoolExpr condition = (radius < 4 && count < numiterations);
While (any(condition))
Where (condition)
...
... because with my new insight it's obvious that the condition got executed twice - overhead. So I used the same principle to tweak the condition to something that can be stored in a variable so that only that variable needs to be checked:
FloatExpr condition = (4.0f - (reSquare + imSquare))*toFloat(numiterations - count);
Float checkvar = condition;
While (any(checkvar > 0))
Where (checkvar > 0)
And there is a slight improvement:
Calculation with 1024x1024 points
Platform | Kernel | Time (s) |
---|---|---|
i7 | scalar | 0.869301 |
Pi2 | scalar | 13.712409 |
Pi2 | 1.1 | 5.005170 |
Pi2 | 1.1 tweaked | 4.370737 |
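The tweak relies on the two formulations agreeing. A quick scalar check (my own harness, hypothetical names) suggests they match whenever `count` never exceeds `numiterations`, which the masked loop should guarantee since a lane stops incrementing `count` once its condition fails:

```cpp
// Original condition:  radius < 4 && count < numiterations
// Tweaked condition:   (4 - radius) * (numiterations - count) > 0
// These agree as long as count <= numiterations, so that the two
// factors are never both negative at the same time.
bool original_cond(float radius, int count, int numIterations) {
  return radius < 4.0f && count < numIterations;
}

bool tweaked_cond(float radius, int count, int numIterations) {
  return (4.0f - radius) * float(numIterations - count) > 0.0f;
}

// Exhaustive check over a grid of plausible values.
bool conditions_agree(int numIterations) {
  for (int count = 0; count <= numIterations; ++count)
    for (float radius = 0.0f; radius <= 8.0f; radius += 0.25f)
      if (original_cond(radius, count, numIterations) !=
          tweaked_cond(radius, count, numIterations))
        return false;
  return true;
}
```

Note the caveat: if `count` could ever exceed `numiterations`, both factors would go negative and the product would turn positive again, so the equivalence depends on the loop stopping the increment.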
Given:
Int a;
Float b;
Float c;
... the following don't work:
c = a+b; // No operator for Int, Float combination
c = a*b; // idem
c += b; // operator doesn't exist
a = (b < 0); // Can't assign result BoolExpr to Int
There are alternatives to the first three of course:
c = toFloat(a)+b;
c = toFloat(a)*b;
c = c + b;
But I personally would truly appreciate it if the initial versions worked. I can sort of understand if you want to have explicit casts, but still.
I hereby place a feature request for the given operators, and also for the conversion of a BoolExpr result to Int.
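For what it's worth, the first three requests could probably be met with thin overloads that insert the cast. This is a sketch with stand-in types (`IntToy`, `FloatToy`, `toFloatToy` are mine), not QPULib's actual classes:

```cpp
// Stand-in types showing how mixed Int/Float operators could simply
// forward to an explicit conversion, so `a + b` means `toFloat(a) + b`.
struct IntToy   { int value; };
struct FloatToy { float value; };

FloatToy toFloatToy(IntToy a) { return { float(a.value) }; }

FloatToy operator+(IntToy a, FloatToy b) {
  return { toFloatToy(a).value + b.value };  // cast hidden in the overload
}

FloatToy operator*(IntToy a, FloatToy b) {
  return { toFloatToy(a).value * b.value };
}

// c += b desugars to c = c + b.
FloatToy &operator+=(FloatToy &c, FloatToy b) {
  c.value = c.value + b.value;
  return c;
}
```

The explicit-cast design stays intact internally; the overloads just write the cast for you.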
Also, a minor point, following does not work:
store(count, result[index]);
I had to do it like this instead:
store(count, result + index);
But TBH this is a small thing I can live with.
I hope it goes without saying that any optimizations you can think of are appreciated. I want to embarrass the i7, but we're not close yet!
And I'll repeat, I'm impressed with your efforts at making this work. I just starred your project, great work! Hope I can help to make it even better.
I hope it goes without saying that any optimizations you can think of are appreciated.
I see you got rid of the loads, excellent. I agree, the kernel is now compute bound so should scale up to 12 QPUs without much hassle. I'm not saying we're getting optimal performance from a single QPU, but that is surely more the compiler's fault than the program's, in this case.
Update, initial 12 QPU version:
Calculation with 1024x1024 points, 2 is multi-QPU kernel
Platform | Kernel | Time (s) |
---|---|---|
i7 | scalar | 0.869301 |
Pi2 | scalar | 13.712409 |
Pi2 | 1.1 | 4.370737 |
Pi2 | 2 | 2.187408 |
:-( I'm just so intensely disappointed right now. I'll see if I can tweak it further, then I'll commit for your insights.
Calculation with 1024x1024 points, Pi2, kernel 2 - multi-QPU
Num QPU's | Time (s) |
---|---|
1 | 4.427383 |
2 | 3.272469 |
3 | 2.755362 |
6 | 2.492874 |
12 | 2.192445 |
Not linear with num QPU's as I was expecting.....
I can only imagine that there is a bottleneck created by the store function. What if you remove all calls to store, how well does the scaling work then?
I'm starting to think that the VideoCore doesn't like the way I am using the DMA unit -- lots of single-vector DMA requests. If so, this is a good thing to learn because it is probably also the bottleneck in other QPULib examples, and it is fixable.
Is there any difference between If() and Where()? They appear to do the same thing.
Would a Continue and/or Break statement be possible? E.g.
While (condition) // Also For(); any loops
  If (condition2) Continue; End // Do next iteration of the loop
  If (condition3) Break; End // Exit the loop
End
The same goes for Return. Also, to break out of a sub-generator, e.g.:
void func_1() {
  ...
  Return; // Get out of current generator
  ...
}
void func_2() {
  ...
  func_1();
  ...
}
An Exit would also be nice, but I think that's what kernelFinish() does.
What if you remove all calls to store, how well does the scaling work then?
No difference.
Is there any difference between If() and Where()? They appear to do the same thing
Where allows assignment to a subset of elements of a vector, where that subset is determined by the condition.
If executes different code depending on the condition.
I committed the last changes: mandelbrotCore().
Right now I'm hoping for a duh-moment where you point out some obvious error to me.
Where allows assignment to a subset of elements of a vector, where that subset is determined by the condition.
Sorry, don't get it. Example to point out difference?
I'm stopping now, wasted[1] too much time on this already. I should be working right now!
[1] 'wasted' being a relative term. I'm having loads of fun doing this.
Sorry, don't get it. Example to point out difference?
Where (x > 10) x++; End
The above increments the elements of vector x that are larger than 10.
If (any(x > 10)) x++; End
The above increments every element of x, but only when at least one of its elements is larger than 10.
OK so far. What would this do?
If (x > 10) x++; End
If (x > 10) x++; End
This increments all elements of x if the first element is larger than 10.
EDIT:
Looking at the source code, it is actually just a shorthand for If (any (x > 10)) x++; End.
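The distinction can be simulated on a plain 16-element array. This is my own mock of the semantics described above, not QPULib code:

```cpp
#include <array>

constexpr int LANES = 16;
using Vec = std::array<int, LANES>;

// Where-style: the condition masks individual lanes; only lanes
// satisfying it are updated.
void where_increment_gt(Vec &x, int threshold) {
  for (int i = 0; i < LANES; ++i)
    if (x[i] > threshold)
      ++x[i];
}

// If(any(...))-style: one branch decision for the whole vector;
// if ANY lane satisfies the condition, EVERY lane is updated.
void if_any_increment_gt(Vec &x, int threshold) {
  bool any = false;
  for (int i = 0; i < LANES; ++i)
    if (x[i] > threshold) any = true;
  if (any)
    for (int i = 0; i < LANES; ++i)
      ++x[i];
}
```

So Where is per-lane masking, while If is a single control-flow decision shared by all 16 lanes.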
Ah, right. Moment of insight here.
So let's see if I got this right:
v[16]
Correct?
This probably means that kernel 2 is wrong, since I'm using If. Can't resist checking....
Correct?
Yes, but note my "EDIT". Instead of
"if for the processing of v[0] the condition (x > 0) occurs"
I would say
"if for the processing of v the condition (x > 0) holds for any element of v"
Right now I'm hoping for a duh-moment
A tentative theory which is probably wrong but worth considering:
Could there be a load balancing issue where one core ends up doing a lot more work than the others, due to the way the problem is partitioned?
How well are the counts distributed? Do some lines have far higher counts than others?
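That distribution is easy to measure on the scalar side. A sketch (my own helper names, same recurrence as the kernel) that sums iteration counts per row:

```cpp
#include <vector>

// Scalar per-point loop, same recurrence as the kernel.
static int point_count(float reC, float imC, int numIterations) {
  float re = 0.0f, im = 0.0f;
  int count = 0;
  while (re * re + im * im < 4.0f && count < numIterations) {
    float reNew = re * re - im * im + reC;
    im = 2.0f * re * im + imC;
    re = reNew;
    ++count;
  }
  return count;
}

// Total iteration count per row. Rows crossing the set body are far
// more expensive than rows wholly outside it, which is where a
// per-row partitioning could become unbalanced.
std::vector<long> row_workload(float topLeftRe, float topLeftIm,
                               float step, int width, int height,
                               int numIterations) {
  std::vector<long> totals(height, 0);
  for (int y = 0; y < height; ++y)
    for (int x = 0; x < width; ++x)
      totals[y] += point_count(topLeftRe + step * float(x),
                               topLeftIm - step * float(y),
                               numIterations);
  return totals;
}
```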
Adapted from last commit:
Where (resultIndex < (numStepsWidth*numStepsHeight))
mandelbrotCore(
(topLeftReal + offsetX*toFloat(xIndex)),
(topLeftIm - toFloat(yIndex)*offsetY),
resultIndex,
numiterations,
result);
End
End
Gives runtime error:
QPULib: only assignments and nested 'where' statements can occur in a 'where' statement
Mandelbrot: Lib/Source/Translate.cpp:873: void QPULib::whereStmt(QPULib::Seq<QPULib::Instr>*, QPULib::Stmt*, QPULib::Var, QPULib::AssignCond, bool): Assertion `false' failed.
Aborted
That's why I used If here. Any chance of a resolution?
Looking at the source code, it is actually just a shorthand for If (any (x > 10)) x++; End.
OK, thanks. Good to know. That means my usage of If is still wrong in the last commit.
Could there be a load balancing issue where one core ends up doing a lot more work than the others, due to the way the problem is partitioned?
AFAIK, no, all cores do exactly the same amount of work. But then again, I'm a newbie, I can't guarantee this.
That's why I used If here. Any chance of a resolution?
So Where is basically conditional assignment: it will assign to vector elements that satisfy a condition. That's why I only allow assignment statements to occur inside a Where block. I'm not really sure what the semantics of a Where with a While loop inside it would be.
I'm not really sure what the semantics of a Where with a While loop inside it would be.
Heh :smile: I'm pushing the boundaries here. Have a think about it; perhaps it will improve the code later.
I thought up a workaround:
For (Int dummy = 0, dummy < 1 && (resultIndex < (numStepsWidth*numStepsHeight)), dummy++)
//Where (resultIndex < (numStepsWidth*numStepsHeight))
...
End
Still, no difference in execution time.
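The workaround works because a single-iteration loop with a compound condition behaves like a conditional guard. In scalar terms (my own illustrative function):

```cpp
// A for-loop that runs at most once, and only while the real
// condition holds -- i.e. an 'if' expressed with loop syntax.
int guarded_run(bool condition) {
  int runs = 0;
  for (int dummy = 0; dummy < 1 && condition; ++dummy)
    ++runs;  // stands in for the guarded body
  return runs;
}
```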
@mn416 Well, I can explain part of the timing. The generation of the bitmap is within the timing, and it takes quite a long time (about 1.5s).
Moving it past the timersub() call gives a profile time of 0.218960s.
Calculation with 1024x1024 points, Pi2, kernel 2 - multi-QPU
Num QPU's | Time (s) |
---|---|
1 | 2.402123 |
2 | 1.201396 |
3 | 0.805896 |
6 | 0.414084 |
9 | 0.283850 |
12 | 0.218916 |
Better. I'd like to see a linear graph though.
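For what it's worth, the scaling from the table above can be checked quickly. A small sketch (my own helper names; the numbers in the test are copied from the table):

```cpp
// Speedup relative to the 1-QPU run; ideal equals the QPU count.
float speedup(float time1, float timeN) {
  return time1 / timeN;
}

// Parallel efficiency = speedup / numQPUs; 1.0 is perfect linear scaling.
float efficiency(float time1, float timeN, int numQpus) {
  return speedup(time1, timeN) / float(numQpus);
}
```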
And note that this beats the i7 score of 0.869301s. Yay! :tada:
EDIT: Apologies for being an idiot. So obvious....
Ahh, that makes more sense. Cool. I am guessing that store is the bottleneck now; can you verify?
12 QPU's:
store(count, result + resultIndex): 0.218935s
result[resultIndex] = count: 0.220662s
Not using result at all: 0.199739s
Not really much difference.
The results you are seeing are linear if you plot num QPUs versus speedup factor.
The results you are seeing are linear if you plot num QPUs versus speedup factor.
And this is what that looks like:
Something to be satisfied about I think.
Something to be satisfied about I think.
Definitely. The first QPULib example to show strong scaling :)
void mandelbrotCore(
Float reC, Float imC,
Int resultIndex,
Int numiterations,
Ptr<Int> &result)
You might try taking resultIndex and numIterations as references, to avoid unnecessary copying. As I've said before, QPULib doesn't do many optimisations.
That was the last commit; some minor cleanup. Right now I don't have any more bright ideas on how to make it better. A final review, please?
Ideally mandelbrot_2 would look as close as possible to the scalar version, and then the two could be placed side by side in the tutorial. One way to do this would be to inline mandelbrotCore and try to get rid of the strange-looking dummy loop.
These are only suggestions, happy to accept the PR as it is too.
The mandelbrotCore function keeps the code DRY.... I would hesitate to remove it again. A case can be made to put it as such in the tutorial; you can describe it separately.
As for the If (dummy..., this is the best I could think of. Open to suggestions; can this be done better?
I do agree that the dummy if is stupid. Ideally, a Where should be there, but the semantics forbid it.
I found a solution for If(dummy.... Since a line in the Mandelbrot is done completely by a single QPU, it's sufficient to test yIndex only. See the code diff.
Further changes: count in output_pgm(). I tested this with both kernel 1 and 2, and with different numbers of QPU's (especially odd numbers); the output bitmap is now always the same.
@mn416 Heh. There is a competitor.
I wonder how our implementation compares to that one. I'll check in a spare moment.
Tested on Pi 2, kernel 2, 12 QPU's, same parameters for mandelbrot generation as link above.
Competitor's time: 33.781s
Run | Time(s) | comment |
---|---|---|
1 | 9.997932 | Message: 'Failed to invoke kernel on QPUs' |
2 | 31.794590 | this is good! |
3 | 0.000137 | This can't be right at all. Something went wrong with the scheduling? |
4 | 48.802330 | :-( |
5 | 0.000137 | |
6 | 51.746103 | |
7 | 0.000139 | |
8 | 36.522121 |
There's a pattern here. The first call fails in some way. The second call succeeds, but the times are highly variable.
Do you have any idea what can cause this?
Also, see the output bitmap:
The output is 1920x1080, 5MB. I couldn't load it into GIMP so I made a screenshot and scaled it down.
Not sure where this comes from. I'm not really expecting the calculation itself to be in error (however, see error message above). It's probably more likely to do with the pgm generation.
This adds the example program Mandelbrot, at least an initial version.
Rot3d
Running times:
The output PGM bitmap looks like this: