mn416 / QPULib

Language and compiler for the Raspberry Pi GPU

Question: how to transfer values on the regfiles for Special Function Unit #43

Open wimrijnders opened 6 years ago

wimrijnders commented 6 years ago

Usage of the Special Function Unit (SFU) is something I would really like to add to QPULib.

Page 23 of the VideoCore Reference tells you how to do it:

This is simple in principle, but I would like to know the opcodes/commands to move values between different addresses in the regfiles. Would you mind explaining this to me?

The rest I think I can figure out using the QPULib DSL.


I have to say I'm a bit underwhelmed by the SFU. It appears to deal with one value at a time (as opposed to the regular blocks of 16) and on top of that there is the two-cycle wait. Still, I hope it is of some value.

mn416 commented 6 years ago

Hi @wimrijnders,

Accessing the SFU sounds very similar to accessing the TMU. You could follow the gather() and receive() functions through the compiler to see how it works.

wimrijnders commented 6 years ago

Yes, the current DSL makes it trivial to get values into the VideoCore. As the first step, I think the thing to do is load an immediate, run it through any function of the SFU and then return it.

What I don't know is what commands to use once the value is inside. Specifically:

wimrijnders commented 6 years ago

Scanning the document once again, I see a possibility.

I see the fields add_a, mul_a, etc. as bit fields in the Instruction Encoding table (page 26). Scrolling down, in table 3 on page 28, I see that special register r4 can be selected here for read.
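To make the mux idea concrete, here is a minimal sketch of pulling the 3-bit input mux fields out of the low instruction word and naming what they select. The bit positions and the mapping (0 to 5 for accumulators r0 to r5, 6 and 7 for regfile A and B) are my reading of the encoding tables and should be treated as assumptions to check against the reference. The names `muxField`, `muxName` and `exampleWord` are hypothetical and not part of QPULib.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Assumed positions of the input mux fields in the low 32 bits of an
// ALU instruction (verify against the Instruction Encoding table).
constexpr unsigned ADD_A_SHIFT = 9;
constexpr unsigned ADD_B_SHIFT = 6;
constexpr unsigned MUL_A_SHIFT = 3;
constexpr unsigned MUL_B_SHIFT = 0;

// Extract one 3-bit mux field from the low instruction word.
constexpr std::uint32_t muxField(std::uint32_t lowWord, unsigned shift) {
  return (lowWord >> shift) & 0x7u;
}

// Name the source a mux value selects. Assumption: 0..5 are the
// accumulators r0..r5, 6 and 7 pick the value read from regfile A or B.
std::string muxName(std::uint32_t mux) {
  assert(mux < 8);
  if (mux <= 5) return "r" + std::to_string(mux);
  return mux == 6 ? "regfile A" : "regfile B";
}

// Hypothetical example: mux bits for "add_a = r4, add_b = r0".
constexpr std::uint32_t exampleWord() {
  return (4u << ADD_A_SHIFT) | (0u << ADD_B_SHIFT);
}
```

With this, `muxName(muxField(exampleWord(), ADD_A_SHIFT))` names r4 as the first input of the add ALU, which is the read path an SFU result would take.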

Slowly getting there....

wimrijnders commented 6 years ago

BTW, looking at the instruction encoding, I notice that add and multiply can both be set in one instruction. This implies that add and mul can be performed in parallel.
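As a rough sketch of what I mean: the encoding appears to carry two separate opcode fields, so one instruction word can name an add-ALU operation and a mul-ALU operation at once. The field positions and the opcode values for fadd and fmul below are assumptions from my reading of the reference, and `packOps`, `opAddOf` and `opMulOf` are hypothetical helpers, not QPULib functions.

```cpp
#include <cassert>
#include <cstdint>

// Assumed positions of the two opcode fields in the low 32 bits.
constexpr unsigned OP_MUL_SHIFT = 29;  // 3-bit mul ALU opcode
constexpr unsigned OP_ADD_SHIFT = 24;  // 5-bit add ALU opcode

// Assumed opcode values; check them against the opcode tables.
constexpr std::uint32_t OP_ADD_FADD = 1;
constexpr std::uint32_t OP_MUL_FMUL = 1;

// Put an add opcode and a mul opcode into the same instruction word.
std::uint32_t packOps(std::uint32_t opAdd, std::uint32_t opMul) {
  assert(opAdd < 32 && opMul < 8);
  return (opMul << OP_MUL_SHIFT) | (opAdd << OP_ADD_SHIFT);
}

// Read the two opcodes back out of the word.
std::uint32_t opAddOf(std::uint32_t lowWord) {
  return (lowWord >> OP_ADD_SHIFT) & 0x1Fu;
}

std::uint32_t opMulOf(std::uint32_t lowWord) {
  return (lowWord >> OP_MUL_SHIFT) & 0x7u;
}
```

`packOps(OP_ADD_FADD, OP_MUL_FMUL)` yields a word in which both fields are non-zero, i.e. a single instruction that asks for a float add and a float multiply together.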

Could you confirm this for me? If yes, cool!

mn416 commented 6 years ago

My suggestion to look at receive() and gather() was not just to see how the DSL syntax construction works. You will see the AST elements (LOAD_RECEIVE and TMU0_ADDR) used to represent these operations and, if you grep for these in the Source and Target dirs, you'll see how the compiler handles them. These are relevant to the SFU: gather() shows how to write to a special register and receive() shows how to read from r4.
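For the SFU the pattern would be the same: write the operand to a special register, wait, then read the result from r4. Below is a toy software model of that protocol, not QPULib or compiler code. `ToySfu`, `SfuOp` and the latency of two instructions (taken from the "two-cycle wait" mentioned earlier in this thread) are assumptions for illustration, and the model handles a single value rather than a 16-element vector.

```cpp
#include <cassert>
#include <cmath>

// Toy model only: the functions the SFU is assumed to offer.
enum class SfuOp { Recip, RecipSqrt, Exp2, Log2 };

// Assumed number of instructions that must pass before r4 is valid.
constexpr int kSfuLatency = 2;

class ToySfu {
public:
  // Writing an operand to an SFU special register starts the operation.
  void write(SfuOp op, float x) {
    pending_ = compute(op, x);
    countdown_ = kSfuLatency;
    busy_ = true;
  }

  // Advance by one instruction; the result lands in r4 when the
  // latency has elapsed.
  void step() {
    if (busy_ && --countdown_ == 0) {
      r4_ = pending_;
      busy_ = false;
    }
  }

  // Read accumulator r4, as receive() does for TMU results.
  float r4() const { return r4_; }

private:
  static float compute(SfuOp op, float x) {
    switch (op) {
      case SfuOp::Recip:     return 1.0f / x;
      case SfuOp::RecipSqrt: return 1.0f / std::sqrt(x);
      case SfuOp::Exp2:      return std::exp2(x);
      case SfuOp::Log2:      return std::log2(x);
    }
    return 0.0f;  // unreachable for valid SfuOp values
  }

  float pending_ = 0.0f;
  float r4_ = 0.0f;
  int countdown_ = 0;
  bool busy_ = false;
};
```

In this model, calling `write(SfuOp::Recip, 4.0f)` followed by two `step()` calls makes `r4()` return the reciprocal; reading r4 any earlier still sees the old value, which is why the generated code would need two unrelated instructions (or NOPs) in between.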

And yes, parallel add and mul is possible.

wimrijnders commented 6 years ago

:+1: OK thanks for the tip, will do.

> And yes, parallel add and mul is possible.

Woohoo! :sunglasses: Do you optimize for that already? Otherwise, there is much to be won here!

mn416 commented 6 years ago

Hi @wimrijnders,

> Do you optimize for that already? Otherwise, there is much to be won here!

Sadly not! Would be cool if we did...

wimrijnders commented 6 years ago

> Sadly not! Would be cool if we did...

That, my friend, I regard as great news. It can be made even faster! :sunglasses: Something to look at when current library code is stabilized, for example after a release to master.

mn416 commented 6 years ago

Indeed. I think the main bottleneck in QPULib is that the programmer cannot cache arbitrary data in the VPM and access it repeatedly without going out to main memory each time. The refactoring work will give the programmer full control of the VPM, if they want it. After that, we could look at higher-level wrappers around the low-level VPM, to keep it easy to use. And also, like you suggest, optimisations like parallel add/mul.

wimrijnders commented 6 years ago

OK. I almost understand what you mean. I certainly understand the issue; you're dealing with a limited amount of internal storage. If there is more data than fits in there, you need to think about maximizing efficiency by keeping in as much relevant data as possible.

I'll be awaiting whatever you're creating.

wimrijnders commented 6 years ago

Is it OK to expand RegisterMap and detectPlatform? This won't touch any of the existing library code.