hoglet67 / RGBtoHDMI

Bare-metal Raspberry Pi project that provides pixel-perfect sampling of Retro Computer RGB/YUV video and conversion to HDMI
GNU General Public License v3.0

Investigate offloading capture to GPU #250

Open hoglet67 opened 2 years ago

hoglet67 commented 2 years ago

(This thread started in PMs on stardot with some questions from Ian)

How do you make the vasm assembler? I tried running make but just got an error. (I'm using a Linux prompt under Windows 10)

Try:

make CPU=vidcore SYNTAX=std

After that, I assume I need to run vidcore/build.sh ?

Yes, that will then update a file in the source tree. You'll need to change that path to suit RGBTOHDMI.

Is there any documentation on the instruction set?

https://github.com/hermanhermitage/videocoreiv/wiki/VideoCore-IV-Programmers-Manual

Also are the addresses the same as in the Arm?

Not quite, the top two bits control how the VPU caches the access. See https://datasheets.raspberrypi.com/bcm2835/bcm2835-peripherals.pdf#page=5

i.e. do I pass the Arm address as the execution address?

Yes

What about the GPIO read address?

The peripherals are mapped to start at 0x7e000000 on the VPU.

Where in memory should I put the GPU code?

Anywhere you like. In PiTubeDirect it ends up in normal cached memory.

IanSB commented 2 years ago

I implemented the GPU GPIO test code and got the following results.

Pi Zero 2W: GPU GPIO read = 34 ns, ARM GPIO read = 60 ns, cached read = 2 ns, screen read = 116 ns
Pi Zero: GPU GPIO read = 35 ns, ARM GPIO read = 48 ns, cached read = 3 ns, screen read = 101 ns

Here is the test code:

ARM:  
        ldr    r4, =GPLEV0
        ldr    r1, =1000000
gbenchloop:
        ldr    r8, [r4]
        subs   r1, r1, #1
        bne    gbenchloop

GPU:

.equ GPFSEL0,       0x7e200000
.equ GPLEV0_offset, 0x34

   mov    r2, 1000000
   mov    r1, GPFSEL0   
readloop:
   ld     r3, GPLEV0_offset(r1)    
   sub    r2, 1
   cmp    r2, 0
   bne    readloop  
   rts
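For reference, the per-read figures quoted in this thread are just the total loop time divided by the iteration count (loop overhead ignored); a trivial helper, plain C and not part of the project, makes the conversion explicit:

```c
#include <stdint.h>

/* Convert a benchmark loop's total elapsed time into ns per read.
 * total_ns: elapsed time for the whole loop; iterations: loop count.
 * The loop overhead (decrement + branch) is ignored, as in the
 * figures quoted above. */
uint32_t ns_per_read(uint64_t total_ns, uint32_t iterations)
{
    return (uint32_t)(total_ns / iterations);
}
```

So 1,000,000 GPIO reads completing in 34 ms corresponds to the 34 ns/read figure for the Zero 2W GPU.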

So now I need to replicate the second ARM core code in the GPU, but what is the best way to communicate the data between the GPU and the ARM? The options are:

  1. Use shared memory, similar to the two ARM cores. Is there any cache consistency between the ARM and the GPU?
  2. Use the mailboxes. Are these slower than memory?
hoglet67 commented 2 years ago
  1. Use shared memory similar to the two ARM cores. Is there any cache consistency between the ARM and the GPU?

Honestly, I'm a bit confused on this myself, but it's certainly not transparent.

When we use the mailbox interface in PiTubeDirect, the shared buffer is placed in cached memory, and the ARM code uses "data cache clean and invalidate by memory virtual address" to make the buffer visible to the GPU.
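As a rough sketch of what that clean-and-invalidate step involves (not the actual PiTubeDirect code): the ARM walks every cache line covering the shared buffer and issues the CP15 clean-and-invalidate-by-MVA operation on each. Here `clean_invalidate_line` is a hypothetical stand-in for the MCR instruction, and the 32-byte line size is an assumption:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32  /* data cache line size: an assumption */

/* Hypothetical stand-in for the CP15 "clean and invalidate data
 * cache line by MVA" operation. On real ARMv6 hardware this would
 * be an inline-asm MCR; here it is a no-op so the walk is testable. */
void clean_invalidate_line(uintptr_t addr)
{
    (void)addr;
}

/* Walk every cache line covering [buf, buf+len) so the whole buffer
 * is pushed out to memory where the GPU can see it.
 * Returns the number of lines touched. */
unsigned flush_buffer(uintptr_t buf, size_t len)
{
    unsigned lines = 0;
    uintptr_t addr = buf & ~(uintptr_t)(CACHE_LINE - 1);
    for (; addr < buf + len; addr += CACHE_LINE) {
        clean_invalidate_line(addr);
        lines++;
    }
    return lines;
}
```

The key detail is rounding the start address down to a line boundary, so a buffer that straddles line boundaries is fully cleaned.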

  2. Use the mailboxes. Are these slower than memory?

The mailbox hardware is essentially just a 32-bit wide register in each direction (one for ARM->GPU, the other for GPU->ARM). When the register is written by one side, it raises an interrupt on the other side. The register is only one word deep (i.e. it's not a FIFO), so only one word can be transferred at a time. Typically a multi-word message is written to a buffer in memory, and the address of the buffer is written into the mailbox register.

Unfortunately there is only one such mailbox, and it is used by the Broadcom Firmware Blob, so it is not directly usable by an application.

In addition to the mailbox hardware, there is also doorbell hardware, which is like the mailbox but without the 32-bit register. So it's really just a way of raising an interrupt on the other side.

In PiTubeDirect, the code supports two mechanisms for communications between the ARM and GPU (configurable at compile time).

  1. Mailboxes

In this approach we upload the GPU code onto VPU Core 0 (the one running the Broadcom Blob). Our code disables interrupts and never exits, so it effectively replaces the Broadcom Blob and can then take over the mailbox hardware.

The downside is that none of the facilities provided by the Broadcom Blob are available any more: for example, thermal throttling, changing screen modes, etc.

  2. Doorbells

In this approach we upload the GPU code onto VPU Core 1 and leave the Broadcom Blob running on VPU Core 0. The GPU code sits in a polling loop waiting for the doorbell to ring. We wanted the doorbell to work like the mailbox, so we needed to borrow a 32-bit word in the peripheral space to hold the buffer pointer. We just picked an unused 32-bit register:

.equ GPU_ARM_DBELLDATA, 0x7E20C014   # Hijack PWM_DAT1 for Doorbell1 Data

This is just one of the PWM data registers, which is available because the PWM peripheral is not used by PiTubeDirect.

From hognose onwards we will be using doorbells.

The other thing to bear in mind is that the existing EXECUTE_CODE and LAUNCH_VPU1 mailbox calls take a function address and several parameters. So it's possible to use this to pass parameters from the ARM to the GPU.
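A hedged sketch of building such a call on the ARM side, following the standard VideoCore mailbox property message layout. The exact contents of the value buffer here (entry address plus r0-r5 arguments) are an assumption based on the description above, not a verified spec, and real buffers need 16-byte alignment:

```c
#include <stdint.h>

#define TAG_LAUNCH_VPU1 0x30013

/* Build a mailbox property message for LAUNCH_VPU1 into msg[]
 * (at least 16 words). Returns the message length in words.
 * Value-buffer contents (entry + r0..r5) are an assumption. */
unsigned build_launch_vpu1(uint32_t *msg, uint32_t entry,
                           const uint32_t args[6])
{
    unsigned i = 0;
    msg[i++] = 0;               /* [0] total size in bytes, patched below */
    msg[i++] = 0;               /* [1] request code */
    msg[i++] = TAG_LAUNCH_VPU1; /* [2] tag id */
    msg[i++] = 7 * 4;           /* [3] value buffer size in bytes */
    msg[i++] = 0;               /* [4] tag request/response code */
    msg[i++] = entry;           /* [5] VPU1 entry point */
    for (unsigned j = 0; j < 6; j++)
        msg[i++] = args[j];     /* [6..11] r0..r5 */
    msg[i++] = 0;               /* [12] end tag */
    msg[0] = i * 4;             /* patch in the total size */
    return i;
}
```

On real hardware the physical address of `msg` would then be written to the mailbox register, with the channel number in the low 4 bits.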

Question: What are you thinking should be off-loaded to the GPU? The capture of a single line, or the capture of a whole frame?

IanSB commented 2 years ago

Question: What are you thinking should be off-loaded to the GPU? The capture of a single line, or the capture of a whole frame?

Ideally, I think, the capture of a line of raw GPIO data, just like the current second core does. This would leave the ARM free to do the rendering, which would give some scope for image processing (like improved NTSC artifact decoding) without having to ensure that each render loop is always short enough to catch the next psync edge. (The average time would still have to match a line, but individual loops could have significantly variable timing, which isn't possible with capture and render in the same loop.)

I've been doing some timing of register reads, and it looks like reading any register takes 60 ns on the Zero 2 ARM (including your PWM_DAT1 above).

One other possible large area of shared memory is the display list RAM: a 16 kilobyte block of RAM in the register area which holds the display lists. It might be possible to use a small area at the end, as the display lists don't fill it all.

However, reading that is just as slow as reading the other registers, i.e. 60 ns.

BUT using LDM does produce a significant speedup: using LDM to read 4 locations takes 67 ns, an average of ~16 ns per word, which should work. (An LDM of 8 words takes 76 ns, or 9.5 ns per word.)

I'm going to do some cache consistency tests to see if ordinary shared memory is possible, but the display list RAM looks viable as well.

IanSB commented 2 years ago

@hoglet67 There doesn't appear to be any cache consistency between the GPU and the ARM. If I write a value to the uncached screen area from the GPU, I can see that value on the ARM. If I write a value to the cached screen area from the GPU, I can't see that value on the ARM if I had previously read the same location, preloading the cache.

So I think I'm going to try to get something working using the display list RAM, as that is faster than uncached memory, with a very simple GPU routine that captures the GPIO state on psync changes and writes it in blocks of 4 words to the display list RAM.

This GPU code might need to be left running all the time so how do I run code in core 1?

hoglet67 commented 2 years ago

This GPU code might need to be left running all the time so how do I run code in core 1?

You can do that with the LAUNCH_VPU1 call:

    TAG_LAUNCH_VPU1 = 0x30013,

As far as I understand, it returns once the code on the VPU1 core has been started. That code then never exits.

IanSB commented 2 years ago

The average calling overhead for EXECUTE_CODE is 607 ns, which makes it impractical to call at the beginning of every line.

IanSB commented 2 years ago

An update:

I've got this working in principle but it's not usable at the moment:

I tried various options for passing data between the cores, but nothing is really suitable:

  - Using uncached memory is too slow.
  - Using the display list RAM is not possible, because the RAM contents randomly return 0xff000000 instead of the actual content when read. Reading again will usually return the right value. (Maybe some contention while the display list is being processed?)
  - Using peripheral registers, like the PWM one used in PiTubeDirect, does work, but reading the registers is just as slow as reading the GPIOs (~60ns on the Zero 2W, ~48ns on the Zero), and that introduces too much delay.

I managed to get some improvement by packing two 12-bit samples into one 32-bit register write, but it isn't enough. I tried using two registers to pass pixel data, with the intention of using LDM or LDRD to read those two words, but neither of those instructions seems to work on the register memory locations. Executing them results in the same location being read into all registers with no address increment, i.e. ldmia r11, {r8, r9} gets the r8 value in r9 as well; same with ldrd r8, r9, [r11].
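The packing scheme (two 12-bit GPIO samples per 32-bit register write) can be sketched like this; the low-sample-first bit layout is an assumption, not necessarily what the capture code actually uses:

```c
#include <stdint.h>

/* Pack two 12-bit samples into one 32-bit word (low sample first),
 * halving the number of register writes needed per sample pair. */
uint32_t pack12(uint32_t s0, uint32_t s1)
{
    return (s0 & 0xFFF) | ((s1 & 0xFFF) << 12);
}

/* Unpack on the ARM (rendering) side. */
void unpack12(uint32_t w, uint32_t *s0, uint32_t *s1)
{
    *s0 = w & 0xFFF;
    *s1 = (w >> 12) & 0xFFF;
}
```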

It works for 3bpp and 6bpp modes, with the exception that there is a one-line glitch drifting up or down the screen, which is indicative of one psync pulse being missed. This might be caused by contention between both sides polling the registers.

I'm going to try reading and rendering in the GPU, but I'm not too hopeful, as there are no burst write instructions on the GPU for fast writing to the uncached screen. There is an STM to the stack, but that seemed to take as long as writing the individual words to uncached memory. (There are also some vector memory instructions which could be investigated.)

hoglet67 commented 2 years ago

I've brought this thread to the attention of @dp111

So what you currently have is:

GPU (Capture code) -> Cached Memory -> ARM (rendering code) -> Uncached Framebuffer

I also wondered if the vector instructions could be used to do everything on the GPU.

I'm sure there are lots of options here.

IanSB commented 2 years ago

Actually it's

GPU (Capture code) -> Unused IO registers -> ARM (rendering code) -> Uncached Framebuffer

Looking at the timings, I still think there is a good chance of getting this fully working, but I could do with more registers. Are there any large blocks of unused registers suitable? I think I would need at least 4, maybe 8, sequential registers, or even more.

hoglet67 commented 2 years ago

Are there any large blocks of unused registers suitable? I think I would need at least 4 or maybe 8 sequential or even more

On PiTubeDirect we use a block of 8 registers starting at this address:

#define GPU_TUBE_REG_ADDR 0x7e0000a0

In the BCM2835 page they are called MS_MBOX_0 .. MS_MBOX_7

I'm not sure we ever worked out which peripheral these are part of.

IanSB commented 2 years ago

Thanks, using those registers fixed the strange line glitch I'd been seeing. It also looks like it will work at 12bpp, as it isn't dropping any psyncs, but I'm seeing some edge noise, which I think is due to a register occasionally being updated while it's read, as I don't have a large enough FIFO buffer yet.

IanSB commented 2 years ago

I figured out what was happening with the ldm instruction:

If you have 4 I/O registers at locations x00, x04, x08, x0c and do ldm r11, {r0, r1, r2, r3}, you end up with the contents of x00 in r0 and r1, and the contents of x08 in r2 and r3.

I assume this is because the LDM is reading the 64-bit memory data bus and putting the low and high 32-bit words of that bus in adjacent registers. The I/O registers are only 32 bits wide, so their contents are duplicated in the low and high 32-bit words of the bus.

However, if you do an unaligned read of two adjacent I/O registers, they do end up in the correct ARM registers: with r11 pointing to x04, ldmia r11, {r8, r9} puts the contents of x04 in r8 and the contents of x08 in r9, and this is significantly faster than two separate LDRs of x04 and x08. This means I can read 4 samples in 64 bits very quickly compared to reading a single sample from the GPIO register.

IanSB commented 2 years ago

Well, I've got that all working on the Zero 2W and it now outperforms the original Pi Zero.

14.3 MHz Amiga pixel clock (screenshot: capture9): this is with double-height capture and variable intensity scanlines, which on the original Pi Zero required an overclock of cpu=40Mhz & core=140Mhz. With GPU capture, no overclock is required at all!

16 MHz Atari ST pixel clock (screenshot: capture0): this required an overclock of core=110Mhz on the Pi Zero, but only a mild overclock of core=50Mhz with GPU capture. Double-height capture for variable intensity scanlines also works, which was impossible on the Pi Zero.

I tried the code on an original Pi Zero and it does work, and I get an image, but it locks up after a while, so some further investigation is needed there.

hoglet67 commented 2 years ago

Great progress Ian.

IanSB commented 2 years ago

I sorted out the Pi Zero lockup issue and it works just as well on the original Zero, with the same lower overclock requirements. Another benefit: the Zero 2W runs 10 degrees cooler compared to using the second ARM core for capture.

BTW are macros with local labels supported in vasm?

hoglet67 commented 2 years ago

BTW are macros with local labels supported in vasm?

According to the standard syntax section of the manual: http://sun.hasenbraten.de/vasm/release/vasm_3.html#Standard-Syntax-Module

Either a .prefix or a $ suffix to the label.

IanSB commented 2 years ago

The macro documentation states: "The special argument \@ inserts a unique id, useful for defining labels". But when I try the following macros:

.macro CAPTURE_PSYNC_LO
wait_psync_lo\@:
   ld     r0, (r4)
   btst   r0, PSYNC_BIT
   bne    wait_psync_lo\@
   and    r0, r7
.endm

.macro CAPTURE_PSYNC_HI
wait_psync_hi\@:
   ld     r1, (r4)
   btst   r1, PSYNC_BIT
   beq    wait_psync_hi\@
   and    r1, r7
.endm

I get the error:

error 2 in line 1 of "capture_psync_lo": unknown mnemonic <wait_psync_lo1@:>
        called from line 74 of "tubevc.s"
>wait_psync_lo\@:
error 1 in line 4 of "capture_psync_lo": illegal operand types
        called from line 74 of "tubevc.s"
>   bne    wait_psync_lo\@

In the first line of the error it has inserted a '1' into the label, so it is trying to do something. I can get the labels to work by using a standard argument and passing a different value for each macro call. However, the second macro also seems to be messed up, as I get the error:

error 2 in line 76 of "tubevc.s": unknown mnemonic <CAPTURE_PSYNC_>
>   CAPTURE_PSYNC_HI

With the macro name truncated. This happens even when the labels work using the argument method above. Any idea what the problem is?

hoglet67 commented 2 years ago

The version of VASM we have checked in to PiTubeDirect dates back to March 2018.

It might be worth trying a rebuild from the latest sources: http://sun.hasenbraten.de/vasm/

IanSB commented 2 years ago

I downloaded the latest stable release, and that fixed the \@ problem but not the second, truncated-name error. It seems vasm doesn't like two macros whose names share the same first part. I fixed it by renaming them to: LO_PSYNC_CAPTURE HI_PSYNC_CAPTURE

IanSB commented 2 years ago

Another update:

I've got capture working with c0pperdragon's simple board. This was slightly more difficult, as the hsync detection has to be done in the GPU as well, because it has to use the psync edges to sample the sync. (In CPLD mode the psync signal is suppressed during and slightly after sync, which leaves enough time for the ARM to sample the sync and hand over to the GPU for capture.)

It would be preferable to have CPLD sync detection in the GPU as well, but the sync time has to be measured for smooth sideways scrolling to work, and I haven't transferred it yet, as I'm not sure there is a suitable timer available in the GPU to measure this. (The CPU instruction cycle counter is used on the ARM for such measurements.)

Have you looked at timing things in the GPU at all?

Going forward, there is great scope for simplifying the capture code: at the moment there are a huge number of optimised capture loops for combinations of options, but now that the time-critical part is in the GPU, more generalised capture loops with multiple branches could be used instead.

The six-word MBOX buffer means that the capture loops can cope with up to 11 psync edges of latency, compared to just 1 at the moment (each word holds two 12-bit samples).

For comparison here are some more benchmarks showing the timing for accessing the GPIO and MBOX registers:

Original Pi zero:
ARM: GPIO read = 44ns, MBOX read = 40ns, Triple MBOX read = 47ns (15ns/word)
GPU: GPIO read = 31ns, MBOX write = 9ns
RAM: Cached read = 3ns, Uncached screen read = 96ns

Pi zero 2W:
ARM: GPIO read = 60ns, MBOX read = 55ns, Triple MBOX read = 120ns (40ns/word)
GPU: GPIO read = 35ns, MBOX write = 10ns
RAM: Cached read = 2ns, Uncached screen read = 115ns

The MBOX reads are about 5 ns faster than GPIO reads on the ARM side on both Zero versions. I also found that a triple MBOX read of three registers gives a significant speedup in the effective per-register read time, compared to reading the registers individually, using the following technique (assume r3 points to the triple register base, which is 64-bit unaligned):

ldr r2, [r3, #8]     //read r2 in a loop until psync toggles
ldmia r3, {r0, r1}   //unaligned read of two words is faster than separate ldrs

I haven't used this technique so far, as the single read works very well, but it does promise even faster reads. The main problem is that it introduces a minimum of 6 psync edges of latency, compared to a minimum of 2 edges with the current code. That's not a problem at 12bpp or higher, as it's only 6 pixels, but at lower sample rates with lower pixel clocks it is a significant delay (3bpp is 24 pixels, and with a 4 MHz pixel clock like the Ohio Superboard's that's a delay of 6 us), which might mean the renderer runs out of time before the next sync pulse. So it would only be useful at high bit depths and clock rates.
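The latency arithmetic in that last paragraph can be written out explicitly. This is a sketch of the reasoning above, assuming each psync edge carries one 12-bit sample, so one sample spans 12/bpp pixels:

```c
#include <stdint.h>

/* Pixels of latency for a given number of buffered psync edges:
 * each edge carries one 12-bit sample, which covers 12/bpp pixels.
 * Assumes bpp divides 12 exactly (3, 6 or 12). */
uint32_t latency_pixels(uint32_t edges, uint32_t bpp)
{
    return edges * (12 / bpp);
}

/* Convert a pixel count to nanoseconds at a given pixel clock. */
uint32_t latency_ns(uint32_t pixels, uint32_t pixel_clock_hz)
{
    return (uint32_t)((uint64_t)pixels * 1000000000ULL / pixel_clock_hz);
}
```

With 6 edges at 3bpp and a 4 MHz pixel clock this reproduces the 24-pixel, 6 us figure quoted above.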

hoglet67 commented 2 years ago

Have you looked at timing things in the GPU at all?

No, we haven't yet had to time anything in GPU land.

You should be able to use the System Timer (0x7E003004) or the ARM Timer peripherals from the GPU.

See here

Dave

hoglet67 commented 2 years ago

There also appears to be a Core Timer register (p12 = PRCORTIM) within the VP Core that can be read with the MOV rn, p12 instruction.

Dave

IanSB commented 2 years ago

The assembler didn't seem to recognise the pxx registers, but I did get it working by counting loop iterations as a low-resolution timing substitute. However, there seemed to be more jitter than when triggering from the sync on the ARM, so I've abandoned that idea for the moment. (The jitter is not related to counting loop iterations.)

I also decided to look at the Pi 4 again: It was partially working a year ago but seems to have gotten broken by subsequent changes although I did manage to get as far as the benchmark code with the following results:

ARM: GPIO read = 14ns, MBOX read = 13ns, Triple MBOX read = 28ns (9ns/word)
GPU: GPIO read = 35ns, MBOX write = 8ns

This time it looks like the ARM is much faster than the GPU for GPIO access.

The remaining issues for the Pi 4 were reading the SD card and reading the Pixel Valve registers (for resolution readback and genlocking)

Have you done any work on the Pi4 with PiTubeDirect that might help in this area?

hoglet67 commented 2 years ago

The Pi 4 uses a different GPU (VC6 rather than VC4), so I guess it's not surprising there are differences.

The BCM2711 peripherals documentation is here: https://datasheets.raspberrypi.com/bcm2711/bcm2711-peripherals.pdf

Compared to the BCM2835, it doesn't seem to have a documented eMMC controller, which I think is why the SD card is broken.

The address and format of the Pixel Valve registers are different.

Check the RPI4 defines here: https://github.com/hoglet67/PiTubeDirect/blob/hognose-dev/src/framebuffer/screen_modes.c

Dave

hoglet67 commented 2 years ago

This thread has some possible pointers for the Pi4 eMMC2 controller: https://github.com/raspberrypi/documentation/issues/1209

hoglet67 commented 2 years ago

And more specifically, this comment about selecting the legacy eMMC controller: https://github.com/raspberrypi/documentation/issues/1209#issuecomment-513797407

hoglet67 commented 2 years ago

The assembler didn't seem to recognise the pxx registers but I did get it working by counting loop iterations as a low resolution timing substitute. However there seemed to be more jitter that by triggering from the sync in the Arm so I've abandoned that idea for the moment. (jitter not related to counting loop iterations)

You could always just hand-assemble the instruction.

The format is documented:

1100 1100 001 d:5 0000 0000 000 a:5        mov rd, pa                     rd = pa

So it's just a case of working out the right 32-bit word.
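Working out that word can be mechanized. A small encoder following the format line quoted above (field widths and positions taken directly from that line; nothing else is assumed):

```c
#include <stdint.h>

/* Encode the VPU "mov rd, pa" instruction from the format
 * 1100 1100 001 d:5 0000 0000 000 a:5
 * i.e. bits [31:21] = 0b11001100001, [20:16] = d, [4:0] = a. */
uint32_t encode_mov_rd_pa(uint32_t d, uint32_t a)
{
    return 0xCC200000u | ((d & 0x1F) << 16) | (a & 0x1F);
}
```

For example, `mov r0, p12` (reading the core timer mentioned above) encodes as `encode_mov_rd_pa(0, 12)` = 0xCC20000C, which could be emitted with a data directive in place of the mnemonic.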

IanSB commented 2 years ago

This thread has some possible pointers for the Pi4 eMMC2 controller:

That worked and I can now read from the SD card.

I'm now actually getting images out of it, but there are still lots of issues: it works with firmware from Feb 2020 (which was our current release FW), but I updated that to support the Zero 2W and now it hangs just after initialising PLLA. Also, 12bpp modes don't work: these were achieved by manipulating the display list, and that might have moved as well.