hermanhermitage / videocoreiv

Tools and information for the Broadcom VideoCore IV (RaspberryPi)

dual core VPU? #14

Open fanoush opened 5 years ago

fanoush commented 5 years ago

Hello. Are there actually two VPU cores? There are several hints about it in different places, but nothing definite. If yes, do you know how to start the second one, or how it behaves at boot time? Are they equivalent and can they do the same stuff?

christinaa commented 5 years ago

Yes, there is a 2nd core. The early code in my firmware (IIRC) actually checks if it's on core 0 or core 1, and if it's on core 1 it asks cprman to shut it down and then goes into a loop. Since my firmware is only an SPL-type firmware, it does not use the 2nd core. As for the second question, I guess they're equivalent, but I never personally looked into the MMIO ranges responsible for multicore execution, or any ISA extensions for it.

fanoush commented 5 years ago

Thank you. I later also checked Brcm_Android_ICS_Graphics_Stack.tar.gz released by Broadcom, and there is interesting stuff in brcm_usrlib\dag\vmcsx, e.g. vcinclude\hardware_vc4.h (and everything in vcinclude\bcm2708_chip). vmcsx\helpers also has tons of interesting VPU assembly. Some include files mention VPU0 and VPU1, so yes, there do appear to be two VPU cores after all. I was just surprised that no details are documented here.

BTW, thanks for your firmware; together with the vc4 gcc toolchain it is a very good start. I am thinking about porting MicroPython, Espruino, or another small interpreter to the VC4 VPU - the $5 Zero may be good enough for typical 'arduino' stuff even with just the VPU running and the ARM core turned off. Another reason is to have something to poke VC4 registers from - a scripting language to figure things out more interactively than compiling C fragments over and over.

the early code in my firmware (IIRC) actually checks if it's on core 0 or core 1 and if it's on core 1 it asks cprman to shut it down and then goes into a loop

Could you point me to it? I checked your code before and noticed that the _main method takes 'unsigned int cpuid', but it runs the uart and sdram init code https://github.com/christinaa/rpi-open-firmware/blob/master/romstage.c#L130 and main is called from https://github.com/christinaa/rpi-open-firmware/blob/master/start.s#L117, and I did not see any code there that stops the second core or runs different code based on cpuid (or another id?). That was the second reason I was not sure there are indeed two cores running. If it is somewhere later, does that mean the sdram, pll and uart init code is called twice - once by each VPU core?

BTW, what is your source of VC4 info (if you can/want to answer)? There are a lot of magic constants in your code, which makes it hard to figure things out (pll, sdram setup), but you obviously know it from somewhere.

thubble commented 5 years ago

The second core is initially not executing (although based on what Kristina said, it may be powered on?). To start it, simply write the start address to IC1_WAKEUP (0x7e002834) and it will immediately start executing there.
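
In C that would amount to something like this minimal, untested sketch (the pointer and core1_entry names are mine; only the IC1_WAKEUP address comes from above):

#include <stdint.h>

#define IC1_WAKEUP ((volatile uint32_t *)0x7e002834)   /* wakeup register for core 1 */

extern void core1_entry(void);   /* hypothetical entry point for the 2nd core */

static void start_core1(void)
{
    /* core 1 sits idle until an address is written here, then immediately
       starts executing at that address */
    *IC1_WAKEUP = (uint32_t)(uintptr_t)core1_entry;
}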

The currently-executing core is determined by bit 16 in the version instruction value (set = core1, clear = core0). The first thing the default bootcode.bin executes is this:

version r0          ; read the core ID / version value into r0
btest r0, 16        ; test bit 16 (set on core 1, clear on core 0)
bne L_Core1Entry    ; branch to the core-1 entry if the bit was set
;Core0-only code here

As far as I'm aware the 2 cores are identical. There is only 1 vector register file, so all vector code uses mutexes in the default firmware.

fanoush commented 5 years ago

Oh, thank you. That's interesting. And it is great someone is listening :-)

I also have some questions unrelated to the topic that you may know the answer to. How can I control the 128K L2 cache after I enable DRAM, since I then plan to turn it off? Is it always at the same address as at boot? I guess not, since it is possibly just a prepopulated mapping for L2-cached address 0 (?) without any backing store (?).

I saw there is also some bootrom RAM area (?) that could be used for running code? Are there other spare memory-mapped SRAM buffers that could possibly be (ab)used for data or code, e.g. memory for USB endpoints or the like? Basically I am checking how much RAM there is without enabling SDRAM, or when it is put to sleep. Also, the bootcode.bin code starts at a nonzero offset - is the memory above it usable? Why doesn't it start at offset zero then?

christinaa commented 5 years ago

The entire VC4 side of my firmware runs in VPU cache, which is 128K if I recall correctly. ARM stuff runs in SDRAM since it cannot run in that mode. If you want to load a second-stage firmware (like start.elf) onto the VPU, you would have to copy the bootcode into an SDRAM region without cache (the whole address space is partitioned into 4 "mirrors"; in other words, 2 bits of the address determine the cache properties of that access).
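
As a rough illustration of the mirror idea - assuming, as with the usual BCM2835 bus-address convention, that the top two address bits select the alias and that the 0xC alias is the uncached one (the names below are mine, not from the firmware):

#include <stdint.h>

#define VC4_ALIAS_MASK      0xC0000000u   /* top two bits select the mirror  */
#define VC4_ALIAS_UNCACHED  0xC0000000u   /* assumed direct/uncached mirror  */

/* map an arbitrary pointer to its uncached alias */
static inline void *uncached_alias(void *p)
{
    uintptr_t a = (uintptr_t)p & ~(uintptr_t)VC4_ALIAS_MASK;
    return (void *)(a | VC4_ALIAS_UNCACHED);
}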

Once the VPU is running in RAM, changing anything about RAM or accessing the SDRAM controller requires an undocumented cache dance before it becomes cooperative (i.e. before you can manually do stuff like query the MR registers), which is roughly this (the code must be writable and executable). You will have to do that every time you want to do something like reclock SDRAM, once you have disabled cache-as-RAM the first time. Also, I think the cache-as-RAM to RAM transition requires a similar trampoline (this is much, much smaller than 128KB):

Note: 
   CLSCn = Cache Line Sized Chunk n
   ECLSCn = End of Cache Line Sized Chunk n

[       Start       ]
[SetCond,JmpTo CLSC1]
[      CLSC1        ]
[     FUN_PART      ] <- If condition is not set the stuff below will actually run fully.
[CondJump to  ECLSC1]
[Code/SDRAM disable ] <- Code doesn't exec with cond, just jumped over to prime cache. 
[  New SDRAM param  ] <- Just data, jumped over by code regardless
[      ECLSC1       ]
[      CLSC2        ]
[CondJump to  ECLSC2]
[       Code        ] <- Same, jump over, without executing.
[      ECLSC2       ]
[      CLSC3        ]
[CondJump to  ECLSC3]
[       Code        ] <- Etc ...
[      ECLSC3       ]
...... etc etc ......
[      CLSCn        ]
[CondJump to  ECLSCn]
[      Code         ] <- Will reenable SDRAM when runs
[      ECLSCn       ]
[    UnsetCond      ] <- All the cache aligned chunks are in cache now
[   JmpTo FUN_PART  ] <- Fun part begins: run same code from cache.

During the dance the ARM is stalled, and the VPU will lock up if it accesses any memory while the SDRAM controller is off, hence the need to copy the data into that region. I don't know why bootram is not used for this; either it's too expensive to copy code there and run it from there, or it requires fully enabled cache-as-RAM, which may require a lot more teardown.

(I'll note that I did attempt that, and once the SDRAM controller is on, doing the above from bootram locks up when trying to access some of the SDRAM controller MMIOs (especially when trying to use the MR registers), but the above somehow doesn't. Who knows why.)

This hardware is odd.

fanoush commented 5 years ago

There is only 1 vector register file, so all vector code uses mutexes in the default firmware.

This was just explained here and in the follow-up post https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=234167#p1432851 - so there are 2 register files but one vector unit.

EDIT: and also here, about how it is shared between the VPUs: https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=234167#p1438883

thubble commented 5 years ago

There is only 1 vector register file, so all vector code uses mutexes in the default firmware.

This was just explained here and in the follow-up post https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=234167#p1432851 - so there are 2 register files but one vector unit.

EDIT: and also here, about how it is shared between the VPUs: https://www.raspberrypi.org/forums/viewtopic.php?f=29&t=234167#p1438883

Ah, thanks - that clears some things up! I noticed in the stock firmware that a lot of VRF-accessing code uses locks ("vclib_obtain_VRF()"). I assumed it was to avoid the 2 cores writing to a single VRF simultaneously, but now that I think about it, 2 threads on the same core would have the same issue.
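
Just to make the locking pattern concrete, a sketch of how such a lock would typically wrap vector code - assuming the obtain call is paired with a matching release call; I have not checked these prototypes against the stock sources:

/* hypothetical prototypes for the lock pair discussed above */
extern void vclib_obtain_VRF(void);
extern void vclib_release_VRF(void);

static void do_vector_work(void)
{
    vclib_obtain_VRF();    /* serialise access to the shared VRF           */
    /* ... VPU vector (VRF) code would run here ... */
    vclib_release_VRF();   /* let the other core (or thread) use it again  */
}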

DarumasLegs commented 4 years ago

I have a random question for this audience - I am looking for a way to get super accurate (sub-millisecond latency) timestamps for individual video frames while running Raspivid and/or Picamera. I am developing a multi-camera video system using RPi Compute Modules and RPi Camera Modules v2.1, and I need to log timestamps from an RTC on the Compute Modules as closely as possible to the time the camera sensor either starts or finishes imaging a frame. As it is now, with Raspivid and Picamera, the Presentation Timestamps are not fine-grained enough (I need millisecond accuracy), and they are only captured when the frame makes it through the GPU to the CPU. I want a signal, either from the camera module directly or from the GPU, as closely as possible to the time the light hit the image sensor and the sensor imaged each frame. Is this possible?

phire commented 4 years ago

want a signal [snip] as closely as possible to the time the light hit the image sensor and the sensor imaged each frame.

This isn't really possible. The camera module is a CMOS sensor and there isn't really a single time for either of those events.

Light is collected for many milliseconds, quite possibly a full 16ms (a shorter collection results in less blur, but a less accurate representation of the light). The camera module signals the pixels to start collecting light and they will collect light until the camera module tells them to stop, summing the result into an analog value.

Then there is the scan-out process. With light collection stopped, the camera module will scan across each row, one pixel at a time, reading the stored charge with an analog-to-digital converter. These digital values are streamed down the cable to the SoC.

At the highest resolution/framerate, this scan-out process will take a full 16ms (when running in 60fps mode) to read the entire frame of data. Why a full 16ms? Because if it took less time to scan out, the module would support a higher resolution/framerate.

Which brings up another issue... If it takes a full 16ms to collect light, and a full 16ms to scan out a full frame of data, then how is it doing both at the same time?

For a cheap camera module like this, the answer is a rolling shutter. Assuming 60fps again, the first line will end its 16ms of exposure at 0ms and then be scanned out over 0.016ms. Then at 0.016ms, the second line will stop its 16ms of exposure, the first line will start its next 16ms of exposure, and the second line will be scanned out.

The very last line of the frame won't be scanned out until about 16ms after the first line, and will have collected its light over an almost completely different time period than the first line.

This works fine for still images, but creates a noticeable distortion when you have objects moving quickly across your frame.

All the numbers in this comment are simplified and rounded based on generic CMOS sensors, but will hopefully get my point across.
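
To make that arithmetic concrete, a rough sketch of deriving a per-line timestamp under those simplified assumptions (readout spread evenly over the frame period; all names here are mine):

#include <stdint.h>

/* timestamp (ns) at which a given line finished scan-out, given the
   timestamp of the first line and the frame geometry */
static uint64_t line_timestamp_ns(uint64_t first_line_ns,
                                  unsigned line, unsigned total_lines,
                                  uint64_t frame_period_ns)
{
    return first_line_ns + (uint64_t)line * frame_period_ns / total_lines;
}

/* e.g. at 60fps (frame_period_ns about 16666667) with 720 lines, line 719
   comes out roughly 16.6ms after line 0 */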

However, for your use case you might want to look into the hacked high-framerate recording modes, which sacrifice resolution and noise for much higher framerates and might get you millisecond resolution.

DarumasLegs commented 4 years ago

Thank you for the rapid and thoughtful reply - I really appreciate it!

I understand that frames are imaged line by line with a rolling shutter. My use case only requires 720p30fps, and I need accurate timestamps for the frames within 1/30 sec (within one frame). Is it possible to know precisely when the imaging for the first line begins - or, alternatively, the imaging for the last line ends? Either when the light is collected or when the scanning begins? I don't mind a little latency (particularly if it's within 1/30 second) as long as it's constant and deterministic. My software can adjust the times if necessary after the video files are uploaded to my application in the cloud.

phire commented 4 years ago

Cool, if your use case is OK with rolling shutter then you should be able to get something working.

One potential option might be to use a really short external synchronisation pulse of light to measure the timing, triggering it with a GPIO.
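
A rough sketch of that idea (my own, not from raspiraw): fire a short LED pulse from a GPIO via the legacy sysfs interface and log the moment it fired, so the flash shows up in a known frame. GPIO 17 is just an example pin, assumed already exported and configured as an output:

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/sys/class/gpio/gpio17/value", O_WRONLY);
    if (fd < 0) return 1;

    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);   /* timestamp of the pulse */

    write(fd, "1", 1);                     /* LED on */
    usleep(500);                           /* ~0.5ms pulse */
    lseek(fd, 0, SEEK_SET);
    write(fd, "0", 1);                     /* LED off */
    close(fd);

    printf("pulse at %ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}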

As for a software solution, unfortunately I know next to nothing about the CSI2 interface or the ISP block, but take a look at the raspiraw source code.

It directly controls the CSI2 interface and the camera modules via I²C. I think it DMAs camera data directly into userspace memory before writing it to disk, bypassing any processing or latency inherent to the ISP block. It also outputs timestamps with a resolution of several microseconds.

Take a look - either you can use it directly for your use case or modify it to meet your needs.