hoglet67 / RGBtoHDMI

Bare-metal Raspberry Pi project that provides pixel-perfect sampling of Retro Computer RGB/YUV video and conversion to HDMI
GNU General Public License v3.0

24 BPP support by multiplexing into the existing 12 color inputs #191

Open matthiesenj opened 3 years ago

matthiesenj commented 3 years ago

I am looking at providing 24 bit support (mainly for C= Amiga) by multiplexing the 12 GPIOs used for 12 bit color. Basically, an output on the Pi will switch between the 4 high/low color bits of each gun before reading them. I suppose the firmware already extends existing data to 8 bits per gun (for HDMI) by repeating the input bits 8, 4, or 2 times?

I had a look at the code, but didn't find a point of attack, especially for the mux select output pin - it seems all GPIOs are already in use, so is there one which can be repurposed for this? The JTAG TDI pin (GPIO0), maybe, which isn't used outside CPLD programming, and won't be needed for mux'ing while CPLD is being programmed?

If a suitable pin could be found, the 12 BPP mode could just always toggle this mux output to support 24 BPP - if the hardware doesn't alternate the inputs, it would just result in the regular 12 bit pixel.

hoglet67 commented 3 years ago

Even if you could find a spare GPIO pin, I very much doubt there is sufficient memory bandwidth on the Pi to deal with copying 24BPP data into memory (the 12-bit mode used a 16 BPP framebuffer).

matthiesenj commented 3 years ago

Hmm.. That's a bit of a bummer.. Is this due to the interface between the cpu and the rest of the chip? I mean, the thing is capable of 1080p (which is obviously not needed here), so the memory bandwidth in itself should be good.

I've noticed indications in this project for running the code on the zero's bigger brothers, I suppose that would help with this issue? Edit: This project supports very large hdmi resolutions, is this accomplished using gpu scaling of a much smaller resolution framebuffer?

hoglet67 commented 3 years ago

Is this due to the interface between the cpu and the rest of the chip? I mean, the thing is capable of 1080p (which is obviously not needed here), so the memory bandwidth in itself should be good.

The bottleneck is data being copied (and reformatted) between GPIO and framebuffer memory using the ARM core. Reading GPIO is pretty slow, and the framebuffer is in uncached memory. There is very little slack in this process with a 16-bit frame buffer and with the higher pixel clocks used by the Amiga.

I've noticed indications in this project for running the code on the zero's bigger brothers, I suppose that would help with this issue?

Actually it doesn't. The memory timing of the multicore systems is much more variable, and for this application (which is very sensitive to memory latency) they don't work nearly as well. That's why they are not officially supported.

This project supports very large hdmi resolutions, is this accomplished using gpu scaling of a much smaller resolution framebuffer?

Yes - the frame buffer typically matches the resolution of the original system (or sometimes 2x the vertical resolution), and the GPU does the scaling.

By all means do some experimentation yourself, but I think you will be disappointed.

matthiesenj commented 3 years ago

The bottleneck is data being copied (and reformatted) between GPIO and framebuffer memory using the ARM core. Reading GPIO is pretty slow, and the framebuffer is in uncached memory. There is very little slack in this process with a 16-bit frame buffer and with the higher pixel clocks used by the Amiga.

Ok, sounds like a no-go then, unless some different way of doing it can be found - isn't it possible to enable the L1 write cache for the framebuffer so flushes could be done in bursts? This would help with latency (presuming memory bandwidth itself is plentiful). I read that there's also a small "write buffer" between the ARM and the rest of the chip - is this also circumvented when writing to the framebuffer?

I've noticed indications in this project for running the code on the zero's bigger brothers, I suppose that would help with this issue?

Actually it doesn't. The memory timing of the multicore systems is much more variable, and for this application (which is very sensitive to memory latency) they don't work nearly as well. That's why they are not officially supported.

I wasn't thinking of multicore, but rather the higher cpu and memory clocks.

IanSB commented 3 years ago

@matthiesenj

isn't it possible to enable L1 write cache on the framebuffer so flushes

Yes, that is something I intend to look at someday (i.e. mark a video line as cacheable and then force a flush at the end of the line) but the main issue is the slowness of reading the GPIOs: I've measured the time to execute a single GPIO read and it's of the order of 45ns without overclocking. A single pixel cycle on an Amiga is of the order of 70ns so there is no time to consistently do a double read and that is the main bottleneck.

I read that there's also a small "write buffer" between the arm and the rest of the chip

The capture code is crafted to make maximal use of the write buffers, and the STM instruction used to write to the screen memory is not stalled as a result, so there hasn't been any urgency in investigating the cache option; it doesn't provide much extra benefit.

I wasn't thinking of multicore, but rather the higher cpu and memory clocks.

Memory timings on the Pi4 might be better, but I haven't got it fully working on that yet due to hardware changes. On the Pi2 & Pi3 they are actually slower for uncached memory writes, so those boards don't perform as well as the zero. I suspect the Pi4 is just as slow as the others for GPIO reads, so again it's unlikely to help.

Setting the Overclock core option in the settings menu does speed up GPIO reads. That has to be done on the Amiga to provide a little headroom, and even more so with the Atari ST due to its 16MHz pixel clock, but you would never be able to get it working fast enough for double-rate GPIO reads.

The only other performance option might be to read the GPIOs from the VPU instead of the ARM, which would provide some parallel processing and allow more manipulation of the video data, but again it is unlikely to speed up the GPIO reads themselves.

matthiesenj commented 3 years ago

I've measured the time to execute a single GPIO read and it's of the order of 45ns without overclocking. A single pixel cycle on an Amiga is of the order of 70ns so there is no time to consistently do a double read and that is the main bottleneck.

Yes, that's definitely problematic for double reads. I asked c0pperdragon how the data was read, and he more or less confirmed that the Pi polls the GPIO bank in a loop looking for a transition of the pixel clock and, when one happens, feeds the rest of the word on for pixel processing. If each polling cycle is ~45ns, I suppose this also contributes to the synchronization issues of those Amiga adapters regarding pixel clock - there isn't much room to have the flip-flop and pixel clocks far apart to ensure correct sampling.

The only other performance option might be to read the GPIOs from the VPU instead of the ARM which would provide some parallel processing and allow more manipulation of the video data but again it is unlikely to speed up the GPIO reads themselves.

I have done some GPIO coding for the Pi3/4 before, and I got the impression that the slow GPIO is due to the CPU having to cross over to the peripheral domain, and that when using DMA or the VPU, GPIO access might be much faster. DMA can be used to write to the GPIOs, but I don't know whether it can be set up to read them, possibly using a pin edge trigger.