marqs85 / ossc

Open Source Scan Converter
http://junkerhq.net/xrgb/index.php/OSSC
GNU General Public License v3.0
470 stars 63 forks source link

Add a PSP optimized Line3x mode. #52

Closed eatnumber1 closed 4 years ago

eatnumber1 commented 4 years ago

This change should not be merged as-is. Also please review https://github.com/marqs85/ossc/pull/50 and https://github.com/marqs85/ossc/pull/51 first, as this change builds on those.

I'm getting some odd pixel flicker in this mode. You can see an example here (make sure to set the quality settings to max).

What this does is 3x vertical, 2x horizontal. The resulting pixel clock should be 162 MHz (according to my timings here), which I think both the IT6613 and TVP7002 support. Here's my math for the 162 MHz number: (1/0.0370373094638695)*2*3 = 161.9988084138 MHz.

The changes I made here are quite messy, but serve to increase fmax of pclk_3x to 183.25 MHz, above the 162 MHz desired clock rate. Unfortunately, that change did not clean up the pixel flicker.

Interestingly, if instead of doing sample_mult = 2, if I instead do TX_PIXELREP_2X, I get no flicker. So it seems like the cause of the flicker is likely not the IT6613 chip.

My intent is that if I can get this working properly, I'd then be able to decouple the pixel clock multiplier from the active region multiplier, and pixel multiply into the overscan. With a 2x horizontal pixel clock, there is enough overscan for a 3x horizontal active region multiplier.

With these changes, here's the full set of fmax timings:

Fmax Restricted Fmax Clock Name
33.8 MHz 33.8 MHz clk27
35.49 MHz 35.49 MHz altera_reserved_tck
131.46 MHz 131.46 MHz pclk_27mhz
131.46 MHz 131.46 MHz pclk_2x
136.78 MHz 136.78 MHz pclk_1x
136.78 MHz 136.78 MHz pclk_5x_source
136.78 MHz 136.78 MHz pclk_4x_source
136.78 MHz 136.78 MHz pclk_2x_source
136.78 MHz 136.78 MHz pclk_3x_source
153.61 MHz 153.61 MHz pclk_4x
170.62 MHz 170.62 MHz pclk_5x_postmux
170.62 MHz 170.62 MHz pclk_4x_postmux
170.62 MHz 170.62 MHz pclk_3x_postmux
170.62 MHz 170.62 MHz pclk_2x_postmux
170.62 MHz 170.62 MHz pclk_27mhz_postmux
170.62 MHz 170.62 MHz pclk_1x_postmux
178.73 MHz 178.73 MHz pclk_5x
183.25 MHz 183.25 MHz pclk_3x
marqs85 commented 4 years ago

Similar issues have been reported occasionally in certain builds, and we have concluded them to be setup violations in FPGA reg-reg paths (reducing multiplier / sampling rate helps). Whether they occur depends a lot on silicon, and also on the build itself. Even though the design would meet timing accoring to analyzer, some builds are more prone to these. We have usually re-run compilation with different fitter seed as a workaround. For ~183MHz pixel clock, I think actual RTL changes are needed which optimize timing.

It might be worthwhile first trying out the test pattern at similar pixel clock just to ensure there are no bottlenecks in IO timing and that IT6613 can handle pixel clock which is 13% over its spec.

eatnumber1 commented 4 years ago

Sorry to be clear, 183 MHz is the new fmax, but I believe that I am only driving it at ~162 MHz, which seems to be the documented max of the IT6613.

If these issues are caused by setup violations in reg-reg paths, I'd imagine that it should be possible to get these to show up as timing violations if I can properly express them as constraints, but that would require pinning down exactly where the setup violations are occurring.

I'll try with a different fitter seed to see if that resolves things, and try a bit harder at RTL improvements.

Also, what test pattern are you referring to? The built-in one? I have tested with a black/white checkerboard pattern on the PSP itself, which interestingly does not show the issues. It's mostly only on pictures with lots of varying color that the issues surface most clearly.

marqs85 commented 4 years ago

Ok, 162MHz should generally work ok since some 5x modes already reach that. Ideally the violations should flag up if the underlying logic timing models are accurate. Perhaps the automatically calculated clock uncertainty is insufficient (likely) or they are masked by invalid constraints (unlikely).

Overall the RTL in scanconverter.v has evolved to a mess over the years which unfortunately makes it harder to debug/optimize. I'm in the process of rewriting it for the Pro model which simplifies it significantly as is already evident in the current version. Perhaps it can be eventually backported on the original model even though its PLLs are much more limited.

With test pattern I mean the build-in boot-up pattern, yes. Most of it are solid color or mix/max B/W values which may make it hard to e.g. see violations in LSBs, though. Another thing to try is bypassing the linebuffer (in normal mode) in order to see if it could be the bottleneck.

eatnumber1 commented 4 years ago

Tried three different seeds, all had the same problem. Will try reproducing the issue using the test pattern.

marqs85 commented 4 years ago

How much you need to decrease H. samplerate in order to make the issue disappear? That could give some indication how much setup uncertainty should be applied.

eatnumber1 commented 4 years ago

That's a good idea. Looks like I need to decrease it from 858 to around 770. So that's a clock speed of (770*525*59.94*2*3) about 145 MHz, or a delta of about 6.8ns.

eatnumber1 commented 4 years ago

Sorry, the delta is 0.7ns (((1/(770*525*59.94*2*3))-(1/(858*525*59.94*2*3)))*1000000000)

marqs85 commented 4 years ago

145MHz vs 170.62MHz (pclk_3x_postmux on the table) is relatively big difference, especially considering fmax is calculated using slow silicon at 85C model. Could it be that the problem is in IO timing? Perhaps you should try adding that 0.7ns to IT_Tsu in ossc.sdc and see if those paths can be optimized and whether it helps-

eatnumber1 commented 4 years ago

Yeah, I think the issue is in the output timing. I just got a (very messy) build that works by ripping out a bunch of the post processing pipeline and by setting "fast output" on the HDMI_TX pins (ref). I need to do some work to figure out exactly which change I made fixed things though, but I'll do that another day as it's 5am here :).

eatnumber1 commented 4 years ago

I've been testing out the limits of my local changes, and I'm able to get a glitchless picture up to a H. Active of 991, which equates to a clock rate of 991*525*59.94*3*2 = 187.11 MHz. At 992 the picture cuts out, which reads to me like an issue with the IT6613 chip rather than the FPGA, although I could be wrong there.

Sadly, that is very slightly below the 189 MHz that #56 was asking for.

It just might be possible to eke a little bit extra out of the IT6613 chip by reducing the color bit depth (aka #42), but that will only affect the HDMI output side of the chip, as the input side is fully parallel.

eatnumber1 commented 4 years ago

On the output side, the IT6613 is documented as having a maximum TMDS link clock rate of 225MHz, although I don't know TMDS well enough to calculate what the effective clock rate is given the inputs that I'm driving it at.

I re-enabled all of the logic on the output path I had previously disabled (post-processing, test pattern, etc), and it still works. My remaining changes are the "fast output" setting on the HDMI_TX pins and a bunch of RTL changes I made to increase fmax in line3x. My RTL changes make the code much less readable, and unfortunately I don't think there's an easy way to preserve the performance improvements without taking the readability hit, or without rearchitecting the entire scanconverter logic pipeline. Maybe for OSSC Pro?

Still need to clean up my changes, and test if my RTL changes truly are necessary. That's for another day though.

marqs85 commented 4 years ago

What IT_Tsu you are using and how much slack you see on reg->out paths on the current run? If IO is the only bottleneck, then maybe those fast output settings are all that is needed (at least for the ~160MHz pclk in this ticket).

eatnumber1 commented 4 years ago

I've updated this PR with my current changes.

This is without my RTL changes, just the fast output settings.

With my RTL changes, using an IT_Tsu of 1.5 and an IT_Th of 0.3, everything works and I get no timing violations.

With the current code I have in this PR, I've removed my RTL changes, leaving just the fast output settings. With this current code however, there's setup timing violations of about -1.13ns slack in pclk_3x, but the fw image does work with my PSP without glitches.

I'm not sure what the impact is of these pclk_3x timing violations is. Like I said, this works for me, but with these current changes I also increase the input clock speed of pclk_3x_source to 54 MHz, which is what causes these timing violations to surface.

Should I go ahead with this PR in spite of the timing violations? Or would you like me to put back my RTL optimizations? Or something else?

marqs85 commented 4 years ago

This PR is OK as-is, I checked that the remaining violations are on reg2reg paths and should be able to fix those by precalculating some values.

eatnumber1 commented 4 years ago

Oh, okay, that works then. :)

Just so you know I had put the words "temporary commit" in 2995f437287228cd98f47cf73537aa73559a2459 because 2x by 3x for the PSP is non-aspect ratio preserving, and I thought you'd not want that as the line3x mode, so I had intended to amend the commit to disable the ability to select it for this PR.

As I mentioned, I intend to also try and implement 3x by 3x via drawing into the overscan, at which point I would have re-enabled line3x, but I haven't started implementing that, so it may take me some time.

All that said, with my 4k TV which repairs the AR when it upscales the rest of the way, the current 2x by 3x line3x mode looks good, much sharper than the line2x mode.

marqs85 commented 4 years ago

Missed that message in the other commit, but it's reverted now.

marqs85 commented 4 years ago

I looked into the fast output settings more closely and became a bit concerned that there's almost zero hold slack on reg->out after the changes (which included 200ps hold constraint relaxation). Some of the FPGA ports connected to HDMI TX are VREF pins which have higher IOBUF delay (when used as IO instead of voltage reference). In general these rarely become hold-critical, but they had very little setup slack with the original SDC. For the other ports, the situation is opposite. After some testing I figured that a much better overall setup/hold balance (>0.4ns setup&hold slack with 1.5ns/0.5ns constraints for IT6613E) could be achieved by just enabling fast output settings for VREF pins:

set_instance_assignment -name FAST_OUTPUT_REGISTER ON -to HDMI_TX_RD[1] set_instance_assignment -name FAST_OUTPUT_REGISTER ON -to HDMI_TX_RD[5] set_instance_assignment -name FAST_OUTPUT_REGISTER ON -to HDMI_TX_RD[7] set_instance_assignment -name FAST_OUTPUT_REGISTER ON -to HDMI_TX_GD[7] set_instance_assignment -name FAST_OUTPUT_REGISTER ON -to HDMI_TX_BD[0] set_instance_assignment -name FAST_OUTPUT_REGISTER ON -to HDMI_TX_BD[3] set_instance_assignment -name FAST_OUTPUT_REGISTER ON -to HDMI_TX_BD[7]

Could you verify that works with your board without artifacts?

eatnumber1 commented 4 years ago

Thanks Marqs, I'll test this weekend if you're willing to wait that long.

If not, I should be able to make some time.

marqs85 commented 4 years ago

No rush, take your time.

eatnumber1 commented 4 years ago

Interesting, I didn't know about the VREF performance difference.

I've tested with your suggested changes, and it works great. I'm even able to restore the original constraints, which I also realized that you got those from the IT6613 datasheet.

I've just sent you #61 which implements these changes.

marqs85 commented 4 years ago

Okay, good to hear it works. I'd still add that 0.5-0.7ns margin on top of IT_Tsu since you witnessed issues with the original constraint even though timing checks were clean. It should be no problem to meet the timing now that the VREF pins are taken care of - most likely there's already at least that much positive setup slack on IO.

eatnumber1 commented 4 years ago

Sure, I've put back the IT_Tsu margin in #61, but doing so does reintroduce a small timing violation:

pclk_3x setup -0.037ns

marqs85 commented 4 years ago

That's fine since pclk_3x domain is for reg-reg paths - any small change in RTL or constraints effectively resets the seed fitter uses for initial placement and can thus change timing elsewhere. I'll just iterate the seed to get rid of small violations when building the release image.