ToDo list or list of differences?

iDoka commented 2 years ago

Hi Lawrie and thanks for that great work!

As you mentioned:

This implementation has been done from the specification, without access to any Raspberry Pi HDL. It is currently incomplete, but some programs run in simulation and on open source FPGA boards. The current method of configuring and controlling PIO from a top-level module is different from that used on the RP2040 chip, and will probably be changed for closer compatibility.

As you mentioned in README, your implementation slight incomplete and have differences than PIO in RP2040 chip.

Would be great to known todo bullets and/or details of differences between two implementations to better understanding how to achive almost 100% compatibility with PIO in RP2040 chip. 🙏🏻

Have a nice hacking!

lawrie commented 2 years ago

Sorry about the late reply. It is quite a while since I worked on this, so I don't remember very accurately.

I think there is an error in autopull/autopush where an extra cycle is being used that shouldn't be
Merging the results of multiple state machines to produce the output pin state is not implemented
Combining Fifos together to produce a deeper Fifo is not implemented

iDoka commented 2 years ago

Lawrie, thanks for your reply! Would be great to add this limitations in README.

xobs commented 1 year ago

Note that the second item in your bullet points -- merging the results -- is handled by the patchset in https://github.com/xobs/fpga_pio/commit/6170f345b9f5456ae65571f38534d053550d43d5

xobs commented 1 year ago

Also note that interrupts do not appeared to be wired up -- irq0 and irq1 are not attached to anything.

lawrie commented 1 year ago

Sean, many thanks for making the improvements - I have merged the pull request. It is exciting that this in going to be included in the MPW-8 shuttle run.

bunnie commented 1 year ago

Do you happen to have any more info on point 1?

I think there is an error in autopull/autopush where an extra cycle is being used that shouldn't be

Any hints on how I could try to produce/test for this particular condition, and what sorts of things it would impact?

lawrie commented 1 year ago

Bunnie, It is some time since I worked on this, so I cannot remember the full details. I remember when I ran the simulations or looked at output with a logic analyser or scope, in some cases there was an extra clock cycle that shouldn't have been there, so a transition took an extra clock cycle. I seem to remember that it occurred with the spi example, but it probably happens on anything that uses autopull or autopush.

bunnie commented 1 year ago

Thanks! That's very helpful!

bunnie commented 1 year ago

I think I found what you''re talking about:

Does this extended clock cycle at the end of the SPI transaction jog your memory? Mostly just want to make sure I am covering the base that you described in the to-do list. If this doesn't seem like the issue you were worried about, I will keep searching for it.

Regardless, the fix to this is basically to "look-ahead" during an auto-pull. I think the problem is right now that when you do an auto-pull from a FIFO, it fetches the value into the output shift register, and then it pushes it to the pins, so the instruction ends up taking an extra cycle longer to execute than it should. I added a set of bypasses that, in the case the OSR is being loaded, it "reaches around" to the input value and presents it to the output (it also pre-shifts the value as well). Basically I hoist the shift computation an cycle earlier. This is going to be pretty terrible for the critical path (effectively doubles it) but also I think the block doesn't need to close at such tight timing levels for my application, so this is fine.

The actual fix is a bit complicated, and I've had to layer a lot of changes on your API anyways to integrate the block for testing (I implemented the full RPi-compatible register set so I can run their test programs directly on my RV32 core), and I also implemented some of the other to-do's that were missing on your list. However, this is the specific commit that seems to fix it:

https://github.com/buncram/fpga_pio/commit/ebc18c527d1212ff063f13c37f42216dfc842380

I'm not quite sure how to merge these larger changes back into this repo, because the client I'm doing this for is requiring me to wire this into an APB bus (but a least it's all permissively open source licensed), so the top level pio.v file has forked into this systemverilog monstrosity:

https://github.com/buncram/cram-soc/blob/cram/deps/pio/pio_ahb.sv

Once I get everything working, I'll try to clean it up and turn it into a module that can be cleanly integrated as a LiteX peripheral with some rube-goldberg AXI-to-APB bridge in between, but I have a lot more test vectors to write and run...

lawrie commented 1 year ago

Again, I can't remember the full details, but what you have identified in the SPI example looks like what I remember seeing. I think I did look briefly at a solution to the autopull/push issue and it did not look simple. Your solution sounds convincing.

If I had continued working on the project I would have got it working with an RV32 core. I mainly use picorv32 or VexRiscv. There are a variety of buses that could be used for such integration.

I am very pleased to see all these fixes and improvement done to my code. I merged your pull request.

I did look at how FIFO chaining might be done and it didn't look too hard, but I did not get round to doing it.

Sean Cross will probably be interested in your fixes and improvements as he incorporated the code in the Google Skywater MPW-8 shuttle.

I was originally going to use Amaranth HDL (nmigen) rather than Verilog, but I decided to try a Verilog version first. I might still be interested in doing an amaranth version sometime.

bunnie commented 1 year ago

Just as a side note for anyone reading along, I did find an interesting bit of documentation on pages 336/337 of the RP2040 manual, that solves the autpull extra cycle mystery. Instead of hoisting the shift in the OSR and pre-computing everything (which creates a horrific critical path), it turns out they do a look-ahead computation of the count, and if it looks like on the next iteration they need to do a pull they trigger the pull simultaneous with the OSR update from the OUT instruction. Basically a small pipelining trick.

Unfortunately adding the look-ahead count output means I had to change the module interfaces, which means a PR will not apply cleanly into the repo as-is, so for now I'm working along a fork of this.

lawrie / fpga_pio

ToDo list or list of differences? #1