Optimizing circuits for speed (and High-Z state on FPGAs)

nevercast commented 6 years ago

Hello, a friend and I are working on a CPU (I'm writing the assembler, he's doing the digital design), and we are looking to speed up our ALU. It currently runs at around 100 kHz, and while this is very fast for what it is, it would be awesome if we could squeeze more speed out of it.

I've looked through the source code of Digital, and it looks like everything in the /switching tree modifies the model when it's switched. Is this an expensive process? If we were to start using Wired-AND/Wired-OR instead of gates, are there cases where this may evaluate faster than the gates provided?

My thought is that if the model doesn't change, the FETs may simplify the model for several clock cycles while parts of the circuit are effectively disconnected (this is my understanding of the model). Then, when the FETs are enhanced, you join the wire nets, essentially attaching more circuit to the model. Are we likely to see a speed increase from this model switching (since during some clock cycles the model will be simpler than in others), or do you expect the overhead of modifying the model to be a large expense?

We intended to profile this, but I thought I'd reach out to see if anyone has experience with this before we redesign our ALU and compare two versions of it.

hneemann commented 6 years ago

Adding switches (FETs, relays) is a bad idea, because joining the nets when a switch is closed is an expensive operation, and it becomes even more expensive if there are a lot of switches. The cost does not scale well with the number of switches (it depends on the circuit, but it may be O(n²)). But the good news is: the model does not become slower if a lot of components are used. The model only becomes slower if a lot of components detect a change in one of their input values. So if you want to improve the speed of your ALU, it's more efficient to fix the inputs to zero (or any other value that doesn't change) and modify the inputs only when the output is really needed.

But keep in mind: the signal that is connected to a lot of components is the clock signal. And extreme care must be taken if the clock signal is delayed by any kind of logic. See this issue as an example. This makes it difficult to turn off the clock signal in parts of the circuit.
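The point about changed inputs can be illustrated with a minimal sketch of an event-driven evaluator in Python. This is not Digital's actual implementation; the `Gate` class, `simulate` and `build_chain` are hypothetical names invented for this example. The sketch only counts how many gate evaluations an input change triggers.

```python
from collections import deque

class Gate:
    """A primitive gate: output net `out` is func(*values of input nets)."""
    def __init__(self, func, inputs, out):
        self.func, self.inputs, self.out = func, inputs, out

def simulate(nets, gates, changes):
    """Apply input changes; return the number of gate evaluations needed."""
    # Map each net name to the gates that read it.
    readers = {}
    for g in gates:
        for n in g.inputs:
            readers.setdefault(n, []).append(g)
    queue = deque()
    for name, value in changes.items():
        if nets[name] != value:            # an unchanged input schedules nothing
            nets[name] = value
            queue.extend(readers.get(name, []))
    evaluations = 0
    while queue:
        g = queue.popleft()
        evaluations += 1
        new = g.func(*(nets[n] for n in g.inputs))
        if nets[g.out] != new:             # only real changes propagate further
            nets[g.out] = new
            queue.extend(readers.get(g.out, []))
    return evaluations

def build_chain(length):
    """Hypothetical test circuit: a chain of AND gates, n0 -> n1 -> ... """
    nets = {"one": 1}
    nets.update({f"n{i}": 0 for i in range(length + 1)})
    gates = [Gate(lambda a, b: a & b, (f"n{i}", "one"), f"n{i + 1}")
             for i in range(length)]
    return nets, gates

nets, gates = build_chain(1000)
print(simulate(nets, gates, {"n0": 0}))   # input held constant: 0 evaluations
print(simulate(nets, gates, {"n0": 1}))   # input changed: 1000 evaluations
```

In this sketch the 1000 gates cost nothing as long as `n0` keeps its value, which matches the advice to hold the ALU inputs constant until the result is needed.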

nevercast commented 6 years ago

Ah. Thanks for the extra detail around clock delay. After reading that recomputing the model was expensive, I immediately assumed that gating the clock would be a good alternative, but it seems that is not the case.

Does every component have the same delay? Is there a concept of delay in clock time, or are the results just processed in the next "time step"? I would assume the wire states propagate infinitely until they are either stable or an oscillation is detected. Does the clock wait for the circuit to be stable, or will it trigger on a timer?

hneemann commented 6 years ago

> Ah. Thanks for the extra detail around clock delay. After reading that recomputing the model was expensive, I immediately assumed that gating the clock would be a good alternative, but it seems that is not the case.

You can gate the clock with an AND gate. I do not expect any problems, because an AND gate adds only a single gate delay, and this should work in almost all practical cases. But turning the clock on and off may be difficult, because this may cause much larger delays. So the difficult scenario is: the clock goes high and the clock gate blocks the clock signal; then, some gate delays later, the clock gate lets the clock signal through with a much larger delay. Meanwhile, some signals have already changed, although you expect them to be unchanged at the rising edge of the clock. Or the clock goes high while the AND gate is open; then, a few gate delays later, the AND gate blocks, which causes a very short clock pulse where you expect no clock pulse at all. Depending on the logic that controls the clock gate, it is also possible to see hazards on the gated clock line, which can also cause serious issues. Gating the clock can be done, but it's more complex than it seems at first.
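The short-pulse scenario can be reproduced with a small unit-delay sketch in Python. The circuit is an assumption chosen for illustration: the enable is simply NOT(clk), arriving `enable_delay` gate delays late, and `gated_clock_trace` is a hypothetical helper, not part of Digital.

```python
def gated_clock_trace(clk_wave, enable_delay):
    """Unit-delay model of gated = AND(clk, en).

    `en` is NOT(clk) passed through `enable_delay` stages (must be >= 1),
    so the "block the clock" decision reaches the AND gate late.
    """
    stages = [1] * enable_delay      # clk starts low, so NOT(clk) is 1 everywhere
    trace = []
    for clk in clk_wave:
        # Every gate computes its new output from the values of the previous step.
        gated = clk & stages[-1]
        stages = [1 - clk] + stages[:-1]
        trace.append(gated)
    return trace

clk = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
print(gated_clock_trace(clk, 3))     # [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
```

Although the enable is meant to block the clock entirely, the gated line carries a pulse that is `enable_delay` steps wide, which is enough to trigger an edge-sensitive flip-flop.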

> Does every component have the same delay? Is there a concept of delay in clock time, or are the results just processed in the next "time step"?

Every component has the same gate delay. The results are just processed in the next time step.

> I would assume the wire states propagate infinitely until they are either stable or an oscillation is detected.

You are right: The wires and also the splitters don't have any delay.

> Does the clock wait for the circuit to be stable or will it trigger on a timer?

The clock waits for the circuit to be stable. With the simple clock element you can't "overclock" your circuit. If you want to build a circuit where the clock changes before the circuit has stabilized, you have to use the asynchronous timing mode. As an example, see the muller-pipeline included in the examples folder (examples/sequential/async/muller-pipeline.dig).

The simulator should work as fast as possible. So I've decided to add just enough complexity to point out hazards or to build a differentiator to generate clock pulses from clock edges. Adding more complexity would make the simulator slower without adding functions that I need in my lectures.
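The timing rules described in this reply (one identical gate delay per component, and a clock that waits for the circuit to stabilize) can be sketched roughly as follows in Python. The `settle` function, the `max_steps` limit and the oscillation check are assumptions made for this illustration; Digital's real algorithm may differ in detail, and for simplicity the sketch re-evaluates every gate in each step.

```python
def NOT(v):
    return 1 - v

def settle(state, gates, max_steps=1000):
    """Run unit-delay steps until no net changes; return the steps needed.

    `gates` maps an output net to (function, list of input nets).
    """
    for step in range(max_steps):
        # All gates read the values of the previous step (same delay for all).
        updates = {out: func(*(state[n] for n in ins))
                   for out, (func, ins) in gates.items()}
        if all(state[out] == value for out, value in updates.items()):
            return step              # stable: only now may the clock change
        state.update(updates)
    raise RuntimeError("circuit did not stabilize (possible oscillation)")

# Two inverters in series; input a has just changed from 0 to 1.
gates = {"b": (NOT, ["a"]), "c": (NOT, ["b"])}
state = {"a": 1, "b": 1, "c": 0}
print(settle(state, gates))          # 2 gate delays until stable
```

A ring of a single inverter (`{"x": (NOT, ["x"])}`) never reaches a stable state, so this sketch gives up with an error after `max_steps` steps.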

nevercast commented 6 years ago

Thank you very much for these replies. I think I have a good mental model of how the simulator works now, and of its constraints and features. One last thing I wish to clarify: bidirectional busses can be achieved between subcircuits because you say the model is flat, meaning the wire nets between subcircuits are essentially joined in one circuit at runtime. However, I should not use FETs for the switching because these will recompute the model, so I presume the Driver components under the Wire category are the correct way to achieve fast bidirectional transfer (for shared peripheral busses)?

hneemann commented 6 years ago

> One last thing I wish to clarify: bidirectional busses can be achieved between subcircuits because you say the model is flat, meaning the wire nets between subcircuits are essentially joined in one circuit at runtime.

Yes, the model is flat. Even if an embedded circuit is used several times, the final model contains all of its components several times. This is possible because adding a lot of components does not slow down the model execution.
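What "flat" means can be sketched in a few lines of Python. The data layout (a circuit as a list of `(name, kind)` pairs) and the `flatten` function are hypothetical and ignore wiring entirely; they only show how every use of an embedded circuit contributes its own copy of the components.

```python
def flatten(circuit, library, prefix=""):
    """Expand nested subcircuit instances into one flat list of primitives."""
    flat = []
    for name, kind in circuit:
        if kind in library:          # instance of an embedded circuit: recurse
            flat += flatten(library[kind], library, prefix + name + ".")
        else:                        # primitive component: keep it as it is
            flat.append((prefix + name, kind))
    return flat

library = {
    "half_adder": [("x", "XOR"), ("a", "AND")],
    "full_adder": [("ha1", "half_adder"), ("ha2", "half_adder"), ("o", "OR")],
}
top = [("fa0", "full_adder"), ("fa1", "full_adder")]

flat = flatten(top, library)
print(len(flat))                     # 10 primitive components
print(flat[0])                       # ('fa0.ha1.x', 'XOR')
```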

> However, I should not use FETs for the switching because these will recompute the model, so I presume the Driver components under the Wire category are the correct way to achieve fast bidirectional transfer (for shared peripheral busses)?

You are right. The fastest way to put data on a bus is a driver or a muxer. If you are planning to run your processor on an FPGA, you should go for the muxer, because there is no Verilog/VHDL export for drivers. If you don't want to run the processor on an FPGA, drivers are fine.

nevercast commented 6 years ago

Thank you, you've been exceptionally helpful. I think this is almost everything. One thing I am able to achieve with Drivers is a floating output; presumably this is a High-Z state. Can this be achieved with muxers? I see I cannot leave an input unattached on a mux. Can you suggest a solution for leaving many wires of a multi-bit bus in a High-Z state so that other devices can pull the lines, while still being FPGA "friendly"?

hneemann commented 6 years ago

That's an interesting question. I'm far from being an FPGA expert. To be honest, I'm more of an FPGA beginner. ;-) But from my understanding (and maybe I'm wrong), there is no such thing as a high-Z state between the CLBs of an FPGA. Drivers that create a high-Z state exist only in the IO blocks, which are used to drive the physical FPGA pins. Thus it's not possible to create a high-Z bus inside an FPGA. And that's not a problem, because you can always replace a high-Z bus structure with a muxer structure if you are aware of all components attached to the bus, which is of course the case when an HDL model is synthesized for an FPGA:

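[Image: fpga_highz (left: bus driven by drivers with enable signals; right: equivalent muxer structure with a priority encoder)]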

And also note the following: the circuit on the left can easily be destroyed if two of the enable signals go high. In the circuit on the right that is impossible. And if you are familiar with this, you can create your enable signals in a more appropriate way and even omit the priority encoder.
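The difference between the two structures can be modeled with a short Python sketch. All function names are hypothetical, and two details are assumptions that the discussion does not specify: the priority encoder prefers the lowest-numbered enable, and the muxer outputs `idle` when no enable is active.

```python
def tristate_bus(values, enables):
    """Shared bus with tri-state drivers: None models high-Z."""
    active = [v for v, en in zip(values, enables) if en]
    if not active:
        return None                  # nobody drives the bus: high-Z
    if len(active) > 1:
        # In real hardware two enabled drivers can short against each other.
        raise ValueError("bus conflict: more than one driver enabled")
    return active[0]

def priority_encode(enables):
    """Index of the lowest-numbered active enable, or None if none is active."""
    for index, en in enumerate(enables):
        if en:
            return index
    return None

def mux_bus(values, enables, idle=0):
    """Muxer replacement: the bus always carries a defined value."""
    select = priority_encode(enables)
    return idle if select is None else values[select]

values = [5, 9, 3]
print(tristate_bus(values, [0, 1, 0]))   # 9
print(mux_bus(values, [0, 1, 0]))        # 9
print(mux_bus(values, [1, 1, 0]))        # 5, no conflict possible
```

With `[1, 1, 0]` the tri-state model raises an error, while the muxer version silently picks one source. If the enable signals are guaranteed to be one-hot, the priority encoder is not needed.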

If someone knows better: Please leave a comment!

nevercast commented 6 years ago

Thank you for the example. I too am an FPGA beginner, so I cannot answer the question myself. I've updated the issue title to hopefully catch someone's eye, and they may be able to provide an answer. I'll leave the issue open for the same reason, if that is okay with you.

I was initially designing my subcircuits to be muxer agnostic but still putting the enable lines inside the subcircuit (where they were enabling drivers). I see that if I make the enable lines the responsibility of some controller hardware with muxes, then the device need not have a high-Z state and can always output a strong value. I will redesign my subcircuits along the lines of your right-hand example.

Thanks for all your help! Back to CPU development I go. Josh.

hneemann commented 6 years ago

I looked at the publicly available reverse-engineered documentation of the Lattice iCE40 FPGAs (Project IceStorm). The wires which are used to connect the logic blocks can't be in a high-Z state after being enabled. To be more precise: the wires are driven by unidirectional tri-state buffers, and the enable signal of each buffer is controlled by a bit in the bit stream the device is programmed with. The enable signal can't be controlled by user-defined logic. Thus a high-Z bus is not possible inside an FPGA of the Lattice iCE40 series. And I expect that's the case with other FPGAs too. The tools that generate the bit stream can thus ensure at a very low level that only one of the drivers is enabled.

nevercast commented 6 years ago

Thanks for looking into this.

mixotricha commented 3 years ago

What I did to make a bidirectional port out of Verilog code emulating a Z80 ended up looking like this:

[Image: whole_example]

I thought a complete working example might be more instructive than just a small piece. This shows how I did the data in and data out to the data bus on the Z80, and also the address lines. Perhaps the question following on from this is what that might look like in actual TTL on the board around an FPGA.

hneemann commented 3 years ago

@mixotricha If you want to connect a circuit inside an FPGA with the outside world using a High-Z bus, this can be modeled in Digital. There is a Pin Control component in the misc menu that allows this. In HDL generation, this component is implemented by an inout port.

mixotricha commented 3 years ago

In my circuit I did that too. I used the pin control in combination with a bidirectional splitter. This is outside of the Verilog code that is loaded to simulate the Z80. This setup works really well. But then I was wondering about actually implementing it in real hardware. I tried using some latches to emulate what the combination with the bidirectional splitter does, but did not have much success. In my example I would be putting the Verilog code that simulates the Z80 inside an FPGA and then building external hardware to manage the bus around that. But perhaps a better way would be to build the bus management into the Verilog itself? That is, to move the logic I currently have that deals with the bidirectional splitting into the FPGA. I also hope somebody else may come across my example and find it useful for getting this sort of simulation done. I'd be happy to bundle it up and contribute it to examples somewhere.

hneemann / Digital

Optimizing circuits for speed (and High-Z state on FPGAs) #184