Support an Intel FPGA - Githubissues

Potential DSPs: a10, s10, NX

[x] Install Quartus lite on boba
[ ] Install Quartus full on boba
- [x] Apply for access via Intel researcher program
[ ] Look through Quartus, try to find simulation Verilog files for DSPs

3/20/23

Starting to work on this. First step is to install Quartus and try to find the simulation files in there.

I think the free version of Quartus is pretty outdated. I don't think it'll have the latest Intel FPGAs.

I also applied to get access to full Quartus online on Intel's website. Pretty janky, kinda worried my request will just go right into the void.

I'm digging through Quartus files. I've found some formal verification files: intelFPGA_lite/22.1std/quartus/eda/fv_lib/verilog/lpm_mult.v intelFPGA_lite/22.1std/quartus/eda/fv_lib/verilog/altmult_add.v intelFPGA_lite/22.1std/quartus/eda/fv_lib/verilog/altmult_accum.v intelFPGA_lite/22.1std/quartus/eda/fv_lib/verilog/mult_block.v

These correspond to different blocks, e.g. for altmult_add.v: https://www.intel.com/content/www/us/en/docs/programmable/683490/20-3/altmult-add-multiply-adder-ip-core.html

The potentially painful thing about these implementations is that they use generates. We can handle generates, I'm pretty sure, but I think we pretty much have to remove them by hand, which is a bit annoying. We really should build an automated tool for this.

3/21/23

I'm removing generates by hand on altmult_accum. altmult_add seems like it might be "beefier", so we probably want to handle that eventually (though I'm really only going off of line count of the file). For that one, we'll probably need an automated solution.

altmult_accum conversion:

[x] Remove all generates
[x] Convert strings to enum
[ ] Convert params to ports

Just finished removing all the generates. We could definitely write a script or tool to do this. Most of them were of the form

generate if (cond) begin
  <declare flipflop>
  assign <signalname> = <output of ff>;
else
  assign <signalname> = <other signal>;
end
endgenerate

The conversion is:

Move any module instance declaration (i.e. <declare flipflop> above) out of the generate
For any signal assigned in either branch, convert the assignment into a ternary assignment assign <signalname> = <cond> ? <true> : <false>;

One complication while converting params to ports: Some signal widths are determined by parameters. So we need to fix those input sizes, I guess. I'm unsure why they're variable -- the actual hardware unit does have max values for these. Perhaps we should just set them to their max possible value, if we can find that value?

TODO tomorrow: figure out how to set the width parameters. What is width_upper_data?

3/22/23

Some stuff has to remain parameters. Anything that might go into determining a signal width must remain a parameter. Anything that is used as the LHS of a replicate. Other things like that. This is a limitation of Verilog -- I don't think there's any reason we can't convert basically everything to bitvector semantics (including expressions that depend on parameters). Actually, variable-width signals will probably still be a problem.

Another roadblock: need to convert the dffep primitive into a module. It uses a table, which isn't supported by Yosys. I'll have to rewrite by hand and hope it's correct.

Hm. It may actually be better to attack the intelFPGA_lite/22.1std/quartus/eda/sim_lib/altera_mf.v file instead. It's written differently and Yosys may do better with it when trying to convert it to BTOR. By "better", I mean it doesn't seem to use any extra modules or anything.

Making some progress on this file (see uwsampl/lakeroad-private).

When I run:

read_verilog -sv altera_mf_modified.v
prep
proc
hierarchy -simcheck -check -top altmult_accum 
flatten 
clk2fflogic 
write_btor

After my modifications so far, I hit the familiar error:

ERROR: Found topological loop while processing cell $auto$clk2fflogic.cc:236:execute$1081. Active cells:
    $not$altera_mf_modified.v:2060$237
    $auto$clk2fflogic.cc:236:execute$1081
    $auto$clk2fflogic.cc:213:execute$1015
    $auto$clk2fflogic.cc:225:execute$1019v

I'm not sure what I've done in the past when I've hit this error. I may have hit this on some Lattice DSP examples and given up. I'm also pretty sure I've gotten past it before.

This error comes from here: https://github.com/YosysHQ/yosys/blob/23826e5152483c39a617ee1c696c3b48a0c2b756/backends/btor/btor.cc#L248

I think it means: while trying to write out the btor value for an expression, we end up requiring the btor expression itself, making an infinite loop.

If i comment out this line: https://github.com/uwsampl/lakeroad-private/blob/5b9f0233d35c2d80823208bc2c07b62afe35f8e7/altera_mf/altera_mf_modified.v#L2060 the error goes away. Great, progress, but why? And obviously I can't leave it this way. So how to fix this w/o changing the module functionally?

I think it must be because the line flag = ~flag appears twice in that same big always block. Perhaps separating them somehow will fix this? Or moving the changes to flag to separate always blocks with similar sensitivity lists?

It would be nice to know exactly what the topological loop is. I don't understand this well enough to know where it's coming from.

3/23/23

Digging into the topological loop thing today.

If I comment out these lines:

    always @ (posedge flag)
    begin        
        if (extra_accumulator_latency == 0)
        begin
            result   <= result_int[width_result - 1 + int_extra_width : int_extra_width];
            overflow <= overflow_int;
            accum_is_saturated_latent <= accum_saturate_overflow;
        end
        else
        begin
            result_pipe [head_result] <= {overflow_int, result_int[width_result - 1 + int_extra_width : int_extra_width]};
            //mult_is_saturated_pipe[head_result] = mult_is_saturated_wire;            
            accum_saturate_pipe[head_result] <= accum_saturate_overflow;
            head_result               <= (head_result +1) % (extra_accumulator_latency + 1);
            mult_is_saturated_int     <= mult_is_saturated_wire;  
        end

    end

    always @ (negedge flag)
    begin        
        if (extra_accumulator_latency == 0)
        begin
            result   <= result_int[width_result - 1 + int_extra_width : int_extra_width];
            overflow <= overflow_int;
            accum_is_saturated_latent <= accum_saturate_overflow;
        end
        else
        begin
            result_pipe [head_result] <= {overflow_int, result_int[width_result - 1 + int_extra_width : int_extra_width]};
            //mult_is_saturated_pipe[head_result] = mult_is_saturated_wire;            
            accum_saturate_pipe[head_result] <= accum_saturate_overflow;
            head_result               <= (head_result +1) % (extra_accumulator_latency + 1);
            mult_is_saturated_int     <= mult_is_saturated_wire;  
        end

    end

Then it also works. I'm wondering if it's the interaction between the always block that toggles flag, and the always block that is triggered on flag.

It could also be the two always blocks that are triggered on different flag events. What happens if I comment just one of them out?

If either of them is left uncommented, the error still occurs. So I assume it's not the interaction between the two flag-sensitive always blocks.

Easy test: fold the always blocks triggered on flag into the always block that toggles flag. That is, delete the always blocks that are triggered on flag, and move their logic into the always block, where it toggles flag.

Another always-worthy question is: is there a yosys pass we're missing here, that could help us?

I posted a Yosys question here: https://github.com/YosysHQ/yosys/discussions/3711

I tried folding the flag-triggered always blocks into the always block above them, and got the following error:

3.2.5. Executing PROC_ARST pass (detect async resets in processes).
Found async reset \output_wire_clr in `\altmult_accum.$proc$altera_mf_modified.v:2040$233'.
ERROR: Async reset \output_wire_clr yields non-constant value 5'mmmmm for signal \result.

So no luck on that.

A useful and fast reply on the Yosys question, though it doesn't help in this case: https://github.com/YosysHQ/yosys/discussions/3711#discussioncomment-5411199.

Interestingly, if I make the clr synchronous and not asynchronous, the problem also goes away:

diff --git a/altera_mf/altera_mf_modified.v b/altera_mf/altera_mf_modified.v
index 7310329..5de06ae 100644
--- a/altera_mf/altera_mf_modified.v
+++ b/altera_mf/altera_mf_modified.v
@@ -2037,7 +2037,7 @@ module altmult_accum (  dataa,
     // -------------------------------------------------------------
     // This is the main process block that performs the accumulation
     // -------------------------------------------------------------
-    always @(posedge output_wire_clk or posedge output_wire_clr)
+    always @(posedge output_wire_clk)
     begin
         if (output_wire_clr == 1)
         begin

I'm unsure whether this is a good solution or not. I mean, it's very obviously not in the sense that we're changing the functionality of the device. But also we're changing it "minorly" (in a sense). I'm actually not sure how minor the change is b/c I don't know how often the clear signal will be used, or how often we'll rely on the clear signal being asynchronous.

Somewhat related, but we may not need clk2fflogic; we may be able to get away with async2sync; dffunmap instead. Unclear which is better. The second option produces less BTOR output.

Here's my yosys invocation:

yosys -p "
read_verilog -sv altera_mf_modified.v
hierarchy -simcheck -check -top altmult_accum
prep
proc
flatten
async2sync
dffunmap
write_btor"

Next task is to replace the strings with enums, and then convert ports to parameters.

Replacing strings with enums is easy now! I replace the pattern "(\w+)" with PARAM_VALS_ENUM_$1 (in VSCode). This converts all instances of strings into a variant of an enum (which doesn't yet exist). Not all strings necessarily need to be converted, but in my experience, the vast majority do, so undoing by hand the ones that don't is generally not so tedious. Then, I run the following command:

grep -E "PARAM_VALS_ENUM_\w+" altera_mf_modified.v -o | sort | uniq

This gets the unique variants of the enum, which I can then copy and paste to declare the enum in the Verilog file.

Converting ports to parameters is harder. I do it methodically now. I generally will convert a batch at a time, and continually try converting the module to BTOR, as some parameters will inevitably make BTOR conversion (at least with Yosys) impossible. Some parameters that can't be converted to inputs:

Any parameter that determines the sizing of a signal.

Realization: conversion to BTOR can create a bunch of unnamed inputs. I think these are at least partially related to the use of Z values, but I'm also realizing it may also be for registers that don't have initial values.

Modification: Handling non-constant init values. The error reads: ERROR: Failed to get a constant init value for \sign_a_reg:. You need to make the init value constant. Does the init value matter? Can you guarantee that you'll always overwrite the init value with useful data?

See this modification:

@@ -876,17 +876,24 @@ module altmult_accum (  dataa,
         sload_upper_data_pipe_reg = 0;

         //sign_a_reg  = (signa ===1'bz)   ? ((representation_a == PARAM_VALS_ENUM_SIGNED) ? 1 : 0) : 0;
-        sign_a_reg  = ((representation_a == PARAM_VALS_ENUM_SIGNED) ? 1 : 0);
+        // Have to simplify this init value, now that representation_a is a port and not a parameter. Similar modifications follow.
+        // sign_a_reg  = ((representation_a == PARAM_VALS_ENUM_SIGNED) ? 1 : 0);
+        sign_a_reg  = 0;
         //sign_a_pipe_reg = (signa ===1'bz)   ? ((representation_a == PARAM_VALS_ENUM_SIGNED) ? 1 : 0) : 0;
-        sign_a_pipe_reg = ((representation_a == PARAM_VALS_ENUM_SIGNED) ? 1 : 0);
+        // sign_a_pipe_reg = ((representation_a == PARAM_VALS_ENUM_SIGNED) ? 1 : 0);
+        sign_a_pipe_reg = 0;
         //sign_b_reg  = (signb ===1'bz)   ? ((representation_b == PARAM_VALS_ENUM_SIGNED) ? 1 : 0) : 0;
-        sign_b_reg  = ((representation_b == PARAM_VALS_ENUM_SIGNED) ? 1 : 0);
+        // sign_b_reg  = ((representation_b == PARAM_VALS_ENUM_SIGNED) ? 1 : 0);
+        sign_b_reg  = 0;
         //sign_b_pipe_reg = (signb ===1'bz)   ? ((representation_b == PARAM_VALS_ENUM_SIGNED) ? 1 : 0) : 0;
-        sign_b_pipe_reg = ((representation_b == PARAM_VALS_ENUM_SIGNED) ? 1 : 0);
+        // sign_b_pipe_reg = ((representation_b == PARAM_VALS_ENUM_SIGNED) ? 1 : 0);
+        sign_b_pipe_reg = 0;
         //addsub_reg  = (addnsub ===1'bz) ? ((accum_direction == PARAM_VALS_ENUM_ADD)     ? 1 : 0) : 0;
-        addsub_reg  = ((accum_direction == PARAM_VALS_ENUM_ADD)     ? 1 : 0);
+        // addsub_reg  = ((accum_direction == PARAM_VALS_ENUM_ADD)     ? 1 : 0);
+        addsub_reg  = 0;
         //addsub_pipe_reg = (addnsub ===1'bz) ? ((accum_direction == PARAM_VALS_ENUM_ADD)     ? 1 : 0) : 0;
-        addsub_pipe_reg = ((accum_direction == PARAM_VALS_ENUM_ADD)     ? 1 : 0);
+        // addsub_pipe_reg = ((accum_direction == PARAM_VALS_ENUM_ADD)     ? 1 : 0);
+        addsub_pipe_reg = 0;

         result_int      = 0;
         result          = 0;

The thing to check was whether the init value of these signals mattered; that is, would they ever actually be read? The honest answer is that I'm not 100% sure, but I did check to make sure they're set somewhere, generally in an always block that seems like it'll always fire.

In this case, I think there's a weird circumstance where the init value could be read: this block is complicated, and has multiple clocks, and different registers can be optionally controlled by different clocks (interesting thing, not sure why). So if we do something weird where we control the signal with one clock, but we never run that clock, then its init value will be read. But if we hook up everything to the same clock, we should be good.

This is hard! There's a lot to keep track of. You can't just make all parameters into ports, because there's a lot of Verilog ops that can't handle non-constants in one of the arguments (e.g. replication).

3/24/23

I was wondering last night: am I on the right path? Is there some other, automated way to do this that would actually be feasible to implement in the time we have?

I have something importing now.

Two things I had to do:

Had to support setting parameters in the Verilog-to-Racket script, new argument --parameter.

Had to remove instances of Z values; at least one of them was getting translated to invalid Racket ((bv #f (bitvector 1))).

Next is to try out the Racket version. Can we write a test that uses them?

I'm fighting with getting a test working. I'm unsure how many times to run the clock. I see a bvmul expression in there, but it's nested in a bunch of ites that I guess are unsatisfiable.

I guess I can try hard-coding values.

Maybe part of the problem is the is_stratixXXX values? These are registers that aren't actually set. They shouldn't be problems, because they should be set to symbolic values and then the solver can set them as needed. But that may not actually happen...

Still struggling with this at the end of the day.

Plan for debugging the next time I log on:

Currently, the test runs the clock twice (or maybe three times) and then asserts that the output is equal to bvmul.

Figure out which parameters can be safely set to some concrete constant by setting them to constants and seeing if the bvmul expression we're trying to get to still ends up in the output.
Once we've filtered those out, either make symbolic constants for every single invocation of the module and use them on each iteration and then solve (if the number of symbolic constants is low) or do the same solving manually.

3/25/23

A bit disheartening: I tried some very open ended synthesis problems on the Racket model for altmult_accum, and i can't get anything interesting out of it. For example, I run it for two cycles, with symbolic inputs each cycle, and I ask Rosette to find any setting of the symbolics that makes the output nonzero...and it can't find anything!

To me, this is telling me that the imported model is wrong, meaning it was either wrong to begin with (which we could check with Verilator if we wanted) or (more likely) that my modifications broke it. I'm not 100% sure that's the case quite yet though.

One way forward would be to re-do the import and make very minimal changes, and see if the resulting model works.

Another thing to check would be to see whether the fact that we're setting the bitwidth explicitly is breaking things. We're setting it to 18bit inputs, 18bit output, when the default is (for some reason) 2bit inputs and 5 bit output (maybe 2+2+1 sign bit?).

3/26/2023

Taking a new strategy: start again from the start, make the minimum modifications necessary, see if we can even get a basic version working.

Replaced tri0, tri1, supply0 with logic.
Removed init block logic that checks for valid setup and exits if things aren't valid.
Changed flag from int to logic
Removed some references to Z
Changed always @ (posedge flag or negedge flag) to always @ (flag)

And it still doesn't work. It could be that there are problems with my testing code, or that I'm not running it for enough cycles. But it seems like the multiply is there in the resulting expression, and we just can't "get it out" because it's nested within a bunch of unsatisfiable if statements.

Something does seem wrong with my testing code, though, because I literally can't get the module to produce ANYTHING but 0.

I wonder if it's best now to go back to importing the other model?

Doing some digging. IF we run the Racket model twice with all symbolic inputs, we get an expression like:

 (ite (bveq (bv #b0 1) (bvor aclr3$24 aclr3$120))
      (ite (bveq (bv #b10 2) (concat clock0$126 clock0$30))
           (ite (bveq (bv #b0 1) ena0$43)
                (bv #b00000 5)
                (ite (bveq (bv #b0 1)
                           (ite (bveq (bv #b0 1)
                                      (bvor (ite (bveq (bv #b0 1) clock0$30) (bv #b1 1) (bv #b0 1))
                                            (ite (bveq (bv #b0 1) addnsub$25) (bv #b1 1) (bv #b0 1))))
                                (ite (bveq (bv #b0 1) aclr3$24) (bv #b1 1) (bv #b0 1))
                                (bv #b1 1)))
                     (bvneg (ite (bveq (bv #b0 1) ena0$43) unnamed-input-417$93 (bv #b00000 5)))
                     (ite (bveq (bv #b0 1) ena0$43) unnamed-input-417$93 (bv #b00000 5))))
           (bv #b00000 5))
      (bv #b00000 5))

If you stare at this, you can figure out that there's no way to not end up hitting one of the (bv #b00000 5) cases. It's because no matter what you set ena0$43 to, you end up hitting the 0 case. That seems okay...it seems here that we just need to run the model a few more times to get to an interesting spot. I'm not sure if there's anything interesting to be taken from this example, in that case.

One thing that could be happening here is that we just don't know what the right way to toggle the signals is. In the past, we just need to toggle clock up, toggle it down, up, down, etc, to step the design along. But here, it may be more complicated.

I'm thinking I should just go back and see how close I was to importing the other one.

Tried a few things. Tried importing the other model. Tried importing the other model with very limited modifications. Tried importing altmult_add with limited (basically zero) modifications. In all cases, hitting the same issue -- provably, the output is always zero.

Perhaps the next thing to do (if possible) is to stare at the model for altmult_accum or altmult_add and figure out when exactly it should produce nonzero output. Perhaps I should even try simulating with Verilator to confirm.

I'd really prefer to do this with the simpler simulation model!

Doing some debugging now. I'm basically shrinking the output that I'm trying to get the module to output and checking whether the resulting Racket can output the value I want. I've shrunk it down so the module just outputs the registered value for the input a, and I still can't get that signal to be anything other than zero. After calling the module twice, with all symbolic inputs, I get an expression:

(ite (bveq (bv #b0 1) (bvor aclr3$24 aclr3$120))
     (ite (bveq (bv #b10 2) (concat clock0$126 clock0$30))
          (ite (bveq (bv #b0 1) (bvand clock0$30 ena0$43)) (bv #b00 2) dataa$38)
          (bv #b00 2))
     (bv #b00 2))

You can see that this expression might evaluate to dataa, though if you actually stare at it, you'll see it actually can't!

We need the innermost ite to take its then branch, so we need 0 == clock0$30&&ena0$43 to be false, so we need both of those signals to be true.

We need the next innermost ite to take its if branch, so we need clock0$30 to be 0. Therein lies our contradiction. For some reason, we need the clock to be both 0 and 1. Something seems wrong about this, though.

Here is the corresponding Verilog (I think):

    always @(posedge input_a_wire_clk or posedge input_a_wire_clr)
    begin
        if (input_a_wire_clr == 1)
            mult_a_reg <= 0;
        else if ((input_a_wire_clk == 1) && (input_a_wire_en == 1))
        begin
            if (input_source_a == "DATAA")
                mult_a_reg <= (int_width_a == width_a) ? dataa : {dataa, {(diff_width_a){1'b0}}};
...

I feel like I'm close on this.

A possible fix: Separate the always block into two duplicates of itself. Remove the if statement.

If that worked, I'm not sure I would understand why it worked.

It seems like the clock value on this line:

        else if ((input_a_wire_clk == 1) && (input_a_wire_en == 1))

is somehow getting pulled "from the past". I.e. that should be clock0$126, but it's clock0$30.

It's as if the value is getting registered, and we're getting the registered value from the previous cycle.

The problem might be with the fact that I'm using clk2fflogic in Yosys. That pass may be doing more than what I want it to.

Okay, that seemed to help things. Trying on the full design...

Success! Was able to at least get it to produce a nonzero output.

To summarize what the problem was: The problem was actually in the Verilog to Racket Python script, where we were using clk2fflogic, which breaks the btor when there are multiple clocks. I'm not sure exactly why it breaks other than roughly that clk2fflogic combines a bunch of clocks into a single global clock.

Next things to do:

[ ] Go back and try converting ports to parameters, or just try importing the file where we already converted ports to parameters.
[ ] Figure out whether we actually need async2sync.

Summary of where we're at:

Imported a model of altmult_accum, which is the mult-accum DSP on some Intel architecture (the model actually spans multiple architectures, which is part of what made it hard to import)

What we need next:

Ideally, import other DSP primitives (altmult_add is a good place to start)

Main things I learned:

I was using a "bad" series of commands for btor conversion in yosys -- namely, clk2fflogic was doing more than what we wanted it to do. It was combining all clocks into one global clock, which created weird bugs. I fixed it by using...some other Yosys pass that wasn't clk2fflogic. dffunmap maybe?
We still need to convert all ports to parameters to be able to do btor conversion with Yosys, but that has a whole host of problems. One of the main ones is that any parameter that determines the width of a signal cannot be converted to a port.
generates also don't work well for us. We have to remove them. There is a straightforward procedure for doing this, though.
There may be multiple models for a given primitive on a given architecture, hidden within the sprawling directory tree of a tool like Quartus. It's probably best to try and find all of them (most surefire way is by grepping for module definitions, ie module module_you_are_looking_for). One may import more easily than another.
We still don't support importing tables in Verilog
One of the most common roadblocks to importing is topological loops. In our case, making a reset synchronous and not async was the key to fixing, but of course it also makes the model slightly incorrect. It would be great to deeply understand these loops and come up with a general solution.

Whoops, reopening until we merge #211

I'm coming back around to this. I'm looking at this block of code that I looked at before:

    always @(posedge input_a_wire_clk or posedge input_a_wire_clr)
    begin
        if (input_a_wire_clr == 1)
            mult_a_reg <= 0;
        else if ((input_a_wire_clk == 1) && (input_a_wire_en == 1))

It seems like yosys is interpreting the read of input_a_wire_clk on the last line as "coming from the past". But also, that if statement doesn't make much sense anyway -- of course the clock is 1, this is within a posedge triggered block, and we've already checked that clr==0! So can we just remove that part of the and?

That works! We have a much more correct model of the DSP now.

gussmith23 / lakeroad

Support an Intel FPGA #170