phanrahan / magma

magma circuits
Other
242 stars 23 forks source link

Magma/CoreIR Simulator Not Propagating Value Through Wire #287

Closed David-Durst closed 5 years ago

David-Durst commented 5 years ago

I have a circuit with an output of one instance wired to the input of another instance. However, when the output is true, the input is false. The code for printing the definition containing the two instances is: https://github.com/David-Durst/aetherling/blob/ed4a632ef935dc79fe991677300a8e092ed9df4b/tests/test_native_2d_line_buffer.py#L386 and for checking their values is : https://github.com/David-Durst/aetherling/blob/ed4a632ef935dc79fe991677300a8e092ed9df4b/tests/test_native_2d_line_buffer.py#L390-L391 In this code, I have an instance first_sipo. (a SIPO is a serial in parallel out shift register.) I get the definition containing first_sipo using the line first_sipo.defn. This definition has multiple child instances, including inst0 being the SIPO and inst1 being a row buffer that I also call a DelayedBuffer. On lines 390 and 391, I'm printing the output of the SIPO and the input of the DelayedBuffer.

Below is the output of the code. It first shows that the output of the SIPO is connected to the input of the delayed buffer with the line wire(inst0.O[2], inst1.I[0]). It next shows that the output of the SIPO is True but the input to the DelayedBuffer is False. first sipo output 6: True and first rb input 6: False. I think this is a bug. Am I doing something wrong? Is this a bug?

shift_registers_for_shift_registers_for_AnyDimensionalLineBuffer_Bittype__1_1_pxPerClock__3_3_window__6_8_img__1_1_stride__0_0_origin_with__1_1__pxPerClock__6_8__imgSizes__3_3__neededCoordinates_True_lastInDimension_with__1__pxPerClock__8__imgSizes__3__neededCoordinates_False_lastInDimension = DefineCircuit("shift_registers_for_shift_registers_for_AnyDimensionalLineBuffer_Bittype__1_1_pxPerClock__3_3_window__6_8_img__1_1_stride__0_0_origin_with__1_1__pxPerClock__6_8__imgSizes__3_3__neededCoordinates_True_lastInDimension_with__1__pxPerClock__8__imgSizes__3__neededCoordinates_False_lastInDimension", "I", In(Bit), "O", Array(3,Out(Bit)), "next", Out(Bit), "CLK", In(Clock), "CE", In(Enable))
inst0 = SIPO_Bitt_3n_0init_TrueCE_RESET()
inst1 = DelayedBuffer_Bitt_5n_1k_5emittingPeriod_5initialDelay()
inst2 = Term_Bitt()
wire(shift_registers_for_shift_registers_for_AnyDimensionalLineBuffer_Bittype__1_1_pxPerClock__3_3_window__6_8_img__1_1_stride__0_0_origin_with__1_1__pxPerClock__6_8__imgSizes__3_3__neededCoordinates_True_lastInDimension_with__1__pxPerClock__8__imgSizes__3__neededCoordinates_False_lastInDimension.I, inst0.I)
wire(shift_registers_for_shift_registers_for_AnyDimensionalLineBuffer_Bittype__1_1_pxPerClock__3_3_window__6_8_img__1_1_stride__0_0_origin_with__1_1__pxPerClock__6_8__imgSizes__3_3__neededCoordinates_True_lastInDimension_with__1__pxPerClock__8__imgSizes__3__neededCoordinates_False_lastInDimension.CLK, inst0.CLK)
wire(shift_registers_for_shift_registers_for_AnyDimensionalLineBuffer_Bittype__1_1_pxPerClock__3_3_window__6_8_img__1_1_stride__0_0_origin_with__1_1__pxPerClock__6_8__imgSizes__3_3__neededCoordinates_True_lastInDimension_with__1__pxPerClock__8__imgSizes__3__neededCoordinates_False_lastInDimension.CE, inst0.CE)
wire(inst0.O[2], inst1.I[0])
wire(shift_registers_for_shift_registers_for_AnyDimensionalLineBuffer_Bittype__1_1_pxPerClock__3_3_window__6_8_img__1_1_stride__0_0_origin_with__1_1__pxPerClock__6_8__imgSizes__3_3__neededCoordinates_True_lastInDimension_with__1__pxPerClock__8__imgSizes__3__neededCoordinates_False_lastInDimension.CE, inst1.WE)
wire(shift_registers_for_shift_registers_for_AnyDimensionalLineBuffer_Bittype__1_1_pxPerClock__3_3_window__6_8_img__1_1_stride__0_0_origin_with__1_1__pxPerClock__6_8__imgSizes__3_3__neededCoordinates_True_lastInDimension_with__1__pxPerClock__8__imgSizes__3__neededCoordinates_False_lastInDimension.CLK, inst1.CLK)
wire(shift_registers_for_shift_registers_for_AnyDimensionalLineBuffer_Bittype__1_1_pxPerClock__3_3_window__6_8_img__1_1_stride__0_0_origin_with__1_1__pxPerClock__6_8__imgSizes__3_3__neededCoordinates_True_lastInDimension_with__1__pxPerClock__8__imgSizes__3__neededCoordinates_False_lastInDimension.CE, inst1.CE)
wire(inst1.valid, inst2.I)
wire(inst0.O, shift_registers_for_shift_registers_for_AnyDimensionalLineBuffer_Bittype__1_1_pxPerClock__3_3_window__6_8_img__1_1_stride__0_0_origin_with__1_1__pxPerClock__6_8__imgSizes__3_3__neededCoordinates_True_lastInDimension_with__1__pxPerClock__8__imgSizes__3__neededCoordinates_False_lastInDimension.O)
wire(inst1.O[0], shift_registers_for_shift_registers_for_AnyDimensionalLineBuffer_Bittype__1_1_pxPerClock__3_3_window__6_8_img__1_1_stride__0_0_origin_with__1_1__pxPerClock__6_8__imgSizes__3_3__neededCoordinates_True_lastInDimension_with__1__pxPerClock__8__imgSizes__3__neededCoordinates_False_lastInDimension.next)
EndCircuit()
step 6: False
input 6: [False]
first sipo input 6: False
first sipo output 6: True
first rb input 6: False
first rb output 6: False
lb output 6: [[[[False, False, False], [False, False, False], [True, False, False]]]]
lb valid 6: False
output 6: [[[False, False, False], [False, False, False], [True, False, False]]]
leonardt commented 5 years ago

Can you give me some more info to reproduce:

upon inspection of the printed magma circuit copied above, it does seem that wire(inst0.O[2], inst1.I[0]). We could check the result of compiling the circuit coreir to make sure the compiled coreir also shows that. Then that would suggest the issue lies somewhere in the simulator frontend or backend (that is, it would rule out an issue in the compilation process).

dillonhuff commented 5 years ago

I'm not that familiar with aetherling or the magma API for reading internal values so I can't make heads or tails of the code you linked to.

A few different things could cause a mismatch between the values of two connected ports:

A few questions to narrow things down:

Is the output of the whole module that you are testing correct?

Is first SIPO output 6 supposed to be True? I understand that the problem you are reporting is that first SIPO output 6 and first rb input 6 are inconsistent, but it would be useful to know what value they are supposed to be in this test case.

Have you unit tested the SIPO and DelayedBuffer modules in isolation?

David-Durst commented 5 years ago

@leonardt - I'm using the master branches for Mantle, Magma, and CoreIR. I'm using LineBufferNative for Aetherling. To run the test, run the following command in the top folder of the Aetherling repository:

pytest -s tests/test_native_2d_line_buffer.py::test_2D_bit_line_buffer_1_1_3_3_6_8_1_1_0_0

@dillonhuff :

The output of the whole module is incorrect. It's failing the test.

I previously passed all 21 tests in the file using the same SIPO implementation, so I don't think that's the problem.

DelayedBuffer passes the unit tests for it at https://github.com/David-Durst/aetherling/blob/LineBufferNative/tests/test_delayed_buffer.py. Also, I have used a similar version of DelayedBuffer to pass all 21 tests in the past. However, I have changed DelayedBuffer since there, so it is possible that the problem is there. However, it's unlikely given the combination of the debugging output and the fact that the current implementation of DelayedBuffer works for a number of other tests in the linked file (https://github.com/David-Durst/aetherling/blob/LineBufferNative/tests/test_native_2d_line_buffer.py).

David-Durst commented 5 years ago

@dillonhuff : both the output of SIPO and the input to the DelayedBuffer are supposed to be true on step 6. To implement a line buffer, you need a series of registers connected to a rowbuffer to store a row of the images. The SIPO is the series of registers and its trying to hand off the True value to the rowbuffer.

leonardt commented 5 years ago

Wild guess: I've seen a similar issue when connecting stateful elements in the Python simulator, where the order of evaluation was incorrect because the topological sorting was incorrect. @dillonhuff I have nothing backing this besides anecdotal evidence, but is there a chance that there could be some issue in topologically sorting this circuit? Is there an easy way of seeing how the simulator has sorted the flattened circuit that we could inspect?

leonardt commented 5 years ago

Okay, with the following changes I managed to dump the json file

❯ git diff
diff --git a/tests/test_native_2d_line_buffer.py b/tests/test_native_2d_line_buffer.py
index 6989179..cba848c 100644
--- a/tests/test_native_2d_line_buffer.py
+++ b/tests/test_native_2d_line_buffer.py
@@ -309,6 +309,8 @@ and run it on test data sets."""
     cirb = CoreIRBackend(c)
     scope = Scope()
     LineBufferDef = DefineTwoDimensionalLineBuffer(cirb, **parameters.as_kwargs())
+    cirb.compile(LineBufferDef)
+    cirb.modules[LineBufferDef.name].save_to_file("tmp.json")
     window_count, parallelism, valid_count = parameters.internal_parameters()
     window_x = parameters.window_x
     window_y = parameters.window_y

tmp.json.txt

On line 800, I see the connection

          ["inst1.I.0","inst0.O.2"],

within the definition of shift_registers_for_shift_registers_for_AnyDimensionalLineBuffer_Bittype__1_1_pxPerClock__3_3_window__6_8_img__1_1_stride__0_0_origin_with__1_1__pxPerClock__6_8__imgSizes__3_3__neededCoordinates_True_lastInDimension_with__1__pxPerClock__8__imgSizes__3__neededCoordinates_False_lastInDimension so I think it's compiling fine.

leonardt commented 5 years ago

Okay, so I'm looking at the Python test bench code, and it seems like it just sets the clock enable, then does a bunch of evaluate/advance loops while monitoring the simulator values.

In the case of the above mentioned issue with the inconsistent values. This comes after a sim.evaluate() call, so I would expect the that the values have been propagated (in terms of the simulator semantics) so if I call get_value on two connected ports, I should get the same value. In this case, we don't.

So, I suspect it must be an issue either in the magma API to get the simulator values or in the simulator

David-Durst commented 5 years ago

Just in case this was a propagation issue, I also tried adding a sim.evaluate() between sim.get_value(first_sipo.defn._instances[1].I[0], delayed_buffer_of_first_shift_register) and sim.get_value(first_sipo.defn._instances[1].CE, delayed_buffer_of_first_shift_register). That didn't help. The simulator has already tried propagating and failed.

leonardt commented 5 years ago

Actually, a third possibility is that the call to get_value is wrong (maybe the scope is wrong?) Not sure what the best way to check this is.

David-Durst commented 5 years ago

If the scope is wrong, the simulator seg faults. Thus, I know the scope is right. get_value may be wrong, but the scope being handed to it is correct.

leonardt commented 5 years ago

To see if I could find anything funky the magma API, I made the following change to print out the coreir select path it creates for the get_value call

diff --git a/magma/simulator/coreir_simulator.py b/magma/simulator/coreir_simulator.py
index c796d14..7725e2c 100644
--- a/magma/simulator/coreir_simulator.py
+++ b/magma/simulator/coreir_simulator.py
@@ -194,6 +194,7 @@ class CoreIRSimulator(CircuitSimulator):
             return r
         else:
             insts, ports = convert_to_coreir_path(bit, scope)
+            print(insts, ports)
             bools = self.simulator_state.get_value(insts, ports)
             if len(bools) == 1:

Then for the tests I see:

['inst0', 'inst0', 'inst0', 'inst0', 'inst0', 'inst0'] ['O.2']
first sipo output 6: True

and

['inst0', 'inst0', 'inst0', 'inst0', 'inst1', 'inst1'] ['I.0']
first rb input 6: False

Do those select paths seem reasonable? It seems like they're descending into different scopes?

leonardt commented 5 years ago

Actually that may be the flattened select path, so it could be okay if they're descending into the register primitives contained within the respective modules

David-Durst commented 5 years ago

The paths seem suspect. Looking into it. What are those paths relative to? The defintiion passed to the simulator?

leonardt commented 5 years ago

I believe so

leonardt commented 5 years ago

This is probably the code of interest: https://github.com/phanrahan/magma/blob/877eb1fe33cf87548b141afb02ba06ffd1c62163/magma/simulator/coreir_simulator.py#L26

David-Durst commented 5 years ago

I tried running another test with more Trues. Sometimes the input to the DelayedBuffer is True, but it seems independent of the output of the SIPO. Only one thing can be wired to the DelayedBuffer, right?

leonardt commented 5 years ago

Only one driver (output) can be wired to an input. You could inspect the output coreir or magma repr to verify this, but both magma and coreir enforce this type check.

David-Durst commented 5 years ago

The paths in comment https://github.com/phanrahan/magma/issues/287#issuecomment-421516058 look wrong. This linebuffer is 3x3 window, so it has 3 rows. The SIPO with path 'inst0', 'inst0', 'inst0', 'inst0', 'inst0', 'inst0'] ['O.2'] is for the first row and the DelayedBUffer with path ['inst0', 'inst0', 'inst0', 'inst0', 'inst1', 'inst1'] ['I.0'] is for the second row.

leonardt commented 5 years ago

I'm not sure I understand. The paths look wrong, what should they look like?

David-Durst commented 5 years ago

They should be for SIPO

['inst0', 'inst0', 'inst0', 'inst0', 'inst0', 'inst0'] ['O.2']

and for DelayedBuffer

['inst0', 'inst0', 'inst0', 'inst0', 'inst0', 'inst1'] ['I.0']

Note that the second to last value for the DelayedBuffer should be inst0 and not inst1. This entry in the path selects which row of the Linebuffer. The inst0 is row 0, inst1 is row 1.

David-Durst commented 5 years ago

Reflecting on it more, the paths look incomplete rather than wrong. It looks like they are both missing an extra inst0 at the start.

This might be because I skip one level of scopes in https://github.com/David-Durst/aetherling/blob/LineBufferNative/tests/test_native_2d_line_buffer.py#L371. The code at https://github.com/phanrahan/magma/blob/877eb1fe33cf87548b141afb02ba06ffd1c62163/magma/simulator/coreir_simulator.py#L45 walks the scope parent list and it appears that last parent is missing because I skip it. That would lead to missing an inst0 at the start of the string.

The parent of test2_scope should be test1_scope. However, when I do that, the simulator segfaults at the first use of the nested scopes (https://github.com/David-Durst/aetherling/blob/LineBufferNative/tests/test_native_2d_line_buffer.py#L389) with the following message:

Assertion failed: (isNumber(lastSelStr)), function getValueByOriginalName, file interpret.cpp, line 1583.
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)

I only get this far because I skip one layer of parents. Is that a bug?

leonardt commented 5 years ago

I think this is what you're referring to

❯ git diff
diff --git a/tests/test_native_2d_line_buffer.py b/tests/test_native_2d_line_buffer.py
index 6989179..b605dfe 100644
--- a/tests/test_native_2d_line_buffer.py
+++ b/tests/test_native_2d_line_buffer.py
@@ -368,7 +368,7 @@ and run it on test data sets."""
             sim.evaluate()
             quijibo += 1
             test1_scope = Scope(instance=LineBufferDef._instances[0], parent=scope)
-            test2_scope = Scope(instance=LineBufferDef._instances[0]._instances[0], paren
t=scope)
+            test2_scope = Scope(instance=LineBufferDef._instances[0]._instances[0], paren
t=test1_scope)
             test3_scope = Scope(instance=LineBufferDef._instances[0]._instances[0]._insta

I think the missing inst0 could be the problem. I wonder what select path the incomplete one actually refers to?

This assertion looks like it's assuming that the last select string is a number.

Examining the code around the line reported by interpret.cpp https://github.com/rdaly525/coreir/blob/e2cd7be195d30eb3a1599e77b93f4abefc768c50/src/simulator/interpret.cpp#L1583

it seems like it's in a case where it's trying to select off an array and assuming that the last select string is a number (which I'm guessing refers to selecting a member of the array).

leonardt commented 5 years ago

Okay, this change reveals

❯ git diff
diff --git a/src/simulator/interpret.cpp b/src/simulator/interpret.cpp
index d7d2d3a7..2c8f770f 100644
--- a/src/simulator/interpret.cpp
+++ b/src/simulator/interpret.cpp
@@ -1580,6 +1580,7 @@ namespace CoreIR {
         string lastSelStr = sPath.back();
+        cout << sp << endl;
         assert(isNumber(lastSelStr));
         //cout << sp << endl;

the following output

input 0: [False]
['inst0', 'inst0', 'inst0', 'inst0', 'inst0', 'inst0', 'inst0'] ['I']
inst0$inst0$inst0$inst0$inst0$inst0$inst0.in
Assertion failed: (isNumber(lastSelStr)), function getValueByOriginalName, file interpret.
cpp, line 1584.
zsh: abort      pytest -s
leonardt commented 5 years ago

From the comment above that code

    // Case 3: The value does not have a symbol table entry. 3 subcases:
    //      1. The value is invalid
    //      2. Need to traverse up the type hierarchy
    //      3. Need to traverse down the type hierarchy
David-Durst commented 5 years ago

You are correct in the code you think I'm referring to in https://github.com/phanrahan/magma/issues/287#issuecomment-421520873.

Does this mean the symbol table is missing something its supposed to have?

leonardt commented 5 years ago

I made this change

diff --git a/src/simulator/interpret.cpp b/src/simulator/interpret.cpp
index d7d2d3a7..23e519c8 100644
--- a/src/simulator/interpret.cpp
+++ b/src/simulator/interpret.cpp
@@ -1559,6 +1559,12 @@ namespace CoreIR {
       //cout << "Entry name = " << entName << endl;
       return getValueByOriginalName(entName);
     }
+    cout << "NAME: " << name << endl;
+    cout << "BEGIN SYMBOL TABLE" << endl;
+    for (auto& ent : symTable) {
+        cout << ent.first << " : " << (ent.first == name) << endl;
+    }
+    cout << "END SYMBOL TABLE" << endl;

And before the assertion failure, I get the output

NAME: inst0$inst0$inst0$inst0$inst0$inst0$inst0
BEGIN SYMBOL TABLE
inst0$inst0$inst0$inst0$inst0$inst0$inst0.in : 0
inst0$inst0$inst0$inst0$inst0$inst0$inst0.out : 0
...
leonardt commented 5 years ago

Actually it looks like it's trying to select NAME: inst0$inst0$inst0$inst0$inst0$inst0$inst0, which isn't a port

David-Durst commented 5 years ago

Wait, which line of the python code is trying to do this. Shouldn't

test2_scope = Scope(instance=LineBufferDef._instances[0]._instances[0], parent=test1_scope)

Have a much shorter symbol?

leonardt commented 5 years ago

Before the assertion I get input 0: [False] in the output, so I believe it should be this line: https://github.com/David-Durst/aetherling/blob/ed4a632ef935dc79fe991677300a8e092ed9df4b/tests/test_native_2d_line_buffer.py#L389

leonardt commented 5 years ago

I see an issue here:

NAME: inst0$inst0$inst0$inst0$inst0$inst0$inst0.I
BEGIN SYMBOL TABLE
inst0$inst0$inst0$inst0$inst0$inst0$inst0.in : 0
inst0$inst0$inst0$inst0$inst0$inst0$inst0.out : 0

It's trying to select NAME: inst0$inst0$inst0$inst0$inst0$inst0$inst0.I but it's not in the table, however there is an there's a inst0$inst0$inst0$inst0$inst0$inst0$inst0.in, so could this be related to port remapping?

David-Durst commented 5 years ago

I see. I don't think the issue is with this get_value call in particular. Any get_value that uses a scope that descends from test1_scope fails. When I add the following line

@@ -368,11 +368,15 @@ and run it on test data sets."""
             sim.evaluate()
             quijibo += 1
             test1_scope = Scope(instance=LineBufferDef._instances[0], parent=scope)
-            test2_scope = Scope(instance=LineBufferDef._instances[0]._instances[0], parent=scope)
+            test2_scope = Scope(instance=LineBufferDef._instances[0]._instances[0], parent=test1_scope)
+            print("testing test2 {}".format(sim.get_value(LineBufferDef._instances[0]._instances[0].O, test2_scope)))
             test3_scope = Scope(instance=LineBufferDef._instances[0]._instances[0]._instances[0], parent=test2_scope)
             test4_scope = Scope(instance=LineBufferDef._instances[0]._instances[0]._instances[0]._instances[0], parent=test3_scope)
             test5_scope = Scope(instance=LineBufferDef._instances[0]._instances[0]._instances[0]._instances[0]._instances[0],

I get the following lookup that fails:

NAME: inst0$inst0$inst0.O_0_0_0
David-Durst commented 5 years ago

It could be the remapping, but the above issue is not a remapping one.

leonardt commented 5 years ago

yea that lookup looks odd, what are thos trailing 0_0_0s?

David-Durst commented 5 years ago

I believe they are indices in an array. That port is a quadruply nested array. Maybe something can only handle singly nested arrays, so its getting one component at a time?

leonardt commented 5 years ago

Oh, I forgot about flattening. It could be a flattening issue in the symbol table logic?

David-Durst commented 5 years ago

Potentially, I heard from @rdaly525 before that the symbol table has issues with flattening. Am I remembering correctly, @rdaly525 ?

rdaly525 commented 5 years ago

Symbol table was a hacked together thing that occurred in the space of about a week. It is very fragile and should be rethought and rebuilt.

rdaly525 commented 5 years ago

Specifically, if any pass other than flattening is run on it, it will likely not work.

David-Durst commented 5 years ago

Ok, so I fixed my bug and my tests now all pass. I had an off by 1 error in a deeply nested counter. However, creating all the nested ports to so that I could look at the internal state of a deeply nested counter took several hours. It would be a massive productivity boost if the symbol table worked and I was able to use the debugger to view nested ports. How much effort effort @rdaly525 do you think it would be to fix the symbol table.

leonardt commented 5 years ago

I think this issue has been resolved. @David-Durst you can open a separate issue for the symbol table on coreir if you want to raise it.

David-Durst commented 5 years ago

See https://github.com/rdaly525/coreir/issues/619