Performance limitations when accessing packed arrays and multidimensional unpacked arrays #849

Open wmlye opened 1 year ago

wmlye commented 1 year ago

Continuation from #840

In #840 I mentioned that there is a performance limitation when dealing with packed arrays that we could potentially improve with a copy-on-write mechanism. Looking at the .vvp output from this testcase:

module test(input logic clk);

logic [1023:0][31:0] mem1;   // packed: 1K x 32-bit words in one 32768-bit vector
logic [31:0] mem2[1023:0];   // unpacked: 1024 discrete 32-bit words

logic [31:0] x1,x2;
integer i;

always @(posedge clk) begin
  for(i=0;i<1024;i++) begin
    x1 = mem1[i];
    x2 = mem2[i];
  end
end

endmodule

I define two memories of the same size: one as a packed array of 1K 32-bit words, the other as an unpacked array. I use a clock edge to avoid an extremely large sensitivity list, and after manually renaming the mangled labels in the .vvp file to make it more human-readable, the two memories appear as:

mem1 .var "mem1", 32767 0;
mem2 .array "mem2", 0 1023, 31 0;

Note that mem1 has been flattened into a single 32768-bit variable, while mem2 remains a 1024-entry array of 32-bit words. The first assignment in the loop, which reads the packed array into x1, compiles to the following opcodes:

    %load/vec4 mem1;
    %load/vec4 i;
    %pad/s 37;
    %muli 32, 0, 37;
    %part/s 32;
    %store/vec4 x1, 0, 32;
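
As a rough illustration, here is a toy C++ model of what this sequence does for x1 = mem1[i] (simplified types and a hypothetical function name, not the actual vvp implementation): the full canonical value is copied, the bit offset i*32 is computed, and a 32-bit slice is extracted from the copy.

    #include <cstdint>
    #include <vector>

    // Toy model: 'mem1_bits' stands in for mem1's full 32768-bit canonical value.
    std::vector<bool> load_packed_word(const std::vector<bool>& mem1_bits,
                                       int64_t i, unsigned width = 32)
    {
        // %load/vec4 mem1 -- copies the entire value onto the stack
        std::vector<bool> full = mem1_bits;
        // %load/vec4 i; %pad/s 37; %muli 32, 0, 37 -- bit offset = i * 32
        int64_t off = i * width;
        // %part/s 32 -- extract the 32 bits starting at 'off'
        return std::vector<bool>(full.begin() + off, full.begin() + off + width);
    }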

The second assignment, which reads the unpacked array into x2, compiles to the following opcodes:

    %ix/getv/s 4, i;
    %load/vec4a mem2, 4;
    %store/vec4 x2, 0, 32;
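
For contrast, the same kind of toy model (again with hypothetical names, not the real vvp code) of the unpacked path shows that only the addressed word is touched:

    #include <cstdint>
    #include <vector>

    // Toy model: mem2 is stored as discrete 32-bit words.
    uint32_t load_unpacked_word(const std::vector<uint32_t>& mem2, int64_t i)
    {
        // %ix/getv/s 4, i     -- load i into index register 4
        // %load/vec4a mem2, 4 -- fetch just the addressed word
        return mem2[i];
    }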

The second sequence is clearly more efficient because it can take advantage of the array dimensions directly, whereas the first spends several opcodes computing an index into the packed canonical representation. I raised another issue (#835), which @larsclausen identified as being related to #521, concerning multiple dimensions of packed arrays, so I haven't built a complete matrix of packed and unpacked array structures. However, my experiments with multidimensional unpacked arrays show a merger of the two mechanisms: the efficient unpacked approach is used for the first index and the inefficient packed approach for subsequent indexes.

The unpacked approach has the further advantage that run-time bounds checking could be added to it relatively easily, although multiple dimensions and the mixed packed/unpacked techniques complicate that. That's the subject of #848, and @caryr has an idea for improving it.

Before I get into the meat of my issue, a question for the experts out there: is there any reason to treat packed and unpacked array dimensions differently? I ask because anything dealing with multi-dimensional arrays going forward would be vastly simplified if packed and unpacked arrays were laid out the same way and accessed through the same mechanisms. I also realize that the question of packed vs. unpacked arrays isn't quite congruent with the question of packed vs. unpacked structures.

In addition to the inefficiency of the extra opcodes needed to compute the canonical index, there's a fundamental performance limitation in the %load/vec4 mem1; at the start: it does a bit-by-bit copy (in vvp_fun_signal4_sa::recv_vec4() in vvp/vvp_net_sig.cc) of the entirety of mem1's value onto the top of the stack, and only the needed bits are extracted when we reach the %part. With a large packed array in a tight loop, this means continually allocating a full-width value, copying into it, grabbing a small chunk, and discarding the rest.

If the VM could be modified to use the unpacked array mechanism for all array accesses, that would be ideal. Alternatively, there's another optimization that may well have benefits beyond this specific issue: adopting copy-on-write. In fact, even if the VM were modified to use the unpacked mechanism everywhere, I suspect copy-on-write would still yield noticeable performance improvements elsewhere.
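
A minimal sketch of the copy-on-write idea, using a hypothetical class rather than the real vvp_vector4_t: loads share one underlying buffer by bumping a reference count, and the first mutation of a shared buffer makes a private copy.

    #include <cstddef>
    #include <cstdint>
    #include <memory>
    #include <vector>

    // Minimal copy-on-write sketch (hypothetical class, not existing vvp code).
    class CowVec4 {
        std::shared_ptr<std::vector<uint32_t>> data_;
    public:
        explicit CowVec4(std::vector<uint32_t> v)
            : data_(std::make_shared<std::vector<uint32_t>>(std::move(v))) {}

        // A %load/vec4 analogue would just bump the refcount instead of
        // copying every bit.
        CowVec4(const CowVec4&) = default;

        uint32_t word(std::size_t i) const { return (*data_)[i]; }

        // Any mutating op copies the buffer first, but only if it is shared.
        void set_word(std::size_t i, uint32_t v) {
            if (data_.use_count() > 1)
                data_ = std::make_shared<std::vector<uint32_t>>(*data_);
            (*data_)[i] = v;
        }
    };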

I think the simplest implementation would be to modify %load/vec4 to push a vvp_net_t onto the stack instead of creating a vvp_vector4_t, copying the value into it, and pushing that. This would require that anything popping a value from the stack operates correctly when it gets a vvp_net_t instead of a vvp_vector4_t, which may require a reorganization of the class hierarchy to make it efficient. Anything that modifies the value on the top of the stack (i.e. anything that uses vec4_peek(), which includes %part) would then need to perform the actual copy-on-write operation, replacing the vvp_net_t on the top of the stack with a new vvp_vector4_t.
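
One way to model that stack change is a tagged entry that holds either a borrowed net reference or an owned value, with the mutating peek doing the materialization. This is only a sketch with stand-in types, not a proposal for the actual class layout:

    #include <cstdint>
    #include <variant>
    #include <vector>

    struct Net { std::vector<uint32_t> bits; };       // stands in for vvp_net_t
    struct OwnedVec4 { std::vector<uint32_t> bits; }; // stands in for vvp_vector4_t

    using StackEntry = std::variant<const Net*, OwnedVec4>;

    // A vec4_peek() analogue for mutation: materialize a private copy the
    // first time a mutating opcode (such as %part) touches the entry.
    OwnedVec4& peek_for_write(StackEntry& e)
    {
        if (auto pp = std::get_if<const Net*>(&e))
            e = OwnedVec4{(*pp)->bits};   // the copy-on-write happens here
        return std::get<OwnedVec4>(e);
    }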

There are some potential additional benefits to doing this. In #848 I point out that with the current packed array access mechanism, the %load/vec4 value copy drops any information associated with the net, so producing meaningful warning messages at %part time is essentially impossible. Having %load/vec4 place the vvp_net_t on the stack instead of a vvp_vector4_t gives %part a fighting chance of finding the net's name, assuming there's an efficient mechanism for inverting the vpiHandle-to-vvp_net_t reference.
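
Such a reverse mapping could be as simple as a hash table populated at elaboration time. A sketch, with hypothetical names (nothing like this exists in vvp today):

    #include <string>
    #include <unordered_map>

    struct Net;  // stands in for vvp_net_t

    // Filled in once at elaboration so %part can recover the name of the
    // net sitting on the stack.
    static std::unordered_map<const Net*, std::string> net_names;

    void register_net_name(const Net* n, std::string vpi_full_name)
    {
        net_names.emplace(n, std::move(vpi_full_name));
    }

    const char* net_name(const Net* n)
    {
        auto it = net_names.find(n);
        return it == net_names.end() ? "<unknown net>" : it->second.c_str();
    }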

caryr commented 1 year ago

Something to keep in mind is that unpacked arrays can have strings, reals, etc. as element types, so be sure to account for that when thinking about refactoring all this.
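
In other words, a unified element store would need to dispatch on element type rather than assume vec4 words. A sketch of what that implies for the storage, with stand-in types:

    #include <cstdint>
    #include <string>
    #include <variant>
    #include <vector>

    // An unpacked array's element may be a bit vector, a real, or a string,
    // so access opcodes must dispatch on the element type.
    using Element = std::variant<std::vector<uint32_t>,  // vector/word element
                                 double,                 // real element
                                 std::string>;           // string element

    using UnpackedArray = std::vector<Element>;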

caryr commented 1 year ago

Another thing to consider is that we would like all of this to work correctly with the VPI, so please review those requirements as well. Welcome to the "I'm going to learn more about SystemVerilog than I ever expected" experience ;).

wmlye commented 1 year ago

I kind of suspected that was the case. I'm back to my day job on Tuesday, so I won't be raising anywhere near as many issues, but my interest is definitely piqued :-)

If I were to start going through some of the data structures and modifying comments to make them Doxygen-compliant in advance of doing something like this, would that be seen as something useful?