scheduling matrix multiplication

I'm trying to convert the matrix_mul_4x4.x file in examples/ to matrix_mul_4x4.ir using this command, but I always get this Top-level type info not found error:

./bazel-bin/xls/dslx/ir_convert/ir_converter_main --top=matmul /home/ubuntu/xls-debug/scripts/matrix_mul_4x4.x > /home/ubuntu/xls-debug/scripts/matrix_mul_4x4.ir

F0525 14:44:46.769275 217674 ir_converter_main.cc:163] Check failed: ::absl::OkStatus() == (status) (OK vs. NOT_FOUND: Top-level type info not found for proc "matmul".)

When I convert my own matrix multiplication (with fixed size) ir to verilog, the generated verilog result always load everything in the first stage and do all the umul and add computation in the second stage, even if I specify pipeline stages (more than 2) and clock_period_ps. Could you please explain wh? The documentation states that if pipeline stages and clock_period_ps are specified, the tool will schedule minimum registers. However, it seems that all computations are unrolled in one stage, resembling ASAP scheduling rather than minimum register scheduling. Could you clarify this behavior?

Additionally, does the scheduling in codegen introduce any parallelism? such as splitting the matrix into blocks and pipelining the block computation to reduce the number of registers? Is this parallelism only introduced when using proc and channel in DSL, or does it apply in pass optimization or scheduling?

Addressing the questions in order:

Matmul is parameterized, so the <ROWS: u32, COLS: u32> need to be specified before invoking the IR conversion. What you mention you did, instantiating it for a given size so you have a non-parameterized thing to point at, sounds like the right way to go.
The scheduler will try to balance slack across pipeline stages if you specify a fixed number of pipeline stages. The benchmark_main tool should help to show the resulting schedule. You didn't mention which delay model you're using -- make sure you're using a non-trivial delay model. Reproduction commands you can share could help.
The scheduler can introduce pipeline parallelism within a proc, and can schedule concurrent operations within a proc, and the optimizer has scope to inline procs into one another. XLS doesn't manipulate parallelism beyond that set of transforms at the moment. In a sense, the input programs XLS gets are embarassingly concurrent, constrained only by dataflow dependencies, and the user tamps down on the concurrency of side-effecting communication using token sequencing (threading the token values between operations) to sequentialize operations. In the future we hope to be able to optimize networks of multiple communicating processes automatically based on e.g. output throughput requirements from a network of procs, but we're not there yet. Note (as is mentioned in the README), XLS doesn't focus on the triple-nested-loop-to-chip kind of tasks -- historically High Level Synthesis tended to focus on those to fill up entire FPGAs and similar. XLS focuses on up-leveling the hardware design process for blocks, it's more like RTL++ in terms of structure that's described and control of the result, which which helps to avoid the surprises that people tend to associate with HLS -- we tend to call this level of abstraction / productivity Mid-Level Synthesis.

I'm using the same script as matmul_4x4.x in examples (https://github.com/google/xls/blob/main/xls/examples/matmul_4x4.x), except that I deleted line 115 #[test_proc], so that I could use test_proc as the top function. but I get an out_of_range error:

terminate called after throwing an instance of 'std::out_of_range' what(): absl::container_internal::raw_hash_map<>::at Aborted (core dumped)

below are the command and script I use.

./bazel-bin/xls/dslx/ir_convert/ir_converter_main --top=test_proc /home/ubuntu/xls-debug/scripts/matmul_4x4.x > /home/ubuntu/xls-debug/scripts/matmul_4x4.ir

// Copyright 2021 The XLS Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//      http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// DSLX implementation of a 4x4 systolic array, appropriate for part of a
// matrix multiplier.

// TODO(rspringer): 2021-09-16, issue #497: The channel declarations here are a
// bit unwieldy; if we can use arrays-of-channels, that'll make things cleaner.

import float32
import xls.modules.fp.fp32_mul_2

type F32 = float32::F32;

// "node" performs the actual work of this systolic array, multiplying an input
// activation by the baked-in weight.
proc node {
  from_west: chan<F32> in;
  from_north: chan<F32> in;
  to_east: chan<F32> out;
  to_south: chan<F32> out;
  weight: F32;

  init { () }

  config(from_west: chan<F32> in, from_north: chan<F32> in,
         to_east: chan<F32> out, to_south: chan<F32> out, weight: F32) {
    (from_west, from_north, to_east, to_south, weight)
  }

  next(tok: token, state: ()) {
    let (tok, activation) = recv(tok, from_west);
    let (tok, partial_sum) = recv(tok, from_north);

    // Compute our partial product.
    let product = fp32_mul_2::fp32_mul_2(activation, weight);

    // Send the activation east and the partial product south.
    let tok = send(tok, to_east, activation);
    let tok = send(tok, to_south, product);
    ()
  }
}

proc matmul<ROWS: u32, COLS: u32> {
  zeroes_out: chan<F32>[COLS] out;

  init { () }

  config(activations_in: chan<F32>[ROWS] in,
         results_out: chan<F32>[COLS] out) {
    // Declare the east-to-west channels.
    // Add one extra channel for easy indexing / going "off the edge".
    // TODO(rspringer): 2022-04-18: Maybe eliminate the need for this?
    let (east_outputs, west_inputs) = chan<F32>[COLS + u32:1][ROWS];

    // Declare the north-to-south channels.
    let (south_outputs, north_inputs) = chan<F32>[COLS][ROWS];

    // TODO(rspringer): Zeros (as initial partial sums) would be best provided
    // by single-value channels.
    // Declare the zero-valued initial partial sum channels.
    let (zeroes_out, zeroes_in) = chan<F32>[COLS];

    // Spawn all the procs. Specify weights to give a "mul-by-two" matrix.
    let f32_0 = float32::zero(false);
    let f32_2 = F32 { sign: false, bexp: u8:128, fraction: u23:0 };

    // TODO(https://github.com/google/xls/issues/585): We can't loop (and thus
    // parameterize) this until we can constexpr evaluate `for` expressions.
    spawn node(activations_in[0], zeroes_in[0], east_outputs[0][1], south_outputs[1][0], f32_2);
    spawn node(west_inputs[0][1], zeroes_in[1], east_outputs[0][2], south_outputs[1][1], f32_0);
    spawn node(west_inputs[0][2], zeroes_in[2], east_outputs[0][3], south_outputs[1][2], f32_0);
    spawn node(west_inputs[0][3], zeroes_in[3], east_outputs[0][4], south_outputs[1][3], f32_0);

    spawn node(activations_in[1], north_inputs[1][0], east_outputs[1][1], south_outputs[2][0], f32_0);
    spawn node(west_inputs[1][1], north_inputs[1][1], east_outputs[1][2], south_outputs[2][1], f32_2);
    spawn node(west_inputs[1][2], north_inputs[1][2], east_outputs[1][3], south_outputs[2][2], f32_0);
    spawn node(west_inputs[1][3], north_inputs[1][3], east_outputs[1][4], south_outputs[2][3], f32_0);

    spawn node(activations_in[2], north_inputs[2][0], east_outputs[2][1], south_outputs[3][0], f32_0);
    spawn node(west_inputs[2][1], north_inputs[2][1], east_outputs[2][2], south_outputs[3][1], f32_0);
    spawn node(west_inputs[2][2], north_inputs[2][2], east_outputs[2][3], south_outputs[3][2], f32_2);
    spawn node(west_inputs[2][3], north_inputs[2][3], east_outputs[2][4], south_outputs[3][3], f32_0);

    spawn node(activations_in[3], north_inputs[3][0], east_outputs[3][1], results_out[0], f32_0);
    spawn node(west_inputs[3][1], north_inputs[3][1], east_outputs[3][2], results_out[1], f32_0);
    spawn node(west_inputs[3][2], north_inputs[3][2], east_outputs[3][3], results_out[2], f32_0);
    spawn node(west_inputs[3][3], north_inputs[3][3], east_outputs[3][4], results_out[3], f32_2);

    (zeroes_out,)
  }

  // All we need to do is to push in "zero" values to the top of the array.
  next(tok: token, state: ()) {
    let tok = send(tok, zeroes_out[0], float32::zero(false));
    let tok = send(tok, zeroes_out[1], float32::zero(false));
    let tok = send(tok, zeroes_out[2], float32::zero(false));
    let tok = send(tok, zeroes_out[3], float32::zero(false));
    ()
  }
}

proc test_proc {
  activations_out: chan<F32>[4] out;
  results_in: chan<F32>[4] in;
  terminator: chan<bool> out;

  init { () }

  config(terminator: chan<bool> out) {
    let (activations_out, activations_in) = chan<F32>[4];
    let (results_out, results_in) = chan<F32>[4];
    spawn matmul<u32:4, u32:4>(activations_in, results_out);
    (activations_out, results_in, terminator)
  }

  next(tok: token, state: ()) {
    let f32_0 = float32::zero(false);
    let f32_2 = F32 {sign: false, bexp: u8:128, fraction: u23:0};
    let f32_4 = F32 {sign: false, bexp: u8:129, fraction: u23:0};

    // Send the desired inputs.
    let tok = for (i, tok): (u32, token) in range(u32:0, u32:4) {
      send(tok, activations_out[i], f32_2)
    } (tok);

    // Send extra inputs to keep the system moving while our results are processing.
    let tok = for (j, tok): (u32, token) in range(u32:0, u32:4) {
      for (i, tok): (u32, token) in range(u32:0, u32:4) {
          send(tok, activations_out[i], f32_0)
      } (tok)
    } (tok);

    // Flush the intermediate values.
    let tok = for (j, tok): (u32, token) in range(u32:0, u32:0) {
      for (i, tok): (u32, token) in range(u32:0, u32:4) {
          let (tok, _) = recv(tok, results_in[i]);
          tok
      } (tok)
    } (tok);

    let (tok, value) = recv(tok, results_in[0]);
    let _ = assert_eq(value, f32_0);
    let (tok, value) = recv(tok, results_in[1]);
    let _ = assert_eq(value, f32_0);
    let (tok, value) = recv(tok, results_in[2]);
    let _ = assert_eq(value, f32_0);
    let (tok, value) = recv(tok, results_in[3]);
    let _ = assert_eq(value, f32_4);

    let tok = send(tok, terminator, true);
    ()
  }
}

Also, when trying my own matrix mul, if I use the unit delay model with specified pipeline_stages and clock_period_ps, all computations would be done in one stage, and the total register number is 18 + 9 (for load and write) = 27.

./bazel-bin/xls/tools/codegen_main --generator=pipeline --pipeline_stages=5 --clock_period_ps=5 --delay_model=unit

import std

fn matrix_mul(a: u32[3][3], b: u32[3][3])
  -> u32[9]{

  for(i, output): (u32, u32[9]) in range (u32:0, u32:3) {
    for(j, output): (u32, u32[9]) in range (u32:0, u32:3) {
      let sum = a[i][0] * b[j][0] + a[i][1] * b[j][1] + a[i][2] * b[j][2];
      let out_idx = i * u32:3 + j;
      update (output, out_idx, sum)
    } (output) // ??? delete ; after this and every thing works. probably because add ; makes it fail to see output as the result?
  } (u32[9]:[0,...])
}

but If a use a non-trivial delay model like sky130, since the computation is split into multiple stages, we need more registers to save the temporary result (like the result of a[0]*b[0]) in the middle. And for this case, 69 registers are recorded.

./bazel-bin/xls/tools/codegen_main --generator=pipeline --pipeline_stages=10000 --clock_period_ps=3000 --delay_model=sky130

since for both of them, pipeline_stages and clock_period_ps are specified, could you please confirm than they both work in the minimal register scheduling? How should I explain this?

By the way, if I don't change the original matmul_4x4 code, the out_of_range error still exist when trying to interpret the code.

./bazel-bin/xls/dslx/interpreter_main /home/ubuntu/xls-debug/scripts/matmul_4x4.x

terminate called after throwing an instance of 'std::out_of_range' what(): absl::container_internal::raw_hash_map<>::at Aborted (core dumped)

If the process spawn works, what will the generated Verilog (or generated architecture) look like? Will it only compute 1/16 matrix multiplication in a pipelined way and use fewer registers?

google / xls

scheduling matrix multiplication #983