google / xls

XLS: Accelerated HW Synthesis
http://google.github.io/xls/
Apache License 2.0
1.15k stars 165 forks source link

scheduling matrix multiplication #983

Open guoriyue opened 1 year ago

guoriyue commented 1 year ago

I'm trying to convert the matrix_mul_4x4.x file in examples/ to matrix_mul_4x4.ir using this command, but I always get this Top-level type info not found error:

./bazel-bin/xls/dslx/ir_convert/ir_converter_main --top=matmul /home/ubuntu/xls-debug/scripts/matrix_mul_4x4.x > /home/ubuntu/xls-debug/scripts/matrix_mul_4x4.ir

F0525 14:44:46.769275 217674 ir_converter_main.cc:163] Check failed: ::absl::OkStatus() == (status) (OK vs. NOT_FOUND: Top-level type info not found for proc "matmul".)

When I convert my own matrix multiplication (with fixed size) ir to verilog, the generated verilog result always load everything in the first stage and do all the umul and add computation in the second stage, even if I specify pipeline stages (more than 2) and clock_period_ps. Could you please explain wh? The documentation states that if pipeline stages and clock_period_ps are specified, the tool will schedule minimum registers. However, it seems that all computations are unrolled in one stage, resembling ASAP scheduling rather than minimum register scheduling. Could you clarify this behavior?

Additionally, does the scheduling in codegen introduce any parallelism? such as splitting the matrix into blocks and pipelining the block computation to reduce the number of registers? Is this parallelism only introduced when using proc and channel in DSL, or does it apply in pass optimization or scheduling?

cdleary commented 1 year ago

Addressing the questions in order:

guoriyue commented 1 year ago

I'm using the same script as matmul_4x4.x in examples (https://github.com/google/xls/blob/main/xls/examples/matmul_4x4.x), except that I deleted line 115 #[test_proc], so that I could use test_proc as the top function. but I get an out_of_range error:

terminate called after throwing an instance of 'std::out_of_range' what(): absl::container_internal::raw_hash_map<>::at Aborted (core dumped)

below are the command and script I use.

./bazel-bin/xls/dslx/ir_convert/ir_converter_main --top=test_proc /home/ubuntu/xls-debug/scripts/matmul_4x4.x > /home/ubuntu/xls-debug/scripts/matmul_4x4.ir

// Copyright 2021 The XLS Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//      http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// DSLX implementation of a 4x4 systolic array, appropriate for part of a
// matrix multiplier.

// TODO(rspringer): 2021-09-16, issue #497: The channel declarations here are a
// bit unwieldy; if we can use arrays-of-channels, that'll make things cleaner.

import float32
import xls.modules.fp.fp32_mul_2

type F32 = float32::F32;

// "node" performs the actual work of this systolic array, multiplying an input
// activation by the baked-in weight.
proc node {
  from_west: chan<F32> in;
  from_north: chan<F32> in;
  to_east: chan<F32> out;
  to_south: chan<F32> out;
  weight: F32;

  init { () }

  config(from_west: chan<F32> in, from_north: chan<F32> in,
         to_east: chan<F32> out, to_south: chan<F32> out, weight: F32) {
    (from_west, from_north, to_east, to_south, weight)
  }

  next(tok: token, state: ()) {
    let (tok, activation) = recv(tok, from_west);
    let (tok, partial_sum) = recv(tok, from_north);

    // Compute our partial product.
    let product = fp32_mul_2::fp32_mul_2(activation, weight);

    // Send the activation east and the partial product south.
    let tok = send(tok, to_east, activation);
    let tok = send(tok, to_south, product);
    ()
  }
}

proc matmul<ROWS: u32, COLS: u32> {
  zeroes_out: chan<F32>[COLS] out;

  init { () }

  config(activations_in: chan<F32>[ROWS] in,
         results_out: chan<F32>[COLS] out) {
    // Declare the east-to-west channels.
    // Add one extra channel for easy indexing / going "off the edge".
    // TODO(rspringer): 2022-04-18: Maybe eliminate the need for this?
    let (east_outputs, west_inputs) = chan<F32>[COLS + u32:1][ROWS];

    // Declare the north-to-south channels.
    let (south_outputs, north_inputs) = chan<F32>[COLS][ROWS];

    // TODO(rspringer): Zeros (as initial partial sums) would be best provided
    // by single-value channels.
    // Declare the zero-valued initial partial sum channels.
    let (zeroes_out, zeroes_in) = chan<F32>[COLS];

    // Spawn all the procs. Specify weights to give a "mul-by-two" matrix.
    let f32_0 = float32::zero(false);
    let f32_2 = F32 { sign: false, bexp: u8:128, fraction: u23:0 };

    // TODO(https://github.com/google/xls/issues/585): We can't loop (and thus
    // parameterize) this until we can constexpr evaluate `for` expressions.
    spawn node(activations_in[0], zeroes_in[0], east_outputs[0][1], south_outputs[1][0], f32_2);
    spawn node(west_inputs[0][1], zeroes_in[1], east_outputs[0][2], south_outputs[1][1], f32_0);
    spawn node(west_inputs[0][2], zeroes_in[2], east_outputs[0][3], south_outputs[1][2], f32_0);
    spawn node(west_inputs[0][3], zeroes_in[3], east_outputs[0][4], south_outputs[1][3], f32_0);

    spawn node(activations_in[1], north_inputs[1][0], east_outputs[1][1], south_outputs[2][0], f32_0);
    spawn node(west_inputs[1][1], north_inputs[1][1], east_outputs[1][2], south_outputs[2][1], f32_2);
    spawn node(west_inputs[1][2], north_inputs[1][2], east_outputs[1][3], south_outputs[2][2], f32_0);
    spawn node(west_inputs[1][3], north_inputs[1][3], east_outputs[1][4], south_outputs[2][3], f32_0);

    spawn node(activations_in[2], north_inputs[2][0], east_outputs[2][1], south_outputs[3][0], f32_0);
    spawn node(west_inputs[2][1], north_inputs[2][1], east_outputs[2][2], south_outputs[3][1], f32_0);
    spawn node(west_inputs[2][2], north_inputs[2][2], east_outputs[2][3], south_outputs[3][2], f32_2);
    spawn node(west_inputs[2][3], north_inputs[2][3], east_outputs[2][4], south_outputs[3][3], f32_0);

    spawn node(activations_in[3], north_inputs[3][0], east_outputs[3][1], results_out[0], f32_0);
    spawn node(west_inputs[3][1], north_inputs[3][1], east_outputs[3][2], results_out[1], f32_0);
    spawn node(west_inputs[3][2], north_inputs[3][2], east_outputs[3][3], results_out[2], f32_0);
    spawn node(west_inputs[3][3], north_inputs[3][3], east_outputs[3][4], results_out[3], f32_2);

    (zeroes_out,)
  }

  // All we need to do is to push in "zero" values to the top of the array.
  next(tok: token, state: ()) {
    let tok = send(tok, zeroes_out[0], float32::zero(false));
    let tok = send(tok, zeroes_out[1], float32::zero(false));
    let tok = send(tok, zeroes_out[2], float32::zero(false));
    let tok = send(tok, zeroes_out[3], float32::zero(false));
    ()
  }
}

proc test_proc {
  activations_out: chan<F32>[4] out;
  results_in: chan<F32>[4] in;
  terminator: chan<bool> out;

  init { () }

  config(terminator: chan<bool> out) {
    let (activations_out, activations_in) = chan<F32>[4];
    let (results_out, results_in) = chan<F32>[4];
    spawn matmul<u32:4, u32:4>(activations_in, results_out);
    (activations_out, results_in, terminator)
  }

  next(tok: token, state: ()) {
    let f32_0 = float32::zero(false);
    let f32_2 = F32 {sign: false, bexp: u8:128, fraction: u23:0};
    let f32_4 = F32 {sign: false, bexp: u8:129, fraction: u23:0};

    // Send the desired inputs.
    let tok = for (i, tok): (u32, token) in range(u32:0, u32:4) {
      send(tok, activations_out[i], f32_2)
    } (tok);

    // Send extra inputs to keep the system moving while our results are processing.
    let tok = for (j, tok): (u32, token) in range(u32:0, u32:4) {
      for (i, tok): (u32, token) in range(u32:0, u32:4) {
          send(tok, activations_out[i], f32_0)
      } (tok)
    } (tok);

    // Flush the intermediate values.
    let tok = for (j, tok): (u32, token) in range(u32:0, u32:0) {
      for (i, tok): (u32, token) in range(u32:0, u32:4) {
          let (tok, _) = recv(tok, results_in[i]);
          tok
      } (tok)
    } (tok);

    let (tok, value) = recv(tok, results_in[0]);
    let _ = assert_eq(value, f32_0);
    let (tok, value) = recv(tok, results_in[1]);
    let _ = assert_eq(value, f32_0);
    let (tok, value) = recv(tok, results_in[2]);
    let _ = assert_eq(value, f32_0);
    let (tok, value) = recv(tok, results_in[3]);
    let _ = assert_eq(value, f32_4);

    let tok = send(tok, terminator, true);
    ()
  }
}
guoriyue commented 1 year ago

Also, when trying my own matrix mul, if I use the unit delay model with specified pipeline_stages and clock_period_ps, all computations would be done in one stage, and the total register number is 18 + 9 (for load and write) = 27.

./bazel-bin/xls/tools/codegen_main --generator=pipeline --pipeline_stages=5 --clock_period_ps=5 --delay_model=unit

import std

fn matrix_mul(a: u32[3][3], b: u32[3][3])
  -> u32[9]{

  for(i, output): (u32, u32[9]) in range (u32:0, u32:3) {
    for(j, output): (u32, u32[9]) in range (u32:0, u32:3) {
      let sum = a[i][0] * b[j][0] + a[i][1] * b[j][1] + a[i][2] * b[j][2];
      let out_idx = i * u32:3 + j;
      update (output, out_idx, sum)
    } (output) // ??? delete ; after this and every thing works. probably because add ; makes it fail to see output as the result?
  } (u32[9]:[0,...])
}

but If a use a non-trivial delay model like sky130, since the computation is split into multiple stages, we need more registers to save the temporary result (like the result of a[0]*b[0]) in the middle. And for this case, 69 registers are recorded.

./bazel-bin/xls/tools/codegen_main --generator=pipeline --pipeline_stages=10000 --clock_period_ps=3000 --delay_model=sky130

since for both of them, pipeline_stages and clock_period_ps are specified, could you please confirm than they both work in the minimal register scheduling? How should I explain this?

guoriyue commented 1 year ago

By the way, if I don't change the original matmul_4x4 code, the out_of_range error still exist when trying to interpret the code.

./bazel-bin/xls/dslx/interpreter_main /home/ubuntu/xls-debug/scripts/matmul_4x4.x

terminate called after throwing an instance of 'std::out_of_range' what(): absl::container_internal::raw_hash_map<>::at Aborted (core dumped)

guoriyue commented 1 year ago

If the process spawn works, what will the generated Verilog (or generated architecture) look like? Will it only compute 1/16 matrix multiplication in a pipelined way and use fewer registers?