HWPE Example STILL Hangs on ZedBoard

This ticket is a follow-up to https://github.com/pulp-platform/pulpissimo/issues/104. I still see the same issue on a fresh master checkout of pulpissimo with a recent SDK (built as documented, via make build-pulp-sdk). As the closed issue suggests, there was possibly a fix in a recent release of the SDK, but the version in the Makefile is pinned to the tag 2019.12.06, which looks quite old. Is a newer version preferable, and does it contain the fix?

Or could it be related to fixes that are not included in the pulp_soc version in ips_list.yml? The version is pinned to v2.0.1, while changes such as the update of the HWPE versions (https://github.com/pulp-platform/pulp_soc/pull/49/commits) were only released in v2.1.0. Is it safe to use this newer version of the SoC, and does it contain the fix?

It looks like v2.1.0 of pulp_soc really does contain the fix, as indicated by this issue, although there has been no further update since February: https://github.com/pulp-platform/pulpissimo/issues/127 Is it safe to use the newer version? UPDATE: a quick test with version v2.1.0 produces the following crash in Vivado synthesis:

Abnormal program termination (11)
#
# An unexpected error has occurred (11)
#
Stack:
/lib/x86_64-linux-gnu/libc.so.6(+0x3f040) [0x7f412b9ec040]
/home/paul/Xilinx/Vivado/2019.1/lib/lnx64.o/librdi_synth.so(ElabSvlgScope::subprogramScope() const+0x2d) [0x7f40731f29dd]
/home/paul/Xilinx/Vivado/2019.1/lib/lnx64.o/librdi_synth.so(ElabSvlgScope::addRhsDecl(ElabSvlgDecl*)+0x17d) [0x7f40731f3a8d]
/home/paul/Xilinx/Vivado/2019.1/lib/lnx64.o/librdi_synth.so(ElabSvlgScope::cleanup()+0x72) [0x7f40731f6d22]
...

I also tried the fix from issue 69 (https://github.com/pulp-platform/pulpissimo/issues/69), but defining ARCHI_SOC_EVENT_FCHWPE0 as 140 still did not fix it. UPDATE: I checked with both 46 and 140 (the default was 140); neither fixed the issue.

My current hwme.c file:

/*
 * Copyright (C) 2018 ETH Zurich and University of Bologna
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

/* 
 * Authors:  Francesco Conti <fconti@iis.ee.ethz.ch>
 */

#include "pulp.h"
#include <stdio.h>
#include <stdint.h>
#include "archi/hwme/hwme_v1.h"
#include "hal/hwme/hwme_v1.h"
#include <rt/rt_api.h>

int __rt_fpga_fc_frequency =20000000; //<Core Frequency> // e.g. 20000000 for 20MHz;
int __rt_fpga_periph_frequency =10000000; // <SoC Frequency> // e.g. 10000000 for 10MHz;
unsigned int __rt_iodev_uart_baudrate = 115200;

//fix taken from https://github.com/pulp-platform/pulpissimo/issues/69
#define ARCHI_SOC_EVENT_FCHWPE0 140

#define USE_STIMULI
// comment below line to run only dot product with bias
//#define DO_MATVEC_MULT
#ifndef DO_MATVEC_MULT
    #define DO_DOT_PROD
#endif

#include "hwme_stimuli_a.h"
#include "hwme_stimuli_b.h"
#include "hwme_stimuli_c.h"
#include "hwme_stimuli_d.h"

int main() {

  uint32_t *a = (uint32_t *) 0x1c010000;
  uint32_t *b = (uint32_t *) 0x1c010200;
  uint32_t *c = (uint32_t *) 0x1c010400;
  uint32_t *d = (uint32_t *) 0x1c010600;

  int coreID = get_core_id();
#ifdef DO_MATVEC_MULT
  // define dimensions
  uint32_t in_vec_len = 8;
  uint32_t out_vec_len = 10;
#endif

  volatile int errors = 0;
  int gold_sum = 0, check_sum = 0;
  int i,j;

  int offload_id_tmp, offload_id;

  if(get_core_id() == 0) {

#ifdef USE_STIMULI
    for(int i=0; i<512; i++) {
      ((uint8_t *) a)[i] = stim_a[i];
    }
    for(int i=0; i<512; i++) {
      ((uint8_t *) b)[i] = stim_b[i];
    }
    for(int i=0; i<512; i++) {
#ifdef DO_MATVEC_MULT
      ((uint8_t *) c)[i] = 0; // no bias for matrix vector multiplication
#else
      ((uint8_t *) c)[i] = stim_c[i];
#endif
    }
    for(int i=0; i<512; i++) {
      ((uint8_t *) d)[i] = stim_d[i];
    }
#else
    for(int i=0; i<128; i++) {
      a[i] = i;
    }
    for(int i=0; i<128; i++) {
      b[i] = i;
    }
    for(int i=0; i<128; i++) {
      c[i] = i;
    }
    for(int i=0; i<128; i++) {
      d[i] = i;
    }
#endif
    printf("Hello !\n");
    for(int i=0; i<4; i++) {
      printf("%d: %08x\n", i, d[i]);
    }

    /* convolution-accumulation - HW */
    plp_hwme_enable();
    printf("Enabled !\n");
    while((offload_id_tmp = hwme_acquire_job()) < 0);
    printf("hwme acquired !\n");
    // set up bytecode
    hwme_bytecode_set(HWME_LOOPS1_OFFS,           0x00000000);
    hwme_bytecode_set(HWME_BYTECODE5_LOOPS0_OFFS, 0x00040000);
    hwme_bytecode_set(HWME_BYTECODE4_OFFS,        0x00000000);
    hwme_bytecode_set(HWME_BYTECODE3_OFFS,        0x00000000);
    hwme_bytecode_set(HWME_BYTECODE2_OFFS,        0x00000000);
    hwme_bytecode_set(HWME_BYTECODE1_OFFS,        0x000008cd);
    hwme_bytecode_set(HWME_BYTECODE0_OFFS,        0x11a13c05);

    // job-dependent registers
    hwme_a_addr_set((unsigned int) a);
    hwme_b_addr_set((unsigned int) b);
    hwme_c_addr_set((unsigned int) c);
    hwme_d_addr_set((unsigned int) d);
#ifdef DO_MATVEC_MULT
    hwme_nb_iter_set(out_vec_len);
    hwme_len_iter_set(in_vec_len-1);
    hwme_vectstride_set(in_vec_len*4); // stride for the matrix is equal to in_vec length * wordsize
    hwme_vectstride2_set(0); // stride for the vector is zero
#else
    hwme_nb_iter_set(4);
    hwme_len_iter_set(32-1);
    hwme_vectstride_set(32*4);
    hwme_vectstride2_set(32*4); // same stride for both streams
#endif
    hwme_shift_simplemul_set(hwme_shift_simplemul_value(0, 0));

    // start HWME operation
    printf("Start HWME operation !\n");
    hwme_trigger_job();
    printf("Started HWME operation !\n");
    // wait for end of computation
    soc_eu_fcEventMask_setEvent(ARCHI_SOC_EVENT_FCHWPE0);
    __rt_periph_wait_event(ARCHI_SOC_EVENT_FCHWPE0, 1);
  //   unsigned long long start_time, end_time;
  //   start_time = rt_time_get_us();
  //   rt_time_wait_us(1200);
  //   end_time = rt_time_get_us();
  //   printf("Start time %lld\n", start_time);
  // printf("End time %lld\n", end_time);
    //hwme_trigger_job();
    plp_hwme_disable();
    printf("Wait for end of computation !\n");
    // check
#ifndef USE_STIMULI
    if(d[0] != 0x000028b0) errors++;
    if(d[1] != 0x000124b1) errors++;
    if(d[2] != 0x000320b2) errors++;
    if(d[3] != 0x00061cb3) errors++;
#else
    #ifdef DO_MATVEC_MULT
        if(d[0] != 0x7CB12A38) errors++;
        if(d[1] != 0xCD4F4DCB) errors++;
        if(d[2] != 0x49CD5D5C) errors++;
        if(d[3] != 0x2A1D8706) errors++;
    #else
        if(d[0] != 0x7f228fd6) errors++;
        if(d[1] != 0x23a7d5c2) errors++;
        if(d[2] != 0x7f281848) errors++;
        if(d[3] != 0x6127d834) errors++;
    #endif
#endif /* USE_STIMULI */

    printf("errors=%d\n", errors);
    printf("Done with computation !\n");
    for(int i=0; i<4; i++) {
      printf("%d: %08x\n", i, d[i]);
    }
  }
   printf("Sync barrier !\n");
   synch_barrier();
   printf("Done with everything !\n");
   return errors;
}

@bluewww @FrancescoConti @meggiman any suggestions on how I could get the rt-example running?

Hi @Fatalon

I am not very familiar with the hwme; @FrancescoConti would be the expert on that. However, did you try running your code in an RTL simulator before trying it on the FPGA? Debugging this kind of issue on the FPGA is significantly harder than doing it in an RTL simulator. Out of the box, we support QuestaSim for RTL simulation.

Also, can you run this code with the 'pulp-runtime'? Pulp-runtime is a more minimal bare-metal SDK that is considerably easier to debug. The 2019 pulp-sdk is deprecated and replaced with a completely new one which, at the moment, is not yet fully feature equivalent with the old one. If you manage to produce a test.c file that compiles with 'pulp-runtime' and triggers the issue you describe I could give it a try on my own machine with the latest pulpissimo master.

Regarding your question on the compatibility of pulpissimo with pulp_soc v2.1.0: I don't think we introduced any breaking changes in the latest pulp_soc, so you should be fine updating it manually. The reason pulpissimo is stuck on an older version is a number of outstanding interdependent pull requests (#217, #218) that change the build flow of pulpissimo. Once they are merged, a new major release of pulpissimo with the latest features of the sub-dependencies will be released.

Hey @meggiman, thanks for your reply.

Unfortunately, I don't own a ModelSim/QuestaSim license and can only rely on running it on the FPGA. I could try to run it with Vivado's integrated simulator, xsim, but I don't know whether this is easily possible. Regarding your question about the pulp-runtime: no, I only tried to run it via the SDK, because the runtime states that it does not support all features, and because of the pulp-runtime-examples repo I thought it might not support HWPE at all, like gvsoc. I will try it out and report back on whether using the pulp-runtime fixes my issue, but at first glance I am not 100% certain that I can even use the runtime without RTL or gvsoc... I will upload a test.c as soon as I have one, thanks.

Regarding pulp_soc v2.1.0: I tried it again today and still got the same issue. I checked the commit history, and it may be related to the replacement of the generic_memory in commit https://github.com/pulp-platform/pulp_soc/commit/db140983aa9a55685d919c0d0ab7d7718b65664d, since the private and interleaved RAM are also tracked inside pulpissimo/fpga/pulpissimo-board/ips. But this is just a wild guess, as I am still a beginner in this field.

Is there a specific newer version for the hwme that could be used with pulp_soc v2.0.1, which would work out of the box?

@meggiman I played around with the runtime, but as soon as I execute hwme_acquire_job() I think I access bad memory, resulting in "unable to halt hart 992". Are you sure that the runtime is even able to handle hwme? My current test.c is basically hwme.c, but with copies of the functions that the runtime was missing. I also needed to change the declaration of synch_barrier() in the pulp-runtime (include/pulp.h) to be static, as it is in the SDK (changed the signature to static void synch_barrier();).

#include <pulp.h>
#include <stdio.h>
#include <stdint.h>
#include "archi/hwme/hwme_v1.h"
#include "hal/hwme/hwme_v1.h"

#include "hal/pulp.h"

int __rt_fpga_fc_frequency =20000000; //<Core Frequency> // e.g. 20000000 for 20MHz;
int __rt_fpga_periph_frequency =10000000; // <SoC Frequency> // e.g. 10000000 for 10MHz;
unsigned int __rt_iodev_uart_baudrate = 115200;

#define USE_STIMULI
// comment below line to run only dot product with bias
//#define DO_MATVEC_MULT
#ifndef DO_MATVEC_MULT
    #define DO_DOT_PROD
#endif

#include "hwme_stimuli_a.h"
#include "hwme_stimuli_b.h"
#include "hwme_stimuli_c.h"
#include "hwme_stimuli_d.h"

#define RT_FC_TINY_DATA __attribute__((section(".data_tiny_fc"))) __attribute__ ((tiny))
volatile RT_FC_TINY_DATA unsigned int __rt_socevents_status[ARCHI_SOC_EVENT_NB_TOTAL/32];

static inline void synch_barrier() {
#if defined(ARCHI_HAS_CLUSTER)
#if defined(EU_VERSION) && EU_VERSION >= 3
  if (!rt_is_fc()) rt_team_barrier();
#endif
#endif
}

static inline int rt_irq_disable()
{
  return hal_irq_disable();
}

static inline void rt_irq_restore(int irq)
{
  hal_irq_restore(irq);
}

static inline void rt_irq_enable()
{
  __asm__ __volatile__ ("" : : : "memory");
  hal_irq_enable();
}

#ifdef __riscv__
static inline void rt_wait_for_interrupt()
{
#if !defined(ARCHI_HAS_FC) || defined(ARCHI_HAS_FC_EU)
  eu_evt_wait();
#else
  hal_itc_wait_for_interrupt();
#endif
}
#else
void rt_wait_for_interrupt();
#endif

void __rt_periph_wait_event(int event, int clear)
{
  int irq = rt_irq_disable();

  int index = event >> 5;
  event &= 0x1f;

  while(!((__rt_socevents_status[index] >> event) & 1))
  {
    rt_wait_for_interrupt();
    rt_irq_enable();
    rt_irq_disable();
  }

  if (clear) __rt_socevents_status[index] &= ~(1<<event);

  rt_irq_restore(irq);
}

int main() {

  uint32_t *a = (uint32_t *) 0x1c010000;
  uint32_t *b = (uint32_t *) 0x1c010200;
  uint32_t *c = (uint32_t *) 0x1c010400;
  uint32_t *d = (uint32_t *) 0x1c010600;

  int coreID = get_core_id();
#ifdef DO_MATVEC_MULT
  // define dimensions
  uint32_t in_vec_len = 8;
  uint32_t out_vec_len = 10;
#endif

  volatile int errors = 0;
  int gold_sum = 0, check_sum = 0;
  int i,j;

  int offload_id_tmp, offload_id;

  if(get_core_id() == 0) {

#ifdef USE_STIMULI
    for(int i=0; i<512; i++) {
      ((uint8_t *) a)[i] = stim_a[i];
    }
    for(int i=0; i<512; i++) {
      ((uint8_t *) b)[i] = stim_b[i];
    }
    for(int i=0; i<512; i++) {
#ifdef DO_MATVEC_MULT
      ((uint8_t *) c)[i] = 0; // no bias for matrix vector multiplication
#else
      ((uint8_t *) c)[i] = stim_c[i];
#endif
    }
    for(int i=0; i<512; i++) {
      ((uint8_t *) d)[i] = stim_d[i];
    }
#else
    for(int i=0; i<128; i++) {
      a[i] = i;
    }
    for(int i=0; i<128; i++) {
      b[i] = i;
    }
    for(int i=0; i<128; i++) {
      c[i] = i;
    }
    for(int i=0; i<128; i++) {
      d[i] = i;
    }
#endif
    printf("Running with %d as event id!\n",ARCHI_SOC_EVENT_FCHWPE0);
    for(int i=0; i<4; i++) {
      printf("%d: %08x\n", i, d[i]);
    }

    /* convolution-accumulation - HW */
    plp_hwme_enable();
    printf("Enabled !\n");
    while((offload_id_tmp = hwme_acquire_job()) < 0);
    printf("hwme acquired !\n");
    // set up bytecode
    hwme_bytecode_set(HWME_LOOPS1_OFFS,           0x00000000);
    hwme_bytecode_set(HWME_BYTECODE5_LOOPS0_OFFS, 0x00040000);
    hwme_bytecode_set(HWME_BYTECODE4_OFFS,        0x00000000);
    hwme_bytecode_set(HWME_BYTECODE3_OFFS,        0x00000000);
    hwme_bytecode_set(HWME_BYTECODE2_OFFS,        0x00000000);
    hwme_bytecode_set(HWME_BYTECODE1_OFFS,        0x000008cd);
    hwme_bytecode_set(HWME_BYTECODE0_OFFS,        0x11a13c05);

    // job-dependent registers
    hwme_a_addr_set((unsigned int) a);
    hwme_b_addr_set((unsigned int) b);
    hwme_c_addr_set((unsigned int) c);
    hwme_d_addr_set((unsigned int) d);
#ifdef DO_MATVEC_MULT
    hwme_nb_iter_set(out_vec_len);
    hwme_len_iter_set(in_vec_len-1);
    hwme_vectstride_set(in_vec_len*4); // stride for the matrix is equal to in_vec length * wordsize
    hwme_vectstride2_set(0); // stride for the vector is zero
#else
    hwme_nb_iter_set(4);
    hwme_len_iter_set(32-1);
    hwme_vectstride_set(32*4);
    hwme_vectstride2_set(32*4); // same stride for both streams
#endif
    hwme_shift_simplemul_set(hwme_shift_simplemul_value(0, 0));

    // start HWME operation
    printf("Start HWME operation !\n");
    hwme_trigger_job();
    printf("Started HWME operation !\n");
    // wait for end of computation
    soc_eu_fcEventMask_setEvent(ARCHI_SOC_EVENT_FCHWPE0);
    __rt_periph_wait_event(ARCHI_SOC_EVENT_FCHWPE0, 1);
  //   unsigned long long start_time, end_time;
  //   start_time = rt_time_get_us();
  //   rt_time_wait_us(1200);
  //   end_time = rt_time_get_us();
  //   printf("Start time %lld\n", start_time);
  // printf("End time %lld\n", end_time);
    //hwme_trigger_job();
    plp_hwme_disable();
    printf("Wait for end of computation !\n");
    // check
#ifndef USE_STIMULI
    if(d[0] != 0x000028b0) errors++;
    if(d[1] != 0x000124b1) errors++;
    if(d[2] != 0x000320b2) errors++;
    if(d[3] != 0x00061cb3) errors++;
#else
    #ifdef DO_MATVEC_MULT
        if(d[0] != 0x7CB12A38) errors++;
        if(d[1] != 0xCD4F4DCB) errors++;
        if(d[2] != 0x49CD5D5C) errors++;
        if(d[3] != 0x2A1D8706) errors++;
    #else
        if(d[0] != 0x7f228fd6) errors++;
        if(d[1] != 0x23a7d5c2) errors++;
        if(d[2] != 0x7f281848) errors++;
        if(d[3] != 0x6127d834) errors++;
    #endif
#endif /* USE_STIMULI */

    printf("errors=%d\n", errors);
    printf("Done with computation !\n");
    for(int i=0; i<4; i++) {
      printf("%d: %08x\n", i, d[i]);
    }
  }
   printf("Sync barrier !\n");
   synch_barrier();
   printf("Done with everything !\n");
   return errors;
}

Output via UART:

Running with 140 as event id!
0: 8397dde2
1: 27c1cf5a
2: d413b753
3: 946a5ddb
Enabled !

gdb output:

168     /* convolution-accumulation - HW */
169     plp_hwme_enable();
170     printf("Enabled !\n");
171     while((offload_id_tmp = hwme_acquire_job()) < 0);
172     printf("hwme acquired !\n");
173     // set up bytecode
174     hwme_bytecode_set(HWME_LOOPS1_OFFS,           0x00000000);
(gdb) b 171
Breakpoint 3 at 0x1c0081ea: file hwme.c, line 171.
(gdb) c
Continuing.
^Cunable to halt hart 992
  dmcontrol=0x83e00001
  dmstatus =0x00030c82
Remote failure reply: E0E
Remote failure reply: E0E

I will need to check that this memory issue is not a result of my interrupting the execution. But even without breakpoints, execution stops at this line, and at that point the riscy core is broken and no longer responds to other commands.

Sorry for the long response latency -- unfortunately I don't have as much time as I used to to track GitHub :(

I see two possible software-related issues:

HWME mapped at the wrong address: it should not be this one, as I checked in the runtime code. Still, double-check that the base address at which HWME is mapped is 0x1A10C000.
HWME event connected incorrectly: I did not track PULPissimo's evolution too closely on this side, and it is possible that the event is no longer used correctly. If you use a busy-waiting loop instead and the accelerator starts working like a charm, then this is the actual problem.

If I manage to find some time, I'll try to have a look at both of these in RTL sim.

As for possible HW-related issues, I am not aware of anything specific. HW accelerators usually synthesize without particular issues in ASIC and, as far as I know, on FPGA, possibly with some modifications (replacing latches with flip-flops, for example). I will drop a few more comments in the other issue you opened: https://github.com/pulp-platform/hwpe-ctrl/issues/9

No problem. As I wrote in the other issue, I was busy myself and unfortunately didn't have the time to investigate further. At least I checked ARCHI_FC_HWPE_ADDR, which was mapped to 0x1A10C000 in my case (which looks right). Regarding the busy waiting: I couldn't try it yet, but will test it if I find more time.

Hi @Fatalon,

I'm working on getting the default HWPE to work on the ZCU102, and am currently stuck as explained in #274. Have you found time to investigate/debug the problem? Is the HWPE running correctly on FPGA on your end?

Hey @yttuncel, unfortunately I had to return my borrowed ZedBoard to my university. But your issue sounds different from the one in this thread, so I guess you were at least able to trigger your job successfully (and get past my issue). Regarding the state of the hwpe example inside rt-examples: as far as I understand it, the example code was written on top of the old SDK, which is no longer supported (https://github.com/pulp-platform/pulp-sdk) and has been unaltered for over 3 years. Perhaps you can get an older version of pulpissimo running. An alternative solution was already mentioned here by @meggiman: "The 2019 pulp-sdk is deprecated and replaced with a completely new one which, at the moment, is not yet fully feature equivalent with the old one. If you manage to produce a test.c file that compiles with 'pulp-runtime' and triggers the issue you describe I could give it a try on my own machine with the latest pulpissimo master."

pulp-platform / pulpissimo

HWPE Example STILL Hangs on ZedBoard #257