pulp-platform / redmule

Add Redundancy to RedMulE #28

Open Lynx005F opened 2 months ago

Lynx005F commented 2 months ago

This PR adds redundancy to RedMulE:

  1. It always implements basic datapath redundancy, which can be selected in software by setting redundancy = 1 in the redmule config:

    • For the compute elements, two rows of compute elements "mirror" the same computation
    • For the FIFOs and Buffers, data is replicated serially (in time) or in parallel, depending on the location
    • Both of these are checked by a single output comparator with one data width of FFs
    • Parity bits are stored in the register file to ensure it does not get corrupted in between SW writes and HW reads.

    This part has very low data overhead, but only allows for rough redundancy: datapath faults will get detected, but control is generally vulnerable. Some parts of the W-Input datapath are also vulnerable.
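As a rough behavioral illustration of the mirroring, comparator and parity ideas in item 1 (not the RTL in this PR), the following Python sketch models two rows computing the same operation and a comparator flagging any mismatch. The names `mirrored_compute`, `mac`, `parity` and `parity_ok` are hypothetical, and integers stand in for the floating-point values of the real datapath.

```python
def parity(word: int) -> int:
    """Return the even-parity bit of a non-negative integer word."""
    return bin(word).count("1") & 1


def parity_ok(word: int, stored_parity: int) -> bool:
    """Check a word read back from a register against its stored parity bit."""
    return parity(word) == stored_parity


def mac(a: int, b: int, c: int) -> int:
    """Toy multiply-accumulate standing in for one compute element."""
    return a * b + c


def mirrored_compute(op, operands, fault_row=None, fault_mask=0):
    """Run `op` on two mirrored rows and compare the outputs.

    `fault_row` (0 or 1) and `fault_mask` are modelling knobs only: the mask
    is XORed into one row's result to imitate a datapath fault.
    Returns (primary_result, mismatch_detected).
    """
    primary = op(*operands)
    mirror = op(*operands)
    if fault_row == 0:
        primary ^= fault_mask
    elif fault_row == 1:
        mirror ^= fault_mask
    return primary, primary != mirror


# Fault-free run: both rows agree, no mismatch is flagged.
clean = mirrored_compute(mac, (3, 4, 5))
# A bit flip in the mirror row is caught by the comparator.
faulty = mirrored_compute(mac, (3, 4, 5), fault_row=1, fault_mask=0b100)
```

Note that a comparator of this kind only detects a fault; it cannot tell which row is wrong, which matches the detect-and-abort behavior described in this PR rather than correction.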

  2. It additionally implements control redundancy which can be enabled with the USE_REDUNDANCY parameter bit.

    • The redmule_scheduler and redmule_controller FSMs are replicated and outputs compared
    • The HCI / HWPE Modules (Muxes / Fifos) are replicated with a smaller datapath on the replica, allowing their control FSMs to also be protected
    • For the vulnerable parts of the W-Input datapath, parity bits or full duplication are used to ensure faults can be detected

    This introduces about 8% area overhead and achieves a high level of fault tolerance. With the included fault injection scripts, injecting on any signal with equal likelihood (i.e. control signals are overstressed compared to faults in the wild, and control is typically harder to protect), it results in a correct termination for 99.99% of injected faults. To put that into perspective, a non-fault-tolerant RedMulE would correctly terminate for about 85% of injected faults due to masking.
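To make the "correct termination" metric concrete, here is a minimal Python sketch of how such a campaign could be tallied. It is not the fault injection script shipped with this PR. It assumes that a run counts as correctly terminated when the fault is either masked (result still correct) or detected and reported, and the function names and outcome labels are hypothetical.

```python
import random
from collections import Counter


def pick_injection_targets(signals, n, seed=0):
    """Choose n injection targets, every signal with equal likelihood."""
    rng = random.Random(seed)
    return [rng.choice(signals) for _ in range(n)]


def classify(result_ok, fault_flagged, finished):
    """Map one simulation run to an outcome label (labels are assumptions)."""
    if not finished:
        return "stall"              # never terminated, e.g. no interrupt sent
    if fault_flagged:
        return "detected"           # aborted and reported through fault registers
    return "masked" if result_ok else "silent_corruption"


def correct_termination_rate(outcomes):
    """Fraction of runs that were either masked or detected."""
    counts = Counter(outcomes)
    return (counts["masked"] + counts["detected"]) / len(outcomes)


# Toy example with four hand-written runs, not measured data.
runs = [
    classify(result_ok=True, fault_flagged=False, finished=True),
    classify(result_ok=False, fault_flagged=True, finished=True),
    classify(result_ok=False, fault_flagged=False, finished=True),
    classify(result_ok=True, fault_flagged=True, finished=False),
]
rate = correct_termination_rate(runs)
```

Sampling signals uniformly, as in `pick_injection_targets`, weights small control signals as heavily as wide data buses, which is why control is overstressed relative to a per-bit or per-area fault model.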

Both parts report through the existing ECC fault registers, which software should read, and both abort the operation when a fault is detected in order to avoid stalls.

The redundancy spheres overlap with the ECC Encode / Decode and are independent, i.e. any combination of memory ECC and Redundancy results in a working design, even though some of them might not be reasonable from a fault-tolerance perspective. Non-regression tests have been updated to test all combinations of HW and SW parameters.
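For illustration, the full test matrix can be enumerated as a Cartesian product of the switches. This is only a sketch of the idea, not the actual test setup: `USE_REDUNDANCY` and `redundancy` are the names used above, while `USE_ECC` is a hypothetical placeholder for the memory ECC switch.

```python
from itertools import product

# Hardware (elaboration-time) parameters; USE_ECC is a hypothetical name.
HW_PARAMS = {"USE_REDUNDANCY": (0, 1), "USE_ECC": (0, 1)}
# Software (runtime) settings written to the redmule config.
SW_PARAMS = {"redundancy": (0, 1)}


def all_configs():
    """Return every combination of HW and SW parameters as a list of dicts."""
    names = list(HW_PARAMS) + list(SW_PARAMS)
    values = list(HW_PARAMS.values()) + list(SW_PARAMS.values())
    return [dict(zip(names, combo)) for combo in product(*values)]


configs = all_configs()
```

With three binary switches this yields eight configurations, each of which would be built and run once.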

Currently the soft clear signal in the register file can still cause wrong terminations (a stall where the fault is detected but no interrupt is sent), and a lot of the dependencies are not yet reviewed, so this is a draft PR.

Lynx005F commented 1 month ago

I have now moved it to the refactored branch, where a lot of the intermediate changes are no longer visible. Any further modifications will most likely affect the streamer, and maybe the vulnerability analysis scripts.

Lynx005F commented 1 month ago

This now includes everything fully finished except the deduplicator, which I would like to merge separately.

The deduplicator only improves performance on the memory side; functionality is the same. However, that part seems to be tricky to get right, so waiting for these additional 30 LOC would delay the whole thing by quite a bit.