Open Lynx005F opened 2 months ago
I have now moved this to the refactored branch, where a lot of the intermediate changes are no longer visible. Any further modifications will most likely affect the streamer, and maybe the vulnerability analysis scripts.
This now includes everything that is fully finished except the deduplicator, which I would like to merge separately.
The deduplicator only improves performance on the memory side; functionality is the same. However, that part seems to be tricky to get right, so waiting for these additional 30 LOC would delay the whole thing by quite a bit.
This PR adds redundancy to Redmule:
It always implements basic datapath redundancy, which can be selected in software with redundancy = 1 in the redmule config:
This part has very low data overhead, but only allows for rough redundancy: datapath faults will get detected, but control is generally vulnerable. Some parts of the W-Input datapath are also vulnerable.
It additionally implements control redundancy, which can be enabled with the USE_REDUNDANCY parameter bit. This introduces about 8% area overhead and achieves a high level of fault tolerance. With the included fault injection scripts, injecting on any signal with equal likelihood (i.e. control signals get overstressed compared to faults in the wild, and control is typically harder to protect), it results in a correct termination for 99.99% of injected faults. To put that into perspective, a non-fault-tolerant RedMulE would correctly terminate for about 85% of injected faults due to masking.
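For readers unfamiliar with how such a percentage is derived, here is a minimal sketch of tallying a correct-termination rate from per-run fault injection outcomes. The outcome labels and the function name are hypothetical and are not taken from the scripts in this PR; the sketch assumes a run counts as correct when the fault was either masked or detected and the operation aborted.

```python
from collections import Counter

# Hypothetical outcome labels; the scripts in this PR may classify runs differently.
# A run counts as a correct termination if the fault had no effect (masked)
# or if it was detected and the operation was aborted.
CORRECT_OUTCOMES = {"masked", "detected_and_aborted"}


def correct_termination_rate(outcomes):
    """Return the fraction of injection runs that terminated correctly."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no injection runs recorded")
    correct = sum(counts[label] for label in CORRECT_OUTCOMES)
    return correct / total


if __name__ == "__main__":
    # Toy input for illustration only, not measured data.
    runs = ["masked", "masked", "detected_and_aborted", "stall"]
    print(f"{correct_termination_rate(runs):.2%}")
```

Because every signal is injected with equal likelihood, each run carries the same weight in this tally.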
Both parts report faults through the existing registers for ECC faults, which should be read from software, and will abort an operation if a fault is detected to avoid stalls.
The redundancy spheres overlap with the ECC Encode / Decode and are independent, i.e. any combination of memory ECC and redundancy results in a working design, even though some of them might not be reasonable from a fault-tolerance perspective. Non-regression tests have been updated to test all combinations of HW and SW parameters.
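As a rough illustration of what "all combinations" means here, the sketch below enumerates a test matrix over the switches mentioned above. Only USE_REDUNDANCY and the software redundancy setting are named in this PR; the MEMORY_ECC key and the function name are assumed placeholders, and the actual non-regression tests may be organised differently.

```python
from itertools import product


def build_test_matrix():
    """Enumerate every combination of the HW and SW redundancy switches."""
    switches = {
        "USE_REDUNDANCY": (0, 1),  # HW parameter bit: control redundancy
        "MEMORY_ECC": (0, 1),      # hypothetical name for the memory ECC switch
        "redundancy": (0, 1),      # SW setting in the redmule config
    }
    names = list(switches)
    # product() yields one tuple per combination, in the order of `names`.
    return [dict(zip(names, values)) for values in product(*switches.values())]


if __name__ == "__main__":
    for config in build_test_matrix():
        print(config)
```

With three binary switches this yields eight configurations, each of which should result in a working design.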
Currently the soft clear signal in the register file can still cause wrong terminations (a stall where a fault is detected but no interrupt is sent), and a lot of the dependencies are not yet reviewed, so this is a draft PR.