Kuree opened this issue 4 years ago
For 2., are you using the syntax `tester.circuit.O.expect(value)`? If so, you can try `tester.expect(DUT.O, value)` instead. The `tester.circuit` syntax has some intrinsic performance issues (it uses the `inspect` library); using `tester.expect` instead avoids the `inspect` library calls. I think a longer-term solution is to use a deep embedding with AST introspection rather than the `tester.circuit` interface. Again, that interface relies inherently on the `inspect` library, which is slow. While we could work on optimizing it, I think the effort is better spent on an interface that avoids `inspect` altogether (since even if we optimize with the C API, we'll still pay an inherent performance penalty versus approaches that avoid `inspect`).
1/3 seem related. I think if we can do the expect computation in the test bench, we can avoid the overhead in the Python implementation. While calling Python code in the simulation is probably useful in general, in this case we could probably compile the expect logic to the target. One simple example: generate the input vectors into a binary file, run a C program to compute the expected-output file (this could use the fast native implementation and avoid Python overhead), then use the fault file I/O (like we do in TBG) to simply load the input/expected test vectors into memory and check them. It would be interesting if we could wrap this pattern into a simple Python API.
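The file-based pattern can be sketched in plain Python. Everything here is a hypothetical stand-in: the 16-bit adder plays the role of the fast native (C) reference implementation, and the file names and record layout are made up for illustration.

```python
import struct

# Hypothetical 16-bit adder, standing in for the fast native
# reference implementation mentioned above.
def reference_model(a, b):
    return (a + b) & 0xFFFF

# Step 1: generate the input vectors into a binary file.
inputs = [(i, (i * 3) & 0xFFFF) for i in range(256)]
with open("inputs.bin", "wb") as f:
    for a, b in inputs:
        f.write(struct.pack("<HH", a, b))

# Step 2: compute the expected-output file (in practice this could be
# a compiled C program that reads inputs.bin).
with open("inputs.bin", "rb") as f, open("expected.bin", "wb") as g:
    while chunk := f.read(4):
        a, b = struct.unpack("<HH", chunk)
        g.write(struct.pack("<H", reference_model(a, b)))

# Step 3: the testbench streams both files and checks each vector,
# never holding the whole test set in memory. Here Python stands in
# for poking the DUT and checking its output.
with open("inputs.bin", "rb") as f, open("expected.bin", "rb") as g:
    while chunk := f.read(4):
        a, b = struct.unpack("<HH", chunk)
        (exp,) = struct.unpack("<H", g.read(2))
        assert reference_model(a, b) == exp
```

The key property is that step 3 works one record at a time, which is what the fault file I/O actions would do inside the generated testbench.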
So, two things to try:

1. `tester.expect` rather than `tester.circuit.expect`
2. `tester.poke` rather than `tester.circuit.I = value`

Can we see how much this improves performance? Then we can look into providing a cleaner/simpler API for this pattern.
Looking at https://github.com/Kuree/kratos-dpi/blob/master/tests/test_function.py#L18-L28, it does seem quite nifty. I think we could integrate this into fault by making `fault` aware of these DPI functions and generating the code to call them, e.g. `tester.expect(test.call(<dpi_func>, ...args))`. So we could add a `call` action that accepts a DPI function and generates the code to invoke it.
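As a rough illustration of that idea (the class names and generated code here are hypothetical sketches, not fault's actual action classes), a `call` action could carry a DPI function name plus arguments and emit the SystemVerilog invocation during testbench generation:

```python
# Hypothetical sketch of a "call" action for a DPI function. A real
# implementation would live alongside fault's other actions; these
# names are made up for illustration.
class CallAction:
    def __init__(self, func_name, *args):
        self.func_name = func_name
        self.args = args

    def codegen(self):
        # Emit the SystemVerilog call expression, e.g. "test_add(a, b)".
        arg_list = ", ".join(str(a) for a in self.args)
        return f"{self.func_name}({arg_list})"

class ExpectCallAction:
    """Expect a DUT port to equal the result of a DPI call."""
    def __init__(self, port, call):
        self.port = port
        self.call = call

    def codegen(self):
        return (f"if ({self.port} !== {self.call.codegen()}) "
                f"$error(\"expect failed on {self.port}\");")

action = ExpectCallAction("DUT.O", CallAction("test_add", "DUT.a", "DUT.b"))
print(action.codegen())
```

The point is that the expected value is computed by the simulator at runtime via DPI, so nothing needs to be precomputed in Python at compile time.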
Actually, for 2., I notice now that you're using the `lassen` PE to compute the expected result, so the file-based pattern isn't easy unless we add support for compiling lassen to fast C code. Adding a `tester.call` for the DPI function might be the best approach to avoid having to generate everything ahead of time.
Also, are the input values being generated ahead of time? (I'm guessing so, since the expect values are generated ahead of time.) That might be a related problem: storing all the inputs in memory and scanning through them is inherently less memory efficient than generating the inputs, using them to poke the DUT, computing the expected value, and then discarding them (you only need to keep one set of inputs/outputs in memory rather than an entire test set).
Thinking about this pattern at a higher level, though, it seems like ideally the functional model should not change (except perhaps occasionally when obscure bugs are found). So, pre-computing the expected output vectors and storing them in a file (even if this is done in Python using lassen) may not be a bad idea. If we expect the RTL to change more often than the functional model, reusing the input/output vectors is smarter than re-computing them every time we run the test.
Furthermore, if we want to lift these tests to the tile level or something, we can avoid recomputing the expected output if we just store those. So, we may want to consider this file based approach, but perhaps we can abstract it using a TestVector data structure that is serializable and fault is aware of it (so the user doesn't have to manage the details of handling the binary file).
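A minimal sketch of what such a serializable `TestVector` structure could look like. The class name, field widths, and binary layout are assumptions for illustration, not an existing fault API:

```python
import struct
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical serializable test-vector container: compute the expected
# outputs once (e.g. with the lassen functional model), then reuse the
# file on every RTL change without touching the binary format directly.
@dataclass
class TestVectors:
    width: int                      # bit width of each value (<= 64 here)
    vectors: List[Tuple[int, int]]  # (input, expected_output) pairs

    def save(self, path):
        with open(path, "wb") as f:
            f.write(struct.pack("<II", self.width, len(self.vectors)))
            for i, o in self.vectors:
                f.write(struct.pack("<QQ", i, o))

    @classmethod
    def load(cls, path):
        with open(path, "rb") as f:
            width, n = struct.unpack("<II", f.read(8))
            vecs = [tuple(struct.unpack("<QQ", f.read(16))) for _ in range(n)]
        return cls(width, vecs)

tv = TestVectors(16, [(i, (i * 2) & 0xFFFF) for i in range(8)])
tv.save("pe_vectors.bin")
assert TestVectors.load("pe_vectors.bin") == tv
```

fault being aware of such a structure would let the generated testbench stream the file directly while the user never manages the binary layout.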
I think writing the functional model in Python is way easier than C, and probably faster to implement than C++. The prototype I had uses `pybind` to compile the Python interpreter into the DPI library. You can see the generated C++ code here: https://github.com/Kuree/kratos-dpi/blob/master/tests/gold/generator_func.cc
A couple of thoughts:

- `sys.path`: we should prepare the path to support any user libraries.

Here are the things I think fault needs from kratos to call the compiled DPI function:

- The DPI import declaration (e.g. `import "DPI-C" function shortint test_add(input shortint a, input shortint b);`). We could maybe just add an API to introspect the kratos object to determine the types of the inputs/return value and generate the rest of the declaration.
- The generated C++ source (`generator_func.cc`)

Seems like `dpi_compile` could return an object/dictionary with this information.
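A sketch of the declaration-generation idea: map introspected argument types to SV DPI types and emit the `import "DPI-C"` line. The type table, argument naming, and function are hypothetical, not a kratos or fault API:

```python
import ctypes

# Hypothetical mapping from ctypes argument types to SystemVerilog DPI
# types; a real implementation would introspect the kratos object.
SV_TYPES = {
    ctypes.c_int16: "shortint",
    ctypes.c_int32: "int",
    ctypes.c_int64: "longint",
    ctypes.c_void_p: "chandle",
}

def dpi_import_decl(name, arg_types, ret_type):
    # Name arguments a, b, c, ... purely for illustration.
    args = ", ".join(f"input {SV_TYPES[t]} {chr(ord('a') + i)}"
                     for i, t in enumerate(arg_types))
    return f'import "DPI-C" function {SV_TYPES[ret_type]} {name}({args});'

decl = dpi_import_decl("test_add",
                       [ctypes.c_int16, ctypes.c_int16],
                       ctypes.c_int16)
print(decl)
# Reproduces the example declaration above.
```

`dpi_compile` returning the function name, argument types, and return type would be enough for fault to generate this declaration itself.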
Sounds good. That being said, that project was an experiment to see the potential of DPI-based simulation with Python, so it lacks many capabilities. Here are some enhancements I can think of:

- Support for the `chandle` argument type, which is essentially a raw C pointer. We can use it to represent a Python C object, which can be converted to a Python object natively. Doing so lets a functional model interact seamlessly with the test bench.

Point 1 may be a stretch goal, especially the Python class part. If we require class objects to be pickle-able, I think the implementation can be straightforward. If you think this is a good idea, I can sketch out a detailed proposal and some prototypes. Point 2 requires a solution for persisting Python objects across different DPI calls, which might be resolved by point 1.
Please let me know what you think.
@Kuree, have you had a chance to try whether using `tester.poke(port, value)` avoids the `inspect` overhead (versus `tester.circuit.port = value`)? If you can send me a pointer to the test file, I can give it a try.
No, I haven't. I was caught up with other work. I will try to run it this week.
Another use case for this came up: a stateful functional model in Python that is used to verify the behavior in RTL. This can't be done at compile time since the behavior of the model depends on runtime values (e.g. random ready/valid signals generated at runtime will determine functional model behavior). What we'd like to do is be able to call into a Python functional model at simulation time when an event occurs (e.g. when a ready/valid handshake occurs, call functional model so it updates it state). We would need to be able to have persistent state for functional models during simulation run, but it doesn't necessarily need to be setup/persistent from compile time (although maybe it could be nice).
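A toy sketch of that use case (a hypothetical model, not an existing fault feature): a golden FIFO whose state is updated only when a runtime-generated ready/valid handshake actually fires, which is exactly why its outputs cannot be precomputed at compile time.

```python
import random

# Hypothetical stateful functional model called from the testbench on
# handshake events during simulation.
class FifoModel:
    def __init__(self):
        self.queue = []

    def on_push(self, data, valid, ready):
        # Called when the input-side handshake occurs.
        if valid and ready:
            self.queue.append(data)

    def on_pop(self, valid, ready):
        # Called when the output-side handshake occurs; returns the
        # value the RTL is expected to produce.
        if valid and ready:
            return self.queue.pop(0)
        return None

model = FifoModel()
random.seed(0)
expected = []
for data in range(8):
    # ready is decided at runtime, so the model's state (and thus the
    # expected outputs) depends on values unknown at compile time.
    model.on_push(data, valid=True, ready=random.random() < 0.5)
while model.queue:
    expected.append(model.on_pop(valid=True, ready=True))
```

With DPI-based Python calls, `on_push`/`on_pop` would be invoked by the simulator whenever the corresponding handshake is observed.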
I think I'm doing something wrong here. Below is the new profile result after the change:
The code is in the container `keyi-romantic_ptolemy` on kiwi. To attach to the container, you can do `docker attach keyi-romantic_ptolemy`. Please detach from the container instead of exiting when you're done inspecting.
To run the code, simply do:

```
python -m cProfile -o profile.stat tests/test_pe.py
```
Just following up on the conversation about native support for Python models: I started implementing a new Python-to-DPI library that's designed to be framework-independent. I'm still refactoring the interface to make it easier to use, but the core implementation is there. https://github.com/Kuree/pysv
What is working now:
You can see the examples here in tests:
I'm working on generating SV and C++ class wrappers to allow users to use it directly without dealing with mangled function names. I'd appreciate it if you can help me with the integration:
Hmm, looks like there's some use of `inspect` that I forgot was added: https://github.com/leonardt/fault/blame/4ea7b511181aa857a5d85862dc93d8a3b59ac9a0/fault/tester/base.py#L152. @sgherbst, was this added as part of your flow? I'll need to dig into where it's used, but I'm guessing we can add a flag to disable it for performance.
@leonardt that would be totally fine; it's just for debugging purposes. The traceback is mainly helpful for SPICE simulations, where `Expect` actions are implemented in post-processing.
Makes sense; we can maybe disable it by default and add a "debug" flag to enable it? Also, it looks like it's using `inspect.stack` to get the calling frame. We could optimize this to use `inspect.currentframe` and the frame's back pointer, which is a little more efficient (it fetches just the frame you want rather than materializing the entire stack as a list).
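The difference can be seen in a small sketch: `inspect.stack()` builds a `FrameInfo` record (including source context) for every frame on the stack, while `inspect.currentframe().f_back` just follows one pointer to the caller.

```python
import inspect
import timeit

def caller_via_stack():
    # Materializes the entire stack, then picks out the caller's frame.
    return inspect.stack()[1].frame

def caller_via_frame():
    # Fetches only the current frame and follows its back pointer.
    return inspect.currentframe().f_back

def demo():
    slow = caller_via_stack()
    fast = caller_via_frame()
    assert slow is fast  # both identify the same calling frame
    return slow.f_code.co_name

name = demo()  # "demo": both approaches found demo's frame

# Rough cost comparison; inspect.stack() also reads source files for
# context, so it is dramatically slower per call.
t_stack = timeit.timeit(lambda: inspect.stack(), number=200)
t_frame = timeit.timeit(lambda: inspect.currentframe(), number=200)
```

This is why gating the traceback capture behind a debug flag removes a real per-action cost from the default path.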
Great - it's OK with me if the feature is disabled by default.
https://github.com/leonardt/fault/pull/288 removes the inspect logic from the default code path, can you try this out and see if it improves performance (mainly avoids the inspect calls).
Here is the result after using #288
When I ran an exhaustive test on floating point, I noticed a performance issue with the fault-generated testbench. The test-vector count is 0x500 * 0x500 = 0x190000, about 1.6 million data points. Here is the profiling result:

The entire run takes 22k seconds, about 6 hours. As a comparison, a complete exhaustive test using SystemVerilog + DPI takes only 10 hours to finish (4 billion test points), so the performance gap is orders of magnitude (~2000x).

Notice that the RTL simulation takes about 12k seconds to run. This is due to the sheer size of the generated testbench code, which is 408M.
I think there are several places where it can be improved:

1. The `hwtypes` computation. The native C implementation is about 100x faster. I looked at the actual float implementation and it is very inefficient: it converts to and from the native gmp object via strings. I think native conversion is available?
2. Generating the testbench with the `kratos` package. I can give you a pointer on how to use it if you're interested. On my benchmark it is about 100-500x faster than doing the same thing in Python.

The real question is whether the inefficient testbench is an intrinsic problem in fault. My understanding is that all the expect values have to be known at compile time, which makes the testbench unscalable when there are lots of test vectors. One way to solve that is to allow the simulator to call Python code during simulation. I have a prototype working: https://github.com/Kuree/kratos-dpi/blob/master/tests/test_function.py#L18-L28 It works with both Verilator and commercial simulators.
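On the first point, the string round-trip could in principle be replaced by a direct bit-level conversion. A sketch in plain Python using `struct` for single precision (how this maps onto `hwtypes`/gmp internals is an assumption on my part):

```python
import struct

# Native conversion between a Python float and its IEEE-754
# single-precision bit pattern, with no string intermediate.
def float_to_bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits))[0]

bits = float_to_bits(1.0)
assert bits == 0x3F800000      # 1.0 in IEEE-754 single precision
assert bits_to_float(bits) == 1.0
```

The same pattern works for doubles with `"<d"`/`"<Q"`; either way it avoids formatting and re-parsing a string on every test vector.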
Please let me know what you think.