leonardt / fault

A Python package for testing hardware (part of the magma ecosystem)
BSD 3-Clause "New" or "Revised" License

Extremely low performance for large test vectors #280

Open Kuree opened 4 years ago

Kuree commented 4 years ago

When I ran an exhaustive test on floating point, I noticed a performance issue with the fault-generated testbench. The test vector size is 0x500 * 0x500 = 0x190000, about 1.6 million data points. Here is the profiling result: [profiling screenshot]

The entire run takes 22k seconds, about 6 hours. As a comparison, a complete exhaustive test using SystemVerilog + DPI takes only 10 hours to finish (4 billion test points). So the performance gap is orders of magnitude (roughly 1500x per test point).

Notice that the RTL simulation alone takes about 12k seconds. This is due to the sheer size of the generated testbench code, which is 408 MB.

I think there are several places where it can be improved:

  1. hwtypes computation. The native C implementation is about 100x faster. I looked at the actual float implementation and it is quite inefficient:
    exp = 1 - bias
    s = ['-0.' if sign[0] else '0.', mantissa.binary_string()]
    s.append('e')
    s.append(str(exp))
    return cls(gmpy2.mpfr(''.join(s),cls.mantissa_size+1, 2))

    It converts to and from a native gmp object via a string round-trip. I think a native conversion is available?
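
A minimal sketch of the arithmetic alternative (plain Python floats for illustration; `subnormal_value` and its argument names are hypothetical, and a real fix would do the equivalent scaling with gmpy2 at the target precision):

```python
import math

def subnormal_value(sign, mantissa_int, bias, mantissa_size):
    # 0.mantissa * 2**(1 - bias)  ==  mantissa_int * 2**(1 - bias - mantissa_size),
    # computed directly with ldexp instead of formatting/parsing a base-2 string
    val = math.ldexp(mantissa_int, 1 - bias - mantissa_size)
    return -val if sign else val

# e.g. a bfloat16-style subnormal (bias=127, mantissa_size=7)
print(subnormal_value(0, 1, 127, 7))  # smallest positive subnormal, 2**-133
```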

  2. Getting frame information, in particular the filename and line number. I had the same performance issue before and was able to resolve it by implementing the same logic with the native Python C API. It is included in the kratos package. I can give you a pointer on how to use it if you're interested. On my benchmark it is about 100-500x faster than doing the same thing in Python.

The real question is whether the inefficient testbench is an intrinsic problem in fault. My understanding is that all the expect values have to be known at compile time, which makes the testbench unscalable when there are lots of test vectors. One way to solve that is to allow the simulator to call Python code during simulation. I have a prototype working: https://github.com/Kuree/kratos-dpi/blob/master/tests/test_function.py#L18-L28 It works with both Verilator and commercial simulators.

Please let me know what you think.

leonardt commented 4 years ago

For 2., are you using the syntax tester.circuit.O.expect(value)? If so, you can try tester.expect(DUT.O, value) instead. The tester.circuit syntax has some intrinsic performance issues (it uses the inspect library); using tester.expect instead avoids the inspect calls. I think a longer-term solution is to use a deep embedding with AST introspection rather than the tester.circuit interface. Again, that interface relies inherently on the inspect library, which is slow. While we can work on optimizing it, I think the effort is better spent on an interface that avoids inspect altogether (even if we optimize with the C API, we'll still pay an inherent performance penalty versus approaches that avoid inspect).

1/3 seem related. I think if we can do the expect computation in the testbench, we can avoid the overhead of the Python implementation. While calling Python code during simulation is probably useful in general, in this case we could probably compile the expect logic to the target. One simple approach is to generate the input vectors into a binary file, run a C program to compute the expected output file (this could use the fast native implementation and avoid Python overhead), then use the fault file i/o (like we do in TBG) to simply load the input/expected test vectors into memory and check them. It would be interesting if we could wrap this pattern in a simple Python API.
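
A rough sketch of that pipeline, with plain Python standing in for both the fast native model and fault's file i/o (all names here are illustrative, not fault API):

```python
import os
import struct
import tempfile

def write_u32(path, values):
    # pack the vectors as little-endian 32-bit words
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}I", *values))

def read_u32(path):
    with open(path, "rb") as f:
        data = f.read()
    return list(struct.unpack(f"<{len(data) // 4}I", data))

tmp = tempfile.mkdtemp()
inputs_path = os.path.join(tmp, "inputs.bin")
expected_path = os.path.join(tmp, "expected.bin")

# 1. generate the input vectors into a binary file
inputs = list(range(16))
write_u32(inputs_path, inputs)

# 2. a fast native program would compute the expected outputs; x + 1 is a stand-in
write_u32(expected_path, [(x + 1) & 0xFFFFFFFF for x in read_u32(inputs_path)])

# 3. the testbench loads both files and checks the DUT against them
print(read_u32(expected_path)[:4])  # [1, 2, 3, 4]
```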

So two things to try:

Can we see how much this improves performance? Then we can look into providing a cleaner/simpler API for this pattern.

leonardt commented 4 years ago

Looking at https://github.com/Kuree/kratos-dpi/blob/master/tests/test_function.py#L18-L28, it does seem quite nifty. I think we could simply integrate this into fault by making fault aware of these DPI functions and generating the code to call them?

leonardt commented 4 years ago

e.g. tester.expect(tester.call(<dpi_func>, ...args)) — we could add a call action that accepts a DPI function and generates the code to invoke it
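
A sketch of what such an action might look like internally (hypothetical: Call is not an existing fault action, and real code generation would depend on the backend):

```python
class Call:
    """Records a DPI function invocation so a backend can emit the SV call."""

    def __init__(self, dpi_func_name, *args):
        self.dpi_func_name = dpi_func_name
        self.args = args

    def emit_sv(self):
        # render something like: my_dpi_add(a, b)
        return f"{self.dpi_func_name}({', '.join(str(a) for a in self.args)})"

print(Call("my_dpi_add", "a", "b").emit_sv())  # my_dpi_add(a, b)
```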

leonardt commented 4 years ago

Actually, for 2., I noticed now that you're using the lassen PE to compute the expected result, so the file-based pattern isn't easy unless we add support for compiling lassen to fast C code. Adding a tester.call for the DPI function might be the best approach to avoid having to generate everything ahead of time.

leonardt commented 4 years ago

Also, are input values being generated ahead of time? (I'm guessing so, since the expect values are.) That might be a related problem: having to store all the inputs in memory and scan through them is inherently less memory efficient than generating an input, using it to poke the DUT, computing the expected value, and then discarding it (you only need to keep one set of inputs/outputs in memory rather than an entire test set).
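
The streaming alternative could look like this sketch, where a generator yields one stimulus at a time so only the current input/expected pair lives in memory (the model is an illustrative stand-in):

```python
def stimulus(n):
    # yields inputs lazily instead of materializing the full cross product
    for a in range(n):
        for b in range(n):
            yield a, b

def model(a, b):
    return a + b  # stand-in for the real functional model

checked = 0
for a, b in stimulus(4):
    expected = model(a, b)
    # tester.poke(...) / tester.eval() / tester.expect(...) would go here
    checked += 1
print(checked)  # 16
```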

leonardt commented 4 years ago

Thinking about this pattern at a higher level, though: ideally the functional model should not change (except maybe every so often when obscure bugs are found). So pre-computing the expected output vectors and storing them in a file (even if this is done in Python using lassen) may not be a bad idea. If we expect the RTL to change more often than the functional model, reusing the input/output vectors is smarter than recomputing them every time we run the test.

Furthermore, if we want to lift these tests to the tile level or something, we can avoid recomputing the expected outputs if we just store them. So we may want to consider this file-based approach, but perhaps we can abstract it with a serializable TestVector data structure that fault is aware of (so the user doesn't have to manage the details of handling the binary file).

Kuree commented 4 years ago

I think writing a functional model in Python is way easier than in C, and probably faster to implement than in C++. The prototype I had does the following:

  1. Take the Python function and embed it into a C++ function call via pybind
  2. Use pybind to embed the Python interpreter in the DPI library

You can see the generated C++ code here: https://github.com/Kuree/kratos-dpi/blob/master/tests/gold/generator_func.cc

A couple of thoughts:

  1. External libraries are supported, but they are looked up via sys.path. We should prepare the path to support any user libraries.
  2. DPI also allows passing pointers, so we could even allow class method calls by properly casting pointer types in the generated wrapper code.

leonardt commented 4 years ago

Here are the things I think fault needs from kratos to call the compiled DPI function:

Seems like dpi_compile could return an object/dictionary with this information

Kuree commented 4 years ago

Sounds good. That being said, that project was an experiment to see the potential of DPI-based simulation with Python, so it lacks many capabilities.

Here are some enhancements I can think of:

  1. SV DPI supports the chandle argument type, which is essentially a raw C pointer. We can use that to represent a Python C object, which can be converted back to a Python object natively. Doing so allows a functional model to interact seamlessly with the testbench.
  2. Some complex assertion logic can be handled on the Python side as well. This may allow Verilator and SV simulators to share the same assertion backend.
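
One common way to realize the chandle idea in point 1 is an object registry on the Python side, so the simulator only ever holds an opaque integer handle (a sketch with hypothetical names):

```python
# registry mapping integer handles to live Python objects; the simulator
# passes the handle back on each DPI call instead of holding a real pointer
_objects = {}
_next_handle = 0

def create_object(factory):
    global _next_handle
    _next_handle += 1
    _objects[_next_handle] = factory()
    return _next_handle

def call_method(handle, name, *args):
    # invoked from a DPI wrapper with the handle the simulator stored
    return getattr(_objects[handle], name)(*args)

class Counter:
    """Toy stateful functional model."""
    def __init__(self):
        self.count = 0
    def incr(self, n):
        self.count += n
        return self.count

h = create_object(Counter)
print(call_method(h, "incr", 5))  # 5
```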

Point 1 may be a stretch goal, especially the Python class part. If we require class objects to be pickle-able, I think the implementation can be straightforward. If you think this is a good idea, I can sketch out a detailed proposal and some prototypes.

Point 2 requires a solution for persisting Python objects across different DPI calls, which might be resolved by point 1.

Please let me know what you think.

leonardt commented 4 years ago

  1. Why does it need to be pickle-able? Is it because you want to pickle the object at testbench generation time, then load it at simulation run time? Or does the object need to be serialized/deserialized during execution?
  2. Seems related to 1, as you said. Serializing/deserializing objects might be expensive. Is this because each invocation of a Python DPI function spins up a fresh interpreter instance? In general it should be possible to keep one active interpreter running and dispatch execution to it, but perhaps there's some reason this isn't possible with the current prototype? I've done this in the past with this prototype code: https://github.com/leonardt/pycoreir/blob/master/libcoreir-python/coreir-python.cpp but it uses the raw C API rather than pybind, so I'm not sure if it's more difficult there. The GIL is a bit tricky to work with when doing this, though.
Kuree commented 4 years ago

  1. It's the former. I want to load the object during simulation. Since at that point we have lost the Python objects, we need a way to reconstruct the functional model. A purely functional DPI call doesn't have this problem since it's stateless, but I doubt any complex functional model can be expressed that way.
  2. Each Python DPI call doesn't spin up a fresh interpreter instance, if done carefully. I will take a closer look at pybind, but I think it's very doable. If we can guarantee that only one thread can invoke Python execution, we should be able to work around the GIL.
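
The reconstruction flow from point 1 could be as simple as pickling the model at testbench-generation time and unpickling it inside the first DPI call (a sketch; class and variable names are illustrative):

```python
import pickle

class FunctionalModel:
    """Toy stateful model standing in for something like a PE model."""
    def __init__(self):
        self.acc = 0
    def step(self, x):
        self.acc += x
        return self.acc

# testbench-generation time: serialize the configured model
blob = pickle.dumps(FunctionalModel())

# simulation time (inside the DPI call): reconstruct and use it
model = pickle.loads(blob)
print(model.step(3), model.step(4))  # 3 7
```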
leonardt commented 4 years ago

@Kuree, have you had a chance to try whether using tester.poke(port, value) avoids the inspect overhead (versus tester.circuit.port = value)? If you send me a pointer to the test file, I can give it a try.

Kuree commented 4 years ago

No, I haven't. I was caught up in other work. I will try to run it this week.

leonardt commented 4 years ago

Another use case for this came up: a stateful functional model in Python used to verify behavior in RTL. This can't be done at compile time since the model's behavior depends on runtime values (e.g. random ready/valid signals generated at runtime determine what the functional model does). What we'd like is to call into a Python functional model at simulation time when an event occurs (e.g. when a ready/valid handshake occurs, call the functional model so it updates its state). We would need persistent state for functional models during the simulation run, but it doesn't necessarily need to be set up at compile time and persist from there (although maybe that would be nice).

Kuree commented 4 years ago

I think I'm doing something wrong here. Below is the new profiling result after the change: [profiling screenshot]

The code is in the container keyi-romantic_ptolemy on kiwi. To attach to it, run docker attach keyi-romantic_ptolemy. Please detach from the container instead of exiting when you're done inspecting. To run the code, simply do:

python -m cProfile -o profile.stat tests/test_pe.py
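
The saved profile.stat can then be inspected with pstats from the standard library (shown here on a freshly generated profile so the snippet is self-contained):

```python
import cProfile
import pstats

# profile a toy workload the same way `python -m cProfile -o profile.stat ...` would
cProfile.run("sum(x * x for x in range(10000))", "profile.stat")

stats = pstats.Stats("profile.stat")
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
```
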
Kuree commented 4 years ago

Just following up on the conversation about native support for Python models. I started implementing a new Python-to-DPI library that's designed to be framework-independent. I'm still refactoring the interface to make it easier to use, but the core implementation is there: https://github.com/Kuree/pysv

What is working now:

You can see the examples here in tests:

I'm working on generating SV and C++ class wrappers to allow users to use it directly without dealing with mangled function names. I'd appreciate it if you could help me with the integration:

leonardt commented 4 years ago

Hmm, looks like there's some use of inspect that I forgot was added https://github.com/leonardt/fault/blame/4ea7b511181aa857a5d85862dc93d8a3b59ac9a0/fault/tester/base.py#L152, @sgherbst, was this used as part of your flow? I'll need to dig into where it's used, but I'm guessing we can add a flag to disable it for performance.

sgherbst commented 4 years ago

@leonardt that would be totally fine; it's just for debugging purposes. The traceback is mainly helpful for SPICE simulations, where Expect actions are implemented in post-processing.

leonardt commented 4 years ago

Makes sense; maybe we disable it by default and add a "debug" flag to enable it? Also, it looks like it's using inspect.stack to get the calling frame. We can optimize this to use inspect.currentframe and the frame's back pointer, which is a bit more efficient (it fetches just the frame you want rather than building the entire stack into a list).
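
The difference between the two approaches, sketched: inspect.stack materializes FrameInfo records for the whole call stack, while currentframe plus f_back touches only the frame you need.

```python
import inspect

def caller_info_slow():
    # builds a FrameInfo list for the entire stack, then takes one entry
    frame = inspect.stack()[1]
    return frame.filename, frame.lineno

def caller_info_fast():
    # grab just this frame and follow its back pointer to the caller
    frame = inspect.currentframe().f_back
    return frame.f_code.co_filename, frame.f_lineno

def demo():
    # both report the same caller file; the fast path skips the list build
    return caller_info_slow()[0] == caller_info_fast()[0]

print(demo())  # True
```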

sgherbst commented 4 years ago

Great - it's OK with me if the feature is disabled by default.

leonardt commented 4 years ago

https://github.com/leonardt/fault/pull/288 removes the inspect logic from the default code path. Can you try it out and see if it improves performance? (It mainly avoids the inspect calls.)

Kuree commented 4 years ago

Here is the result after using #288: [profiling screenshot]