ros2 / rosidl_python

rosidl support for Python
Apache License 2.0
19 stars 45 forks source link

Deserialization speedup proposal #192

Open almaslov opened 1 year ago

almaslov commented 1 year ago

Feature request

Hello! As already mentioned in other discussions python implementation has issues in serialization and deserialization performance(e.g. here). I focus on deserialization here. I've tried some fixes on galactic branch and they look very promising for me: depending on message structure and size it's about 3-20 times faster. x3 for simple messages, up to x20 for complex types with high nesting level and array fields.

Feature description

The main idea is to avoid pure python code in message deserialization. To achieve it, every message is split into C extension base class and pure python derived class. C extension class contains all message fields, so deserialization method initializes C structure members without calling python methods. On the other hand pure python class contains all dunder methods and assertions.

Implementation considerations

Here is an example of how generated files look like for simple message file Example.msg:

_example_base.c extension module defines base class containing members of native types in lines 10-25.

Python module _example.py is the same as in upstream except for base class(line 62) and microoptimization of assertion in __init__ method(line 95).

Support extension module _example_s.c now uses specific subtype of PyObject(line 147), calls __new__ instead of __init__ to prevent pure python calls(line 170), directly initializes C typed fields of message(lines 188, 191) and creates nested collections (line 193-212).

Benchmark for this message which runs deserialization routine 1 million times gives 28 seconds vs 6,5 seconds

In case of more complex message like TestArrayComplex (from this issue) with 100 nested TestElement elements, the same benchmark gives 57,6 seconds vs 5,5 seconds for 50K deserializations.

I run benchmark with python3.8, ROS galactic, ubuntu 20.04, AMD 2,85GHz.

Pros:

Cons:

I'm interested if this solution worth porting from my local galactic branch to upstream. Or may be it violates some ROS architecture principles.

Voldivh commented 1 year ago

Hi @almaslov , thanks for doing this effort and contribution and taking the time to explain your solution/feature to this issue. Would it be possible for you to create a PR so we can review it a little bit in more detail and possibly merge a solution upstream?

almaslov commented 1 year ago

I was too optimistic about

public interface of message is not changed

It is not changed indeed, but costs me a questionable workaround. There are two inconsistent points:

Workaround is described in comment to metaclass new method.

In addition to the above I patched current implementation instead of making side one.

almaslov commented 1 year ago

I've tried the same approach for serialization and unexpectedly performance boost is quite large too. In case of the same message TestArrayComplex it is 16,5 seconds vs 1,4 seconds for 50K serializations.

Patch is in PR already.

Voldivh commented 1 year ago

@almaslov thanks for pushing this! I'll be taking a look 👍