Open misl6 opened 1 month ago
@DexerBR (with no hurry) as you are the hero who recently touched these lines, do you have any suggestion on how to improve this specific area?
Good catch @misl6! I can also confirm the difference in performance you mentioned.
I think an alternative implementation using Cython objects can significantly improve the performance of this code. Take, for example, the changes marked with `# NOTE: alternative code` in the snippet below.
From my tests, the difference in performance between using this alternative implementation and not using it seems to be almost imperceptible (feel free to test it), unlike the performance difference you mentioned (current implementation using Python objects).
```python
# NOTE: alternative code
cdef struct ListPoints:
    float x
    float y


cdef class Ellipse(Rectangle):
    '''A 2D ellipse.

    :Parameters:
        `segments`: int, the default value is calculated from the range
            between angle.
            Define how many segments are needed for drawing the ellipse.
            The ellipse drawing will be smoother if you have many segments,
            however you can also use this property to create polygons with
            3 or more sides.
        `angle_start`: float, defaults to 0.0
            Specifies the starting angle, in degrees, of the disk portion.
        `angle_end`: float, defaults to 360.0
            Specifies the ending angle, in degrees, of the disk portion.

    .. versionchanged:: 1.0.7
        Added angle_start and angle_end.

    .. versionchanged:: 2.2.0
        The default number of segments is no longer 180, it is now calculated
        according to the angle range, as this is a more efficient approach.
    '''

    cdef int _segments
    cdef float _angle_start
    cdef float _angle_end
    # NOTE: alternative code
    cdef ListPoints* _list_points

    def __init__(self, *args, **kwargs):
        Rectangle.__init__(self, **kwargs)
        self.batch.set_mode('triangle_fan')
        self._segments = kwargs.get('segments') or 0
        self._angle_start = kwargs.get('angle_start') or 0.0
        self._angle_end = kwargs.get('angle_end') or 360.0

    cdef void build(self):
        cdef float *tc = self._tex_coords
        cdef int i, angle_dir
        cdef double angle_start, angle_end, angle_range
        cdef double x, y, angle, rx, ry, ttx, tty, tx, ty, tw, th
        cdef double cx, cy, tangential_factor, radial_factor, fx, fy
        cdef vertex_t *vertices = NULL
        cdef unsigned short *indices = NULL
        cdef int segments = self._segments

        # reset points
        # self._points = []

        if self.w == 0 or self.h == 0:
            return

        if segments == 0 or segments < 3:
            if segments != 0:
                Logger.warning('Ellipse: A minimum of 3 segments is required. '
                               'The default value will be used instead.')
            segments = max(1, int(abs(self._angle_end - self._angle_start) / 2))

        # NOTE: alternative code
        free(self._list_points)
        self._list_points = <ListPoints *>malloc(
            (segments + 1) * sizeof(ListPoints))
        if self._list_points == NULL:
            raise MemoryError("Failed to allocate memory for points")

        tx = tc[0]
        ty = tc[1]
        tw = tc[4] - tx
        th = tc[5] - ty
        angle = 0.0
        rx = 0.5 * self.w
        ry = 0.5 * self.h

        vertices = <vertex_t *>malloc((segments + 2) * sizeof(vertex_t))
        if vertices == NULL:
            raise MemoryError('vertices')

        indices = <unsigned short *>malloc(
            (segments + 2) * sizeof(unsigned short))
        if indices == NULL:
            free(vertices)
            raise MemoryError('indices')

        # calculate the start/end angle in radians, and adapt the range
        if self._angle_end > self._angle_start:
            angle_dir = 1
        else:
            angle_dir = -1

        # rad = deg * (pi / 180), where pi / 180 = 0.0174...
        angle_start = self._angle_start * 0.017453292519943295
        angle_end = self._angle_end * 0.017453292519943295
        angle_range = -1 * (angle_end - angle_start) / segments

        # add start vertex in the middle
        x = self.x + rx
        y = self.y + ry
        ttx = ((x - self.x) / self.w) * tw + tx
        tty = ((y - self.y) / self.h) * th + ty
        vertices[0].x = <float>(self.x + rx)
        vertices[0].y = <float>(self.y + ry)
        vertices[0].s0 = <float>ttx
        vertices[0].t0 = <float>tty
        indices[0] = 0

        # super fast ellipse drawing
        # credit goes to: http://slabode.exofire.net/circle_draw.shtml
        tangential_factor = tan(angle_range)
        radial_factor = cos(angle_range)

        # Calculate the coordinates for a circle with radius 0.5 about
        # the point (0.5, 0.5). Only stretch to an ellipse later.
        cx = 0.5
        cy = 0.5
        r = 0.5
        x = r * sin(angle_start)
        y = r * cos(angle_start)

        for i in range(1, segments + 2):
            ttx = (cx + x) * tw + tx
            tty = (cy + y) * th + ty
            real_x = self.x + (cx + x) * self.w
            real_y = self.y + (cy + y) * self.h
            vertices[i].x = <float>real_x
            vertices[i].y = <float>real_y
            vertices[i].s0 = <float>ttx
            vertices[i].t0 = <float>tty
            indices[i] = i

            fx = -y
            fy = x
            x += fx * tangential_factor
            y += fy * tangential_factor
            x *= radial_factor
            y *= radial_factor

            # self._points.extend([real_x, real_y])
            # NOTE: alternative code
            self._list_points[i - 1].x = <float>real_x
            self._list_points[i - 1].y = <float>real_y

        self.batch.set_data(vertices, segments + 2, indices, segments + 2)
        free(vertices)
        free(indices)
```
All code related to the `Smooth*` and `AntialiasingLine` instructions would have to be updated to take the new structure into account; as a consequence, an improvement in performance could be noticed!

I did not try it at runtime, but at first glance it looks faster.
What about doing the following instead (it would fix both this issue and possibly other performance issues caused by the memory allocation, like the one discussed in #8662):

- Let's `malloc` vertices and indices when the number of segments changes (or anything else that requires changing the vertices and indices count).
- Let's `free` vertices and indices only when the `Ellipse` is de-allocated (via `__dealloc__`), or when we need to update the vertices or indices size.
- Let's use `vertices` to provide the points for the `Smooth*` implementations (no extra values stored anywhere).

How does this plan sound?
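The allocate-only-when-the-size-changes part of this plan can be sketched in pure Python (a hypothetical stand-in; the real Kivy code would use `malloc`/`free` in Cython, and `FanGeometry` and its method names are invented for illustration):

```python
# Illustrative sketch: cache vertex/index buffers and reallocate only when
# the segment count changes, instead of on every build().

class FanGeometry:
    """Caches vertex/index buffers, reallocating only on a size change."""

    def __init__(self):
        self._vertices = None
        self._indices = None
        self._capacity = -1
        self.alloc_count = 0  # instrumentation: how many (re)allocations

    def _ensure_capacity(self, segments):
        count = segments + 2  # fan center + ring vertices + closing vertex
        if count != self._capacity:
            # the only place the buffers are (re)created
            self._vertices = [0.0] * (count * 4)  # x, y, s0, t0 per vertex
            self._indices = list(range(count))
            self._capacity = count
            self.alloc_count += 1

    def build(self, segments):
        self._ensure_capacity(segments)
        # ... fill self._vertices in place, then upload to the batch ...
        return self._vertices, self._indices


geo = FanGeometry()
geo.build(64)
geo.build(64)   # same segment count: buffers are reused, no reallocation
geo.build(128)  # size changed: one reallocation
assert geo.alloc_count == 2
```

With `free` moved into `__dealloc__`, repeated rebuilds at a fixed segment count would perform zero allocations.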
> - Let's `malloc` vertices and indices when the number of segments changes (or anything else that requires changing the vertices and indices count)
> - Let's `free` vertices and indices only when the `Ellipse` is de-allocated (via `__dealloc__`), or when we need to update the vertices or indices size.
I find this approach interesting, although the impact on memory needs to be measured to be sure that consumption would not increase significantly for apps that use many primitives. Memory consumption will obviously increase; we would need a metric to decide to what extent that increase is acceptable, and then measure whether the performance improvement is worth it.
Another possibility would be an option that lets the user choose the trade-off when rendering primitives, for example:

1. more memory consumption and better performance (hypothetically)
2. lower memory consumption and lower performance (hypothetically)
> - Let's use `vertices` to have points for the `Smooth*` implementations (no extra values stored anywhere).
Interesting; however, there would have to be a mapping defining which points are on the outside of the primitive. For example, when drawing a full ellipse (360 degrees), the interior vertices are discarded. The same logic applies to other primitives. Nothing that seems too complex, but it needs to be taken into account when "re-using" vertices to draw `AntialiasingLine`.
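For the triangle-fan case, the mapping is simple enough to sketch (illustrative only; `fan_perimeter` and the flat `[x, y, s0, t0]` layout are assumptions, not Kivy's actual internal format):

```python
# Illustrative sketch: for a triangle-fan ellipse the first vertex is the
# center, so the perimeter points an outline instruction would need are
# simply every vertex except index 0.

def fan_perimeter(vertices, stride=4):
    """Extract (x, y) perimeter points from a flat triangle-fan vertex array.

    ``vertices`` is assumed to be laid out [x, y, s0, t0, x, y, s0, t0, ...]
    with the fan center stored first.
    """
    points = []
    for offset in range(stride, len(vertices), stride):  # skip the center
        points.append((vertices[offset], vertices[offset + 1]))
    return points


# center at (0, 0), then three ring vertices
flat = [0.0, 0.0, 0.5, 0.5,
        1.0, 0.0, 1.0, 0.5,
        0.0, 1.0, 0.5, 1.0,
        -1.0, 0.0, 0.0, 0.5]
assert fan_perimeter(flat) == [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
```

Other primitives would need their own mapping, but each is just a fixed rule over the vertex indices.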
Another option would be to keep the memory allocated until it is consumed by the `Smooth*` instructions, and then free it. For non-`Smooth*` instructions, the memory would be deallocated in another way.
> I mean, memory consumption will increase, but we would have to have a metric to define to what extent this increase would be acceptable.
Yeah, I agree. Better to fix the `_points` performance issue first, and then later check whether we need to try other performance boosts.
> Another option would also be to keep the memory allocated until it is used by the Smooth instructions, and then it would be freed. For non-Smooth instructions, the memory would be deallocated in another way.
I like this approach, I guess we can do the following:

- Create a method that allocates the memory for `vertices` and `indices` (move out from `build`)
- Create a method that calculates the `vertices` and `indices` (move out from `build`)
- Create a method that de-allocates the memory for `vertices` and `indices` (move out from `build`)
- Have a flag that decides whether or not to automatically de-allocate the memory after `build`, and free it manually for `Smooth*` implementations by calling the `_vertices_indices_dealloc()` method.

Do you think it can work?
Please remember we're in `3.0.0`, so if we need to break things to improve things, I'm all-in.
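The four-part split above can be sketched in plain Python (hypothetical: only `_vertices_indices_dealloc()` comes from the comment; every other name, and the list-based buffers standing in for `malloc`'d memory, are invented for illustration):

```python
# Illustrative sketch of the proposed split: separate alloc / compute /
# dealloc methods plus a flag controlling automatic deallocation.

class Instruction:
    def __init__(self, auto_dealloc=True):
        # Smooth* implementations would set this to False and free manually
        self.auto_dealloc = auto_dealloc
        self._vertices = None
        self._indices = None

    def _vertices_indices_alloc(self, count):
        self._vertices = [0.0] * (count * 4)  # x, y, s0, t0 per vertex
        self._indices = list(range(count))

    def _compute_vertices_indices(self):
        # placeholder for the actual geometry math
        for i in range(0, len(self._vertices), 4):
            self._vertices[i] = float(i)

    def _vertices_indices_dealloc(self):
        self._vertices = None
        self._indices = None

    def build(self, count):
        self._vertices_indices_alloc(count)
        self._compute_vertices_indices()
        # ... upload to the batch here ...
        if self.auto_dealloc:
            self._vertices_indices_dealloc()


plain = Instruction()
plain.build(8)
assert plain._vertices is None  # freed right after build

smooth = Instruction(auto_dealloc=False)
smooth.build(8)
assert smooth._vertices is not None  # kept alive for Smooth* to consume
smooth._vertices_indices_dealloc()
assert smooth._vertices is None
```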
Yes, I think it could work! After implementing this it will be very simple to test the persistent allocation system (during the life cycle of the instruction).
Yep, separating things into smaller methods makes (almost) everything simpler.
Marking it as "Ready" as we now have a plan.
Hello guys,
I've been exploring ways to enhance the performance of the graphics vertex instructions module, particularly for operations involving large sets of points. By leveraging numpy arrays instead of standard Python lists, we can achieve significant improvements in both memory usage and processing speed.
This change primarily affects the way we store and manipulate point data in vertex classes, allowing us to utilize numpy's efficient array operations. Here's an example modification for the `Triangle` class, which demonstrates these benefits.
I believe this enhancement would be beneficial for Kivy users who deal with complex graphical applications and would appreciate your consideration for merging these improvements.
- **Enhanced Performance:** Numpy arrays provide faster operations due to their optimized numerical computation capabilities, especially useful when dealing with large sets of data.
- **Memory Efficiency:** Numpy arrays use less memory and offer contiguous memory allocation, improving cache efficiency and reducing overhead.
- **Mathematical Operations:** Numpy simplifies complex mathematical transformations and calculations, allowing for efficient vectorized operations.
- **Scalability:** The ability to handle high-dimensional data efficiently makes numpy highly scalable for complex or large graphics tasks.
- **Integration:** Using numpy allows for seamless integration with a wide range of scientific and analytical libraries, enhancing the functionality and capabilities of Kivy applications.
Best regards, Túlio Gomes Vuolo
Example:
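(The original example was not captured in this thread; below is a minimal illustrative sketch of the general idea, storing point data in a contiguous NumPy array instead of a Python list. The `Triangle` class and its methods here are hypothetical, not Kivy's actual implementation.)

```python
# Hypothetical sketch: a Triangle storing its points in one contiguous
# float32 NumPy buffer instead of a Python list of floats.
import numpy as np


class Triangle:
    def __init__(self, points):
        # one contiguous float32 buffer instead of a list of Python objects
        self._points = np.asarray(points, dtype=np.float32)
        if self._points.shape != (6,):
            raise ValueError("Triangle needs exactly 3 (x, y) points")

    def translated(self, dx, dy):
        # vectorized: one operation over the whole buffer, no Python loop
        return self._points + np.tile([dx, dy], 3).astype(np.float32)


tri = Triangle([0.0, 0.0, 10.0, 0.0, 5.0, 8.0])
moved = tri.translated(2.0, 3.0)
assert moved.tolist() == [2.0, 3.0, 12.0, 3.0, 7.0, 11.0]
```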
After careful consideration of the proposal to include the NumPy library to optimize parts of Kivy, I've realized that this approach might present significant challenges, especially related to response times and the implications of introducing a new external dependency. Given this context, it seems more prudent to explore other alternatives that do not require such deep changes to our user base and infrastructure.
Some innovative solutions have already been discussed here, and I would like to complement those ideas with additional viable options that can be implemented using our existing resources and optimization techniques in Cython. I believe these additions could offer significant performance improvements without compromising the maintainability or accessibility of the project.
I will be sharing these suggestions in detail in the coming days so that we can discuss them more thoroughly. I am excited to hear your thoughts and collaborate in finding the best strategies to enhance our work on Kivy.
Thank you for your dedication and ongoing collaboration.
Hi @tuliogv,
Yes, adding NumPy as a new external dependency to optimize some code has the downside of an increased app size. Will wait for your suggestions in the coming days to discuss them more thoroughly!
Hey everyone,
I've been following the discussions on refactoring Python arrays for better memory management, especially considering the proposed solutions and the recent implementation for the Ellipse() class. Building on this momentum, I've suggested a solution for the Rectangle() class, complete with benchmarking results.
I believe it would be beneficial for us to systematically address each class, marking them for optimization and gradually improving them one by one. By doing so, we can work towards achieving the optimal solution for memory efficiency across the board.
Looking forward to collaborating on this optimization journey!
Best regards,
Software Versions
During tests that later pointed out an issue regarding `GL_TRIANGLE_FAN` (see PR #8662), Cython HTML annotation and perf tests made it quite clear that storing `_points` for `Smooth*` implementations as a Python `list` leads to a noticeable performance drop.

We should investigate alternatives for storing these values, work on "faster Kivy graphics", and avoid calls when unneeded.

For testing this issue (and for PR #8662) the following script has been used (it's just bad code meant for stressing our `Ellipse` implementation) by continuously scrolling up and down a huge amount of ellipses.

Test configuration: Ubuntu 22.04 + i7-7700HQ CPU + Intel(R) HD Graphics 630
Test results with `master` branch: ~44 fps
Test results with `master` branch, disabling the `self._points.extend([real_x, real_y])` line in the `Ellipse` implementation: ~60 fps

I did not test this specific fix on other SW/HW configurations, but I expect similar results, as only the CPU is involved here.