Open misl6 opened 1 month ago
@DexerBR (with no hurry) as you are the hero who recently touched these lines, do you have any suggestion on how to improve this specific area?
Good catch @misl6! I can also confirm the difference in performance you mentioned.
I think an alternative implementation using Cython objects can significantly improve the performance of this code. Take, for example, the changes marked with `# NOTE: alternative code` in the snippet below.
From my tests, the difference in performance between using this alternative implementation and not using it seems to be almost imperceptible (feel free to test it), unlike the performance difference you mentioned (current implementation using Python objects).
```python
# NOTE: alternative code
cdef struct ListPoints:
    float x
    float y


cdef class Ellipse(Rectangle):
    '''A 2D ellipse.

    :Parameters:
        `segments`: int, the default value is calculated from the range
            between angle.
            Define how many segments are needed for drawing the ellipse.
            The ellipse drawing will be smoother if you have many segments,
            however you can also use this property to create polygons with
            3 or more sides.
        `angle_start`: float, defaults to 0.0
            Specifies the starting angle, in degrees, of the disk portion.
        `angle_end`: float, defaults to 360.0
            Specifies the ending angle, in degrees, of the disk portion.

    .. versionchanged:: 1.0.7
        Added angle_start and angle_end.

    .. versionchanged:: 2.2.0
        The default number of segments is no longer 180, it is now calculated
        according to the angle range, as this is a more efficient approach.
    '''

    cdef int _segments
    cdef float _angle_start
    cdef float _angle_end
    # NOTE: alternative code
    cdef ListPoints* _list_points

    def __init__(self, *args, **kwargs):
        Rectangle.__init__(self, **kwargs)
        self.batch.set_mode('triangle_fan')
        self._segments = kwargs.get('segments') or 0
        self._angle_start = kwargs.get('angle_start') or 0.0
        self._angle_end = kwargs.get('angle_end') or 360.0

    cdef void build(self):
        cdef float *tc = self._tex_coords
        cdef int i, angle_dir
        cdef double angle_start, angle_end, angle_range
        cdef double x, y, angle, rx, ry, ttx, tty, tx, ty, tw, th
        cdef double cx, cy, tangential_factor, radial_factor, fx, fy
        cdef vertex_t *vertices = NULL
        cdef unsigned short *indices = NULL
        cdef int segments = self._segments

        # reset points
        # self._points = []

        if self.w == 0 or self.h == 0:
            return

        if segments == 0 or segments < 3:
            if segments != 0:
                Logger.warning('Ellipse: A minimum of 3 segments is required. '
                               'The default value will be used instead.')
            segments = max(1, int(abs(self._angle_end - self._angle_start) / 2))

        # NOTE: alternative code
        free(self._list_points)
        self._list_points = <ListPoints *>malloc(
            (segments + 1) * sizeof(ListPoints))
        if self._list_points == NULL:
            raise MemoryError("Failed to allocate memory for points")

        tx = tc[0]
        ty = tc[1]
        tw = tc[4] - tx
        th = tc[5] - ty
        angle = 0.0
        rx = 0.5 * self.w
        ry = 0.5 * self.h

        vertices = <vertex_t *>malloc((segments + 2) * sizeof(vertex_t))
        if vertices == NULL:
            raise MemoryError('vertices')

        indices = <unsigned short *>malloc(
            (segments + 2) * sizeof(unsigned short))
        if indices == NULL:
            free(vertices)
            raise MemoryError('indices')

        # calculate the start/end angle in radians, and adapt the range
        if self._angle_end > self._angle_start:
            angle_dir = 1
        else:
            angle_dir = -1

        # rad = deg * (pi / 180), where pi / 180 = 0.0174...
        angle_start = self._angle_start * 0.017453292519943295
        angle_end = self._angle_end * 0.017453292519943295
        angle_range = -1 * (angle_end - angle_start) / segments

        # add start vertex in the middle
        x = self.x + rx
        y = self.y + ry
        ttx = ((x - self.x) / self.w) * tw + tx
        tty = ((y - self.y) / self.h) * th + ty
        vertices[0].x = <float>(self.x + rx)
        vertices[0].y = <float>(self.y + ry)
        vertices[0].s0 = <float>ttx
        vertices[0].t0 = <float>tty
        indices[0] = 0

        # super fast ellipse drawing
        # credit goes to: http://slabode.exofire.net/circle_draw.shtml
        tangential_factor = tan(angle_range)
        radial_factor = cos(angle_range)

        # Calculate the coordinates for a circle with radius 0.5 about
        # the point (0.5, 0.5). Only stretch to an ellipse later.
        cx = 0.5
        cy = 0.5
        r = 0.5
        x = r * sin(angle_start)
        y = r * cos(angle_start)

        for i in range(1, segments + 2):
            ttx = (cx + x) * tw + tx
            tty = (cy + y) * th + ty
            real_x = self.x + (cx + x) * self.w
            real_y = self.y + (cy + y) * self.h
            vertices[i].x = <float>real_x
            vertices[i].y = <float>real_y
            vertices[i].s0 = <float>ttx
            vertices[i].t0 = <float>tty
            indices[i] = i

            fx = -y
            fy = x
            x += fx * tangential_factor
            y += fy * tangential_factor
            x *= radial_factor
            y *= radial_factor

            # self._points.extend([real_x, real_y])
            # NOTE: alternative code
            self._list_points[i - 1].x = <float>real_x
            self._list_points[i - 1].y = <float>real_y

        self.batch.set_data(vertices, segments + 2, indices, segments + 2)
        free(vertices)
        free(indices)
```
All code related to the `Smooth*` and `AntialiasingLine` instructions would have to be updated to take the new structure into account; as a consequence, an improvement in performance could be noticed!

I did not try it at runtime, but at first glance it looks faster.
What about doing the following instead (it would fix both this issue and possibly other performance issues caused by the memory allocation, like the one discussed in #8662):

- Let's `malloc` vertices and indices when the number of segments changes (or anything else that requires changing the vertices and indices count).
- Let's `free` vertices and indices only when the `Ellipse` is de-allocated (via `__dealloc__`), or when we need to update the vertices or indices size.
- Let's use `vertices` to provide the points for the `Smooth*` implementations (no extra values stored anywhere).

How does this plan sound?
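The allocate-only-when-the-size-changes part of this plan can be sketched in pure Python (a hypothetical stand-in; the real Kivy code would use `malloc`/`free` in Cython, and `FanGeometry` and its method names are invented for illustration):

```python
# Illustrative sketch: cache vertex/index buffers and reallocate only when
# the segment count changes, instead of on every build().

class FanGeometry:
    """Caches vertex/index buffers, reallocating only on a size change."""

    def __init__(self):
        self._vertices = None
        self._indices = None
        self._capacity = -1
        self.alloc_count = 0  # instrumentation: how many (re)allocations

    def _ensure_capacity(self, segments):
        count = segments + 2  # fan center + ring vertices + closing vertex
        if count != self._capacity:
            # the only place the buffers are (re)created
            self._vertices = [0.0] * (count * 4)  # x, y, s0, t0 per vertex
            self._indices = list(range(count))
            self._capacity = count
            self.alloc_count += 1

    def build(self, segments):
        self._ensure_capacity(segments)
        # ... fill self._vertices in place, then upload to the batch ...
        return self._vertices, self._indices


geo = FanGeometry()
geo.build(64)
geo.build(64)   # same segment count: buffers are reused, no reallocation
geo.build(128)  # size changed: one reallocation
assert geo.alloc_count == 2
```

With `free` moved into `__dealloc__`, repeated rebuilds at a fixed segment count would perform zero allocations.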
> - Let's `malloc` vertices and indices when the number of segments changes (or anything else that requires changing the vertices and indices count)
> - Let's `free` vertices and indices only when the `Ellipse` is de-allocated (via `__dealloc__`), or when we need to update the vertices or indices size.
I find this approach interesting, although the impact on memory needs to be measured to be sure that consumption would not increase significantly for apps that use many primitives. Memory consumption will obviously increase; we would need a metric to decide to what extent that increase is acceptable, and then measure whether the performance improvement is worth it.
Another possibility would be an option that lets the user choose the trade-off when rendering primitives, for example:

1. more memory consumption and better performance (hypothetically)
2. lower memory consumption and lower performance (hypothetically)
> - Let's use `vertices` to have points for the `Smooth*` implementations (no extra values stored anywhere).
Interesting; however, there would have to be a mapping defining which points are on the outside of the primitive. For example, when drawing a full ellipse (360 degrees), the interior vertices are discarded. The same logic applies to other primitives. Nothing that seems too complex, but it needs to be taken into account when "re-using" vertices to draw `AntialiasingLine`.
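For the triangle-fan case, the mapping is simple enough to sketch (illustrative only; `fan_perimeter` and the flat `[x, y, s0, t0]` layout are assumptions, not Kivy's actual internal format):

```python
# Illustrative sketch: for a triangle-fan ellipse the first vertex is the
# center, so the perimeter points an outline instruction would need are
# simply every vertex except index 0.

def fan_perimeter(vertices, stride=4):
    """Extract (x, y) perimeter points from a flat triangle-fan vertex array.

    ``vertices`` is assumed to be laid out [x, y, s0, t0, x, y, s0, t0, ...]
    with the fan center stored first.
    """
    points = []
    for offset in range(stride, len(vertices), stride):  # skip the center
        points.append((vertices[offset], vertices[offset + 1]))
    return points


# center at (0, 0), then three ring vertices
flat = [0.0, 0.0, 0.5, 0.5,
        1.0, 0.0, 1.0, 0.5,
        0.0, 1.0, 0.5, 1.0,
        -1.0, 0.0, 0.0, 0.5]
assert fan_perimeter(flat) == [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]
```

Other primitives would need their own mapping, but each is just a fixed rule over the vertex indices.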
Another option would be to keep the memory allocated until it is consumed by the `Smooth*` instructions, and then free it. For non-`Smooth*` instructions, the memory would be deallocated in another way.
> I mean, memory consumption will increase, but we would have to have a metric to define to what extent this increase would be acceptable.
Yeah, I agree. Better to fix the `_points` performance issue first, and then later check whether we need to try other performance boosts.
> Another option would also be to keep the memory allocated until it is used by the Smooth instructions, and then it would be freed. For non-Smooth instructions, the memory would be deallocated in another way.
I like this approach, I guess we can do the following:

- Create a method that allocates the memory for `vertices` and `indices` (move out from `build`)
- Create a method that calculates the `vertices` and `indices` (move out from `build`)
- Create a method that de-allocates the memory for `vertices` and `indices` (move out from `build`)
- Have a flag that decides whether or not to automatically de-allocate the memory after `build`, and free it manually for `Smooth*` implementations by calling the `_vertices_indices_dealloc()` method.

Do you think it can work?
Please remember we're in `3.0.0`, so if we need to break things to improve things, I'm all-in.
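The four-part split above can be sketched in plain Python (hypothetical: only `_vertices_indices_dealloc()` comes from the comment; every other name, and the list-based buffers standing in for `malloc`'d memory, are invented for illustration):

```python
# Illustrative sketch of the proposed split: separate alloc / compute /
# dealloc methods plus a flag controlling automatic deallocation.

class Instruction:
    def __init__(self, auto_dealloc=True):
        # Smooth* implementations would set this to False and free manually
        self.auto_dealloc = auto_dealloc
        self._vertices = None
        self._indices = None

    def _vertices_indices_alloc(self, count):
        self._vertices = [0.0] * (count * 4)  # x, y, s0, t0 per vertex
        self._indices = list(range(count))

    def _compute_vertices_indices(self):
        # placeholder for the actual geometry math
        for i in range(0, len(self._vertices), 4):
            self._vertices[i] = float(i)

    def _vertices_indices_dealloc(self):
        self._vertices = None
        self._indices = None

    def build(self, count):
        self._vertices_indices_alloc(count)
        self._compute_vertices_indices()
        # ... upload to the batch here ...
        if self.auto_dealloc:
            self._vertices_indices_dealloc()


plain = Instruction()
plain.build(8)
assert plain._vertices is None  # freed right after build

smooth = Instruction(auto_dealloc=False)
smooth.build(8)
assert smooth._vertices is not None  # kept alive for Smooth* to consume
smooth._vertices_indices_dealloc()
assert smooth._vertices is None
```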
Yes, I think it could work! After implementing this it will be very simple to test the persistent allocation system (during the life cycle of the instruction).
Yep, separating things into smaller methods makes (almost) everything simpler.
Marking it as "Ready" as we now have a plan.
Hello guys,
I've been exploring ways to enhance the performance of the graphics vertex instructions module, particularly for operations involving large sets of points. By leveraging numpy arrays instead of standard Python lists, we can achieve significant improvements in both memory usage and processing speed.
This change primarily affects the way we store and manipulate point data in vertex classes, allowing us to utilize numpy's efficient array operations. Here's an example modification for the `Triangle` class, which demonstrates these benefits.
I believe this enhancement would be beneficial for Kivy users who deal with complex graphical applications and would appreciate your consideration for merging these improvements.
- **Enhanced Performance:** Numpy arrays provide faster operations due to their optimized numerical computation capabilities, especially useful when dealing with large sets of data.
- **Memory Efficiency:** Numpy arrays use less memory and offer contiguous memory allocation, improving cache efficiency and reducing overhead.
- **Mathematical Operations:** Numpy simplifies complex mathematical transformations and calculations, allowing for efficient vectorized operations.
- **Scalability:** The ability to handle high-dimensional data efficiently makes numpy highly scalable for complex or large graphics tasks.
- **Integration:** Using numpy allows for seamless integration with a wide range of scientific and analytical libraries, enhancing the functionality and capabilities of Kivy applications.
Best regards, Túlio Gomes Vuolo
Example:
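(The original example was not captured in this thread; below is a minimal illustrative sketch of the general idea, storing point data in a contiguous NumPy array instead of a Python list. The `Triangle` class and its methods here are hypothetical, not Kivy's actual implementation.)

```python
# Hypothetical sketch: a Triangle storing its points in one contiguous
# float32 NumPy buffer instead of a Python list of floats.
import numpy as np


class Triangle:
    def __init__(self, points):
        # one contiguous float32 buffer instead of a list of Python objects
        self._points = np.asarray(points, dtype=np.float32)
        if self._points.shape != (6,):
            raise ValueError("Triangle needs exactly 3 (x, y) points")

    def translated(self, dx, dy):
        # vectorized: one operation over the whole buffer, no Python loop
        return self._points + np.tile([dx, dy], 3).astype(np.float32)


tri = Triangle([0.0, 0.0, 10.0, 0.0, 5.0, 8.0])
moved = tri.translated(2.0, 3.0)
assert moved.tolist() == [2.0, 3.0, 12.0, 3.0, 7.0, 11.0]
```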
After careful consideration of the proposal to include the NumPy library to optimize parts of Kivy, I've realized that this approach might present significant challenges, especially related to response times and the implications of introducing a new external dependency. Given this context, it seems more prudent to explore other alternatives that do not require such deep changes to our user base and infrastructure.
Some innovative solutions have already been discussed here, and I would like to complement those ideas with additional viable options that can be implemented using our existing resources and optimization techniques in Cython. I believe these additions could offer significant performance improvements without compromising the maintainability or accessibility of the project.
I will be sharing these suggestions in detail in the coming days so that we can discuss them more thoroughly. I am excited to hear your thoughts and collaborate in finding the best strategies to enhance our work on Kivy.
Thank you for your dedication and ongoing collaboration.
Hi @tuliogv,
Yes, adding NumPy as a new external dependency to optimize some code has the downside of an increased app size. Will wait for your suggestions in the coming days to discuss them more thoroughly!
Hey everyone,
I've been following the discussions on refactoring Python arrays for better memory management, especially considering the proposed solutions and the recent implementation for the Ellipse() class. Building on this momentum, I've suggested a solution for the Rectangle() class, complete with benchmarking results.
I believe it would be beneficial for us to systematically address each class, marking them for optimization and gradually improving them one by one. By doing so, we can work towards achieving the optimal solution for memory efficiency across the board.
Looking forward to collaborating on this optimization journey!
Best regards,
Software Versions
During tests that later pointed out an issue regarding `GL_TRIANGLE_FAN` (see PR #8662), Cython HTML annotation and perf tests made it quite clear that storing `_points` for `Smooth*` implementations as a Python `list` leads to a noticeable performance drop.

We should investigate alternatives for storing these values, work on "faster Kivy graphics", and avoid calls when unneeded.

For testing this issue (and for PR #8662) the following script has been used (it's just bad code meant for stressing our `Ellipse` implementation) by continuously scrolling up and down a huge amount of ellipses.

Test configuration: Ubuntu 22.04 + i7-7700HQ CPU + Intel(R) HD Graphics 630
Test results with `master` branch: ~44 fps
Test results with `master` branch, disabling the `self._points.extend([real_x, real_y])` line in the `Ellipse` implementation: ~60 fps

I did not test this specific fix on other SW/HW configurations, but I expect similar results, as only the CPU is involved here.