Issue №3258 opened by itzpr3d4t0r at 2022-06-25 18:25:58
Follows # 3257 .
Rationale
The Surface.blits() function is exceptional at providing freedom to the user as to how a sequence of surfaces are blit, letting them decide on a per-surface basis whether to draw just a portion of them, using a different blend-type or both. It's a very good general function, but as being so, it also pays the cost of "checking what we have and choose what to do" on a per surface basis. We need something stricter, yet more performant.
Idea
What I propose is a narrower blits() function, that I'll call ublits(), where the "u" stands for uniform, that will draw every Surface in the sequence fully (meaning area=None) while using the same blend-mode for each one, therefore having a more delimited behaviour, but with more performance overall.
Implementation
I'll start from the general to then go more in depth:
From the user perspective
The final user would implement something that follows the "signature":
assert(itemlength >= 2); , as the items from the sequence can't be "less than" a (Surface, dest) pair, but can be more (line 1959)
Check if (itemlength >= 3) if so -> argrect = PySequence_GetItem(item, 2); (line 1965)
Check if (itemlength >= 4) if so -> special_flags= PySequence_GetItem(item, 3); (line 1969)
Do validity checks on the extracted argrect, set it to NULL at the start of the while loop, manage it and throw errors otherwise. (line 2004)
Do validity checks for the special_flags variable (line 2022)
So to summarize there's a lot of getting, setting and checking for every surface in the sequence. All this with a specialized version of blits() would not be needed, not for each surface at least.
Potential ublits() implementation
The proposed implementation is basically a stripped down version of the current blits() function:
assert(itemlength >= 2); has to become assert(itemlength == 2);
Checks like if (itemlength >= 3) {...} and if (itemlength == 4) {...} are no longer needed:
argrect will always be None( NULL ) so it doesn't need to be managed (at line 2004)
special_flags don't need to be validated for each surface and will be validated only once, outside the while loop
Right now
PyObject *special_flags = NULL;
...
while((item = PyIter_Next(iterator))) {
...
if (special_flags) { // line 2022
if (!pg_IntFromObj(special_flags, &the_args)) {
bliterrornum = BLITS_ERR_MUST_ASSIGN_NUMERIC;
goto bliterror;
}
}
...
}
Becomes
PyObject *special_flags = blend_flags; // blend flags are the blend_flags passed to the func.'s kwargs params. i don't know how to get it using python funcs but that's what i mean.
...
if (special_flags) {
if (!pg_IntFromObj(special_flags, &the_args)) {
bliterrornum = BLITS_ERR_MUST_ASSIGN_NUMERIC;
goto bliterror;
}
}
while((item = PyIter_Next(iterator))) {
...
}
And i'm sure there are other things that can benefit some love but i still don't know enough python C functions to say...
A hard topic
I'll throw it out there, currently if we want to return a list of changed rects we can, but this is another thing we check for each surface and it's unnecessary imo. we could further speed up the code by removing these evaluations and moving them to a separate function for getting the rects. But i understand that that would mean too many functions to maintain and possibly update.
I'm reading these issue posts, just to let you know.
I'm planning to do another round of SIMD performance optimisations on some of the more 'core' blitting code in a future version of pygame once the current changes adding AVX2 to the special blend modes have had a chance to bed in (i.e. they've been tested in general use for a good while).
I may look into blits/ublits then as well - but if you want to try a PR yourself (with some accompanying performance testing) then you are also welcome to do that. I tend to work fairly slowly these days as I have lots keeping me busy.
@MyreMylar Yes sure, i already tried messing up with the code and ended up with a working function, but i still have to check whether the returned list of rects is correct.
If you're trying to do a "restricted" blit function for higher performance, you could remove keyword arguments and use METH_FASTCALL as your type of python function. (Removing keywords because there isn't a good way to parse keywords with METH_KEYWORDS and FASTCALL together.
Currently you have ublits using METH_VARARGS | METH_KEYWORDS.
If you're trying to do a "restricted" blit function for higher performance, you could remove keyword arguments and use METH_FASTCALL as your type of python function. (Removing keywords because there isn't a good way to parse keywords with METH_KEYWORDS and FASTCALL together.
Currently you have ublits using METH_VARARGS | METH_KEYWORDS.
src_c/surface.c(2139): error C2440: 'function': impossibile to convert from 'PyObject' to 'PyObject *'
src_c/surface.c(2139): warning C4024: 'function through pointer': different types between formal parameter 1 and the effective one
how didn't i think about it? well i resolved it but i can't join now, i'll be very busy for the next 4 hours approx. so if you're still on we can discuss this function overall.
Did you profile the code? Are these really the slowdowns?
`has to do the following for each and every item from the sequence:
assert(itemlength >= 2); , as the items from the sequence can't be "less than" a (Surface, dest) pair, but can be more (line 1959)
Check if (itemlength >= 3) if so -> argrect = PySequence_GetItem(item, 2); (line 1965)
Check if (itemlength >= 4) if so -> special_flags= PySequence_GetItem(item, 3); (line 1969)
Do validity checks on the extracted argrect, set it to NULL at the start of the while loop, manage it and throw errors otherwise. (line 2004)
Do validity checks for the special_flags variable (line 2022)
So to summarize there's a lot of getting, setting and checking for every surface in the sequence. All this with a specialized version of blits() would not be needed, not for each surface at least.`
I see that processing a list of variable length tuples needs lots of checking. Maybe this could be avoided with some default values? If a named tuple supports default values it could be used so the sequence should always be a 4 tuple (all parameters) only needing minimal or no checks at all. This brings me to another idea: have not checks in this method and a separate validation method perhaps? So in 'debug' mode or if something goes wrong the validation method could give more information what is wrong.
I looked it up: named tuples support default values. Thinking about it again maybe an interface would be nicer to extend with other properties/attributes. If there were an interface then the Sprite classes could implement them. Or other classes and it would be clear that this object can be blitted.
Hi and thank you for your suggestion. First and foremost, i want to be clear, this is not meant to be an inclusive function, but its purpose is specific and ideally it would be "Just draw the sequence as fast as possible, as simple as possible, without bothering with additional parameters and flexibility". For that i think we're already fine with blits().
Thinking about it again maybe an interface would be nicer to extend with other properties/attributes. If there were an interface then the Sprite classes could implement them
Here you are overthinking the purpose of this function, i intend it as more of a resource that you can use when you want some more efficiency and readability (because smaller tuples means less mess). I plan to remove doreturn for example, but that's not something i'll decide on my own.
The function isn't that complex to see where some gains could be made. What i can only think of doing with my very basic skills, is working to get some speed back while looping through the sequence. That you can do by lowering the number of checks and "GET_ITEM" calls for example, but also optimizing the type of sequence we use aswell. These aren't really the biggest slowdown(the biggest slowdown is of course the blit), infact on my currently uncommitted version of the function, we loop through an array with PySequence_Fast_ITEMS, and we get a 13-14% better perf on average, and it's about the most i can strip down from the function whithout making it unsafe. I didn't profile the code but this was the thought behind what i wrote.
What you suggest you could submit to someone who knows a bit more about python C ABI, as i'm a newbie and i may be very wrong without even knowing.
If a named tuple supports default values it could be used so the sequence should always be a 4 tuple (all parameters) only needing minimal or no checks at all
Do you mean that we should expect to get a sequence of named tuples of 4 items each? but wouldn't that defeat the purpose anyway? because you'd have to extract all four values from the tuple for each surface and would be the same as blits. Why extract the value None and pygame.BLEND_ADD from N tuples when you can set both once?
Generally speaking the idea is getting the fastest looping we can possibly get, with fewer checks and getitem(s) as we possibly can, while waiting for a faster implementation of the blit function. This PR would be helpful to that extent because you'd already have a fast loop to begin with. If the loop is slow and the blit is very fast then it could become a bottleneck(exaggerating here).
Issue №3258 opened by itzpr3d4t0r at 2022-06-25 18:25:58
Follows # 3257 .
Rationale
The Surface.blits() function is exceptional at providing freedom to the user as to how a sequence of surfaces are blit, letting them decide on a per-surface basis whether to draw just a portion of them, using a different blend-type or both. It's a very good general function, but as being so, it also pays the cost of "checking what we have and choose what to do" on a per surface basis. We need something stricter, yet more performant.
Idea
What I propose is a narrower blits() function, that I'll call ublits(), where the "u" stands for uniform, that will draw every Surface in the sequence fully (meaning area=None) while using the same blend-mode for each one, therefore having a more delimited behaviour, but with more performance overall.
Implementation
I'll start from the general to then go more in depth:
From the user perspective
The final user would implement something that follows the "signature":
So we want:
Current blits() slowdowns
The current blits() implementation : https://github.com/pygame/pygame/blob/d31eae2c804592a1b1875122c139e209055fa90e/src_c/surface.c# L1902 has to do the following for each and every item from the sequence:
assert(itemlength >= 2);
, as the items from the sequence can't be "less than" a(Surface, dest)
pair, but can be more (line 1959)if (itemlength >= 3)
if so ->argrect = PySequence_GetItem(item, 2);
(line 1965)if (itemlength >= 4)
if so ->special_flags= PySequence_GetItem(item, 3);
(line 1969)argrect
, set it toNULL
at the start of the while loop, manage it and throw errors otherwise. (line 2004)special_flags
variable (line 2022)So to summarize there's a lot of getting, setting and checking for every surface in the sequence. All this with a specialized version of blits() would not be needed, not for each surface at least.
Potential ublits() implementation
The proposed implementation is basically a stripped down version of the current blits() function:
assert(itemlength >= 2);
has to becomeassert(itemlength == 2);
Checks like
if (itemlength >= 3) {...}
andif (itemlength == 4) {...}
are no longer needed:argrect
will always beNone
(NULL
) so it doesn't need to be managed (at line 2004)Right now
Becomes
special_flags
don't need to be validated for each surface and will be validated only once, outside the while loopRight now
Becomes
And i'm sure there are other things that can benefit some love but i still don't know enough python C functions to say...
A hard topic
I'll throw it out there, currently if we want to return a list of changed rects we can, but this is another thing we check for each surface and it's unnecessary imo. we could further speed up the code by removing these evaluations and moving them to a separate function for getting the rects. But i understand that that would mean too many functions to maintain and possibly update.
Comments
*MyreMylar commented at 2022-06-26 09:47:09*
I'm reading these issue posts, just to let you know.
I'm planning to do another round of SIMD performance optimisations on some of the more 'core' blitting code in a future version of pygame once the current changes adding AVX2 to the special blend modes have had a chance to bed in (i.e. they've been tested in general use for a good while).
I may look into blits/ublits then as well - but if you want to try a PR yourself (with some accompanying performance testing) then you are also welcome to do that. I tend to work fairly slowly these days as I have lots keeping me busy.
*itzpr3d4t0r commented at 2022-06-26 09:51:42*
@MyreMylar Yes sure, i already tried messing up with the code and ended up with a working function, but i still have to check whether the returned list of rects is correct.
*itzpr3d4t0r commented at 2022-06-26 09:53:25*
What i intended to have is a somewhat simpler to use function that is at least equal if not a little better than blits().
*Starbuck5 commented at 2022-06-26 21:34:14*
If you're trying to do a "restricted" blit function for higher performance, you could remove keyword arguments and use METH_FASTCALL as your type of python function. (Removing keywords because there isn't a good way to parse keywords with METH_KEYWORDS and FASTCALL together.
Currently you have ublits using
METH_VARARGS | METH_KEYWORDS
.See: https://docs.python.org/3/c-api/structures.html# METH_FASTCALL
*itzpr3d4t0r commented at 2022-06-27 06:13:31*
I've tried implementing it but there must be something i'm missing here...
What am i missing?
*Starbuck5 commented at 2022-06-27 06:22:43*
METH_FASTCALL doesn't give you a pyobject args pointer. It gives you a c array of pyobjects and a py_ssize_t number of arguments in said array.
*itzpr3d4t0r commented at 2022-06-27 06:37:54*
Nevermind i found the issue, but i get
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, matplotlib._c_internal_utils, PIL._imaging, matplotlib._path, kiwisolver._cext, pygame.base, pygame.constants, pygame.rect, pygame.rwobject, pygame.surflock, pygame.bufferproxy, pygame.math, pygame.surface, pygame.display, pygame.draw, pygame.event, pygame.imageext, pygame.image, pygame.joystick, pygame.key, pygame.mouse, pygame.time, pygame.mask, pygame.pixelcopy, pygame.transform, pygame.font, pygame.mixer_music, pygame.mixer, pygame.scrap, pygame.context, matplotlib._image, pygame._freetype (total: 44)
this when i run ublitsOk now i have something like this:
and it's still not working, it says:
src_c/surface.c(2139): error C2440: 'function': impossibile to convert from 'PyObject' to 'PyObject *' src_c/surface.c(2139): warning C4024: 'function through pointer': different types between formal parameter 1 and the effective one
*Starbuck5 commented at 2022-06-27 06:50:23*
->
You're trying to get pyobject pointers out of an array of pyobject pointers, index the array.
Also, this would be a great thing to discuss on discord. Better back and forth than a github comment section.
*itzpr3d4t0r commented at 2022-06-27 07:04:59*
how didn't i think about it? well i resolved it but i can't join now, i'll be very busy for the next 4 hours approx. so if you're still on we can discuss this function overall.
*dr0id commented at 2022-06-27 19:17:03*
Hi, thanks for your efforts.
Did you profile the code? Are these really the slowdowns? `has to do the following for each and every item from the sequence:
So to summarize there's a lot of getting, setting and checking for every surface in the sequence. All this with a specialized version of blits() would not be needed, not for each surface at least.`
I see that processing a list of variable length tuples needs lots of checking. Maybe this could be avoided with some default values? If a named tuple supports default values it could be used so the sequence should always be a 4 tuple (all parameters) only needing minimal or no checks at all. This brings me to another idea: have not checks in this method and a separate validation method perhaps? So in 'debug' mode or if something goes wrong the validation method could give more information what is wrong.
I looked it up: named tuples support default values. Thinking about it again maybe an interface would be nicer to extend with other properties/attributes. If there were an interface then the Sprite classes could implement them. Or other classes and it would be clear that this object can be blitted.
Just my thoughts.
*itzpr3d4t0r commented at 2022-06-27 20:03:05*
Hi and thank you for your suggestion. First and foremost, i want to be clear, this is not meant to be an inclusive function, but its purpose is specific and ideally it would be "Just draw the sequence as fast as possible, as simple as possible, without bothering with additional parameters and flexibility". For that i think we're already fine with blits().
Here you are overthinking the purpose of this function, i intend it as more of a resource that you can use when you want some more efficiency and readability (because smaller tuples means less mess). I plan to remove
doreturn
for example, but that's not something i'll decide on my own.The function isn't that complex to see where some gains could be made. What i can only think of doing with my very basic skills, is working to get some speed back while looping through the sequence. That you can do by lowering the number of checks and "GET_ITEM" calls for example, but also optimizing the type of sequence we use aswell. These aren't really the biggest slowdown(the biggest slowdown is of course the blit), infact on my currently uncommitted version of the function, we loop through an array with
PySequence_Fast_ITEMS
, and we get a 13-14% better perf on average, and it's about the most i can strip down from the function whithout making it unsafe. I didn't profile the code but this was the thought behind what i wrote. What you suggest you could submit to someone who knows a bit more about python C ABI, as i'm a newbie and i may be very wrong without even knowing.*itzpr3d4t0r commented at 2022-06-27 20:12:47*
Do you mean that we should expect to get a sequence of named tuples of 4 items each? but wouldn't that defeat the purpose anyway? because you'd have to extract all four values from the tuple for each surface and would be the same as blits. Why extract the value None and pygame.BLEND_ADD from N tuples when you can set both once?
*itzpr3d4t0r commented at 2022-06-27 20:17:09*
Generally speaking the idea is getting the fastest looping we can possibly get, with fewer checks and getitem(s) as we possibly can, while waiting for a faster implementation of the blit function. This PR would be helpful to that extent because you'd already have a fast loop to begin with. If the loop is slow and the blit is very fast then it could become a bottleneck(exaggerating here).