RandomGamingDev opened this issue 9 months ago
Can you elaborate a bit on what exactly gets batched? In your other proposal you mentioned begin/endShape, would this mean combining adjacent draw calls into one shape if possible?

Some follow-ups:

* Is this for both 2D and WebGL?
* Is this applicable to things that change colors, materials, transforms, etc between shapes?
> Can you elaborate a bit on what exactly gets batched? In your other proposal you mentioned begin/endShape, would this mean combining adjacent draw calls into one shape if possible?
For instance, in the canvas API, `beginPath()` and `closePath()` are used for batching so that you can render things with the same stroke and fill all at the same time, and you can do something similar by batching the vertices together into a single VAO so that they all get sent to the GPU at the same time. What I'm suggesting is that we keep track of the shapes so that we can batch them automatically.
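For reference, here's a minimal sketch of that Canvas API pattern (plain canvas code, not p5): subpaths accumulate in one path, and `fill()`/`stroke()` are each paid for once rather than once per shape.

```javascript
// Batch 500 circles into a single path, then fill and stroke once.
const ctx = document.querySelector('canvas').getContext('2d');

ctx.fillStyle = 'tomato';
ctx.strokeStyle = 'black';

ctx.beginPath();
for (let i = 0; i < 500; i++) {
  const x = Math.random() * ctx.canvas.width;
  const y = Math.random() * ctx.canvas.height;
  ctx.moveTo(x + 10, y); // new subpath, so the circles aren't connected
  ctx.arc(x, y, 10, 0, Math.PI * 2);
}
ctx.fill();   // one fill call for all 500 circles
ctx.stroke(); // one stroke call for all 500 circles
```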
> Some follow-ups:
>
> * Is this for both 2D and WebGL?
This is for both 2D and WebGL, since both support batching and it isn't currently done in p5.js.
> * Is this applicable to things that change colors, materials, transforms, etc between shapes?
For the Canvas API, probably not for the colors and materials part, but for WebGL, maybe. Otherwise, for things like just drawing different shapes, yes, those would be batched.
I think when evaluating this, we'd probably be weighing it against maybe less automatic solutions, like documenting manual batching using existing p5 tools as a performance optimization in the new set of tutorials we're currently writing for the in-progress new p5js.org site. Possibly also expanding the use of `beginContour`/`endContour` to make that easier too, as those function like `moveTo` in the native canvas API.
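To illustrate what such a tutorial might document, here's a hedged example of manual batching with existing p5 tools, drawing many squares as one shape instead of as separate `rect()` calls (assuming the default 2D renderer):

```javascript
// 1000 squares drawn as a single QUADS shape: one shape's worth of
// geometry per frame instead of 1000 individual rect() calls.
let particles = [];

function setup() {
  createCanvas(400, 400);
  for (let i = 0; i < 1000; i++) {
    particles.push({ x: random(width), y: random(height) });
  }
}

function draw() {
  background(220);
  noStroke();
  fill(0);
  beginShape(QUADS); // every 4 vertices form one quad
  for (const p of particles) {
    vertex(p.x, p.y);
    vertex(p.x + 4, p.y);
    vertex(p.x + 4, p.y + 4);
    vertex(p.x, p.y + 4);
  }
  endShape();
}
```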
So two questions to be answered, I think, are (1) what the added engineering complexity of batching would be, and (2) can we handle enough cases to make that complexity worth it. For (1), I gather we'd have to store info about the draw call and avoid actually drawing it until we hit a new draw call that can't be grouped, or if we reach the end of `draw()`? And for (2), I think it wouldn't be able to support changes in blend modes, textures, line thickness, and probably color (definitely in 2D, and even for WebGL where we have vertex colors, if we have multiple colors per batch but they are all drawn at once, can we deterministically ensure the last one the user asked to draw ends up on top?). We'd probably also have to see whether transforms are worth being flattened into one shape, or if doing the transforms to each vertex on the CPU slows down complex shapes enough that it removes the benefit of batching. That tradeoff is clearly worth it when making a shape you'll continue to reuse, like in `buildGeometry`, but if the transforms change each frame and we have to keep incurring the cost, it may not be.
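For (1), a rough sketch of that deferral logic might look like the following (all names here are hypothetical, not existing p5 internals):

```javascript
// Queue draw calls until one arrives that can't share state with the
// queue, then emit the whole group as a single real draw call.
class DrawQueue {
  constructor(renderBatch) {
    this.renderBatch = renderBatch; // callback that issues one draw call
    this.pending = [];
    this.state = null; // fill, stroke, blend mode, etc. shared by the batch
  }

  add(shape, state) {
    // A state change means the new call can't join the current group.
    if (this.state !== null && !this.statesMatch(this.state, state)) {
      this.flush();
    }
    this.state = state;
    this.pending.push(shape);
  }

  statesMatch(a, b) {
    return a.fill === b.fill && a.stroke === b.stroke && a.blend === b.blend;
  }

  flush() {
    if (this.pending.length > 0) {
      this.renderBatch(this.pending, this.state); // one call for the group
    }
    this.pending = [];
    this.state = null;
  }
}
// flush() would also run when we reach the end of draw().
```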
I think it's definitely worth surfacing manual batching in our docs and tutorials, and if investigation shows that we can get some good across-the-board improvements without needing a big overhaul of the rendering pipeline, that would be great too.
> So two questions to be answered, I think, are (1) what the added engineering complexity of batching would be, and (2) can we handle enough cases to make that complexity worth it.
>
> For (1), I gather we'd have to store info about the draw call and avoid actually drawing it until we hit a new draw call that can't be grouped, or if we reach the end of `draw()`?
Rendering at the end of the `draw()` call with all of the draw calls is what I'm thinking of.

> And for (2), I think it wouldn't be able to support changes in blend modes, textures, line thickness, and probably color (definitely in 2D, and even for WebGL where we have vertex colors, if we have multiple colors per batch but they are all drawn at once, can we deterministically ensure the last one the user asked to draw ends up on top?)
> We'd probably also have to see whether transforms are worth being flattened into one shape, or if doing the transforms to each vertex on the CPU slows down complex shapes enough that it removes the benefit of batching. That tradeoff is clearly worth it when making a shape you'll continue to reuse, like in `buildGeometry`, but if the transforms change each frame and we have to keep incurring the cost, it may not be.
It's definitely worth it to flatten everything into one buffer. We're already calculating the vertices, so compressing them all into one shape shouldn't pose an issue, especially since transforms like the 3D-to-2D projection should for the most part still be done on the GPU for WebGL. For the Canvas API we'd do the same thing, but using paths, moving around `beginPath()` and `closePath()` to draw things of the same stroke and color in a single draw call. Also, this would amortize the cost: instead of the multiple GPU calls currently slowing down programs, we're substituting them with appends to a CPU buffer, then rendering everything at the same time in a few (or a single) draw call at the end of `draw()`. Even the cost incurred by adding to a buffer is negligible next to the possible performance gains, especially considering the global nature of p5.js operations like `fill()`, which already encourage users to write their code so that shapes of the same color are drawn together, meaning batching will come easily to many sketches.
> I think it's definitely worth surfacing manual batching in our docs and tutorials, and if investigation shows that we can get some good across-the-board improvements without needing a big overhaul of the rendering pipeline, that would be great too.
It depends on what you consider large, but I think the overhaul would fall short of that; I'd call it an above-medium pipeline change.
My only issue with batching is how we will detect drawing multiple shapes with strokes that touch one another with the Canvas API, since it'll affect the strokes if we don't separate those draw calls. For this, I see 2 main paths: doing collision detection between stroked shapes so that touching ones get separated into different draw calls, or, for WebGL, sidestepping the problem, since WebGL has depth and depth buffers that can be used for ensuring that the last thing drawn ends up at the top even if rendered out of order.
This isn't true for blending transparency: some blend modes like ADD are commutative, but the default BLEND mode isn't. Also, when things are drawn at the same Z value, we use the painter's algorithm, so things the user asks to be drawn last end up on top. If everything ends up in one draw call, and everything is at the same z, it's unclear that we're able to preserve this property.
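A quick numeric illustration of that point, using the standard source-over equation on a single channel in [0, 1] (plain arithmetic, not p5 code):

```javascript
// Default BLEND: out = src * srcAlpha + dst * (1 - srcAlpha) -> order-dependent.
// ADD:           out = min(src + dst, 1)                     -> commutative.
const blend = (src, dst, srcAlpha) => src * srcAlpha + dst * (1 - srcAlpha);
const add = (src, dst) => Math.min(src + dst, 1);

const bg = 0.0;
// Opaque 0.8 first, then translucent 0.4 at alpha 0.5:
console.log(blend(0.4, blend(0.8, bg, 1.0), 0.5)); // 0.6
// Translucent 0.4 first, then opaque 0.8:
console.log(blend(0.8, blend(0.4, bg, 0.5), 1.0)); // 0.8 -> order changed the result
// ADD gives the same result either way:
console.log(add(0.4, add(0.8, bg)), add(0.8, add(0.4, bg))); // 1 1
```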
> It's definitely worth it to flatten everything into one buffer. We're already calculating the vertices, so compressing them all into one shape shouldn't pose an issue, especially since transforms like the 3D-to-2D projection should for the most part still be done on the GPU for WebGL. For the Canvas API we'd do the same thing, but using paths, moving around `beginPath()` and `closePath()` to draw things of the same stroke and color in a single draw call.
If we want everything to be one buffer that we draw as one call, in WebGL, we need to deal with transforms. Currently, they're a uniform, so they don't change within a batch of vertices being drawn. I think we need to either:

1. apply the transform to all the vertices before sending it to the GPU
2. add a matrix along with each vertex (similar to normals)
3. split batches on transforms
(1) can be very costly. In this demo, I draw a bunch of bouncing balls, drawn as separate shapes, and drawn by flattening the transforms into one big vertex array using `buildGeometry` (since it already does this): https://editor.p5js.org/davepagurek/sketches/FAqbP5k8i

Try toggling the `TEST_CPU_TRANSFORMS` constant to true to see how much the frame rate drops: for me, it drops from 60 to 20, and the majority of that time is doing the transform flattening.
(2) is also costly as it sends a lot more data to the GPU by duplicating a transformation matrix per vertex. We did some similar tests when trying to optimize line rendering with instancing. On some computers, the duplication was fast and worth it overall, but on others, it made things much slower. Also, I don't think we can use WebGL stride and such to avoid the duplication because the shapes might not have the same number of vertices.
(3) would work around this issue, but puts some big limits on the usefulness of batching, since `translate()` and other transforms are very common in p5 code.
> This isn't true for blending transparency: some blend modes like ADD are commutative, but the default BLEND mode isn't. Also, when things are drawn at the same Z value, we use the painter's algorithm, so things the user asks to be drawn last end up on top. If everything ends up in one draw call, and everything is at the same z, it's unclear that we're able to preserve this property.
If the rendering depends on the values of the pixels being rendered over and therefore has to be done in order, we can avoid batching those cases, although other types of blending would still work with batching. Otherwise, we can rely on the depth value to determine what ends up on top.
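For the depth-based option, one possible sketch (the step size and names below are invented for illustration) nudges each successively queued shape slightly in z so that painter's-algorithm order survives a single draw call:

```javascript
// Later shapes get a depth value that wins the depth test, so "last
// drawn ends up on top" holds even inside one batched draw call.
const Z_STEP = 1e-5;
let drawIndex = 0;

function queueShape(batch, vertices) {
  const z = drawIndex++ * Z_STEP;
  for (const v of vertices) {
    batch.positions.push(v.x, v.y, z);
  }
}
// drawIndex would reset when the batch is flushed at the end of the frame.
```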
> If we want everything to be one buffer that we draw as one call, in WebGL, we need to deal with transforms. Currently, they're a uniform, so they don't change within a batch of vertices being drawn. I think we need to either:
>
> 1. apply the transform to all the vertices before sending it to the GPU
> 2. add a matrix along with each vertex (similar to normals)
> 3. split batches on transforms
The main transformations can easily be done on the CPU, since they don't rely on complex operations, which means they can still be batched to reduce the number of draw calls. However, for more complex operations like 3D rotations, especially ones utilizing matrices (like the example you gave), which make better use of the GPU's resources, we can batch the shapes that share, say, the same rotation. That still reduces the number of draw calls while ensuring the GPU does the transform work, which should still perform better, or at worst incur a nearly negligible deficit.
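A rough sketch of that CPU/GPU split, with all names invented:

```javascript
// Simple transforms are flattened per vertex on the CPU so shapes can
// share a batch; a complex transform ends the batch and is left to the
// GPU as a per-batch uniform instead.
function appendShape(batch, vertices, transform) {
  if (transform.isSimple) {
    // Translate + uniform scale: two multiplies and two adds per vertex.
    for (const v of vertices) {
      batch.positions.push(
        v.x * transform.scale + transform.tx,
        v.y * transform.scale + transform.ty
      );
    }
  } else {
    flushBatch(batch); // hypothetical: emits the pending draw call
    batch.uniformMatrix = transform.matrix; // the GPU applies this to the next batch
    for (const v of vertices) {
      batch.positions.push(v.x, v.y);
    }
  }
}
```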
I think it's worth trying, although I'm anticipating there being a point at which the batching process itself starts to have more overhead than the separate batch calls + GPU transforms would, and I'd like to keep it from introducing slowness in those cases. Especially as the number of shapes grows, collision detection becomes more and more of a problem (games use spatial partitioning algorithms to help with that, but even coming up with partitions involves going over all the points in the shape, and some shapes can get big). Setting a reasonable limit on the size of a batch could maybe help with that.
I think it's worth considering that it might be an easier and more effective change to let users do batching themselves:

- It's easy to turn off the batching if you find it doesn't speed things up
- People might be able to use it in a scenario where we can't if we're doing it automatically; e.g. maybe we'd think shapes collide based on convex hulls, but actually the shapes are not convex, and the sketch author knows they won't intersect
- We could be OK with blending, z order, etc breaking in those cases: it would be up to the user to batch the things where they know it's safe to batch or don't care about which one draws on top (and then we don't need lots of code to check for those cases)
> I think it's worth trying, although I'm anticipating there being a point at which the batching process itself starts to have more overhead than the separate batch calls + GPU transforms would, and I'd like to keep it from introducing slowness in those cases.
Because we're only talking about doing the main, simple transformations on the CPU, while more complex transformations that rely on things like matrix multiplications are done on the GPU, with each distinct GPU transformation getting its own batch, it would take an extreme and unusual condition, like hundreds of translate and rotate calls alternating, for the performance deficit to be large enough to make the batching approach slower than the non-batching one, which is unlikely in at least 99% of real-world p5.js usage.
> Especially as the number of shapes grows, collision detection becomes more and more of a problem (games use spatial partitioning algorithms to help with that, but even coming up with partitions involves going over all the points in the shape, and some shapes can get big). Setting a reasonable limit on the size of a batch could maybe help with that.
Even with collision detection, batching should still provide a significant performance advantage. We'll want to set a limit on the size of a batch, but it can be massive, something a lot of sketches won't reach. Realistically, if the detection is well optimized with, say, quadtrees, the time spent detecting collisions would still be less than the cost of splitting everything into more draw calls. Plus, collision detection only has to be done on shapes that have strokes and use the Canvas API, and its cost is nearly insignificant unless the shapes are always in the same sector of the quadtree and require a more expensive check (or we just forgo the check and render the batch before moving on to another). All in all, collision detection is only needed for certain shapes, only has significant cost for some of them in certain conditions, and still provides a performance increase compared to the alternative, making it absolutely still worth it here. Although, we likely won't be doing collision detection for some of the custom shapes created with things like `beginShape()` and `endShape()`, especially if they're convex.
> I think it's worth considering that it might be an easier and more effective change to let users do batching themselves:
>
> - It's easy to turn off the batching if you find it doesn't speed things up
> - People might be able to use it in a scenario where we can't if we're doing it automatically; e.g. maybe we'd think shapes collide based on convex hulls, but actually the shapes are not convex, and the sketch author knows they won't intersect
> - We could be OK with blending, z order, etc breaking in those cases: it would be up to the user to batch the things where they know it's safe to batch or don't care about which one draws on top (and then we don't need lots of code to check for those cases)
While letting the user batch everything themselves would work for users who really want to optimize their performance, it'd be unnecessarily confusing for them, and anyone not using those features would take an unnecessary performance loss from not batching, unlike with many other rendering libraries. While manual batching options should be offered (especially better ones than just `beginShape()` & `endShape()`), like instancing, to allow for hand-tuned performance, especially when dealing with things that rely on the pixels before them (like blending), I think we should try to handle batching automatically in order to maximize performance for sketches, since it shouldn't change them in any significant way other than the performance increase. Also, we should give users the ability to enable and disable batching even in the automatic mode, not only for easier bug testing, but for edge cases too, where the user might want to, say, write an extension for p5.js without batching. I think `enableBatching()` and `disableBatching()` functions, as well as a `flushBatch()`/`renderBatch()` function for forcing p5.js to render the batch early and start a new one, would be great options.
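To illustrate, usage of those proposed controls might look like this (none of these batching functions exist in p5.js today; this is only what the API could feel like):

```javascript
let particles;

function setup() {
  createCanvas(400, 400);
  particles = Array.from({ length: 500 }, () => ({
    x: random(width),
    y: random(height),
  }));
}

function draw() {
  background(220);
  enableBatching(); // hypothetical: shapes are queued instead of drawn immediately
  noStroke();
  fill(0);
  for (const p of particles) {
    circle(p.x, p.y, 8); // these can all share one draw call
  }
  flushBatch(); // hypothetical: force the queued shapes to render now

  disableBatching(); // hypothetical: e.g. before order-sensitive blending
  blendMode(MULTIPLY);
  circle(mouseX, mouseY, 50);
  blendMode(BLEND);
}
```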
> Plus, collision detection only has to be done on shapes that have strokes and use the Canvas API
I'm not sure I'm following why it only applies to those. Since the last drawn shape at the same z value should go on top, shouldn't all shapes need to know if they'll collide with something else in the batch?
> it would take an extreme and unusual condition, like hundreds of translate and rotate calls alternating, for the performance deficit to be large enough to make the batching approach slower than the non-batching one
If we have more than one transform in the batch, I think we have to multiply all points by the current matrix (possibly using a simpler method like you described to avoid actually doing a matrix multiply), but in that case, it mostly only matters how many points we multiply. (I suppose each time the matrix changes, we have to check if it's a simple matrix or not, and maybe that cost could add up? But I'm less concerned about that than about vertex counts.)
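As one possible definition of "simple", using the canvas `setTransform(a, b, c, d, e, f)` layout, the check itself is cheap; it's the per-vertex work that dominates:

```javascript
// b and c are the entries that mix x into y and y into x, so a matrix
// with b === 0 and c === 0 is pure translate + scale (no rotation/shear).
function isSimple2D([a, b, c, d, e, f]) {
  return b === 0 && c === 0;
}

isSimple2D([2, 0, 0, 2, 100, 50]); // true: scale by 2, translate (100, 50)
isSimple2D([0, 1, -1, 0, 0, 0]);   // false: 90-degree rotation
```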
> Because we're only talking about doing the main, simple transformations on the CPU, while more complex transformations that rely on things like matrix multiplications are done on the GPU, with each distinct transformation getting its own batch, it would take an extreme and unusual condition, like hundreds of translate and rotate calls alternating, for the performance deficit to be large enough to make the batching approach slower than the non-batching one, which is unlikely in at least 99% of real-world p5.js usage.
Here's a demo that doesn't do matrix transforms, it just handles translation: https://editor.p5js.org/davepagurek/sketches/lPfCkf4vk

You can play around with how many points per particle there are, and how many particles, to see how it affects the frame rate. For small numbers, batching is fine, but it starts to lag as the numbers get higher. On my computer, these all are handled fine using the current rendering method. A high number of points per particle isn't unusual, especially if shapes are drawn with curves. This is also with no collision detection yet.
Again, I think this is still worth trying, but it's not a clear cut across-the-board gain. I think when determining when to flush a batch, we'd need to consider:

- At a certain number of vertices, don't try to do anything on the CPU
- How much of an added cost it is to do the collision detection
- How costly it is to check if a matrix is simple, and what counts as simple (is it just 2D matrices? do we special case non-rotation matrices? should we give up immediately whenever someone calls `applyMatrix` with a custom matrix?)
- How do we handle custom shaders that might affect the positioning and therefore correctness of the collision detection? Do we give up immediately here too?
- In real world sketches, are bounding boxes too loose and cause us to lose all performance gains by having boxes intersect and flush batches? If so, are there other collision checks that we can do that don't add too much overhead?

Also, because I think this wouldn't be a breaking change, this doesn't have the same somewhat tight deadline that other 2.0 proposals have (we want to put breaking API changes into the 2.0 release to comply with semantic versioning), so although I feel that this needs R&D and isn't as straightforward as other proposals, I'm down to help guide any of that R&D if contributors are interested in trying it, and we're free to finish it in a 2.x release if we need to.
> I'm not sure I'm following why it only applies to those. Since the last drawn shape at the same z value should go on top, shouldn't all shapes need to know if they'll collide with something else in the batch?
For both the Canvas API and WebGL, no (I'm assuming you don't mean an actual separate "z value" for the Canvas API and are including it in this comment). An important piece of information is that batches share certain attributes, including fill color (we're assuming that color is opaque, since we likely won't batch transparent objects). We can handle batching in the Canvas API by creating a batch buffer, adding shapes to the batch as the code executes, and, once one of the attributes changes, rendering the batch before creating another one in the same fashion. This ensures that everything that depends on z (i.e. the order of rendering, for the Canvas API) due to a change in attribute gets rendered correctly, while still allowing batched rendering for shapes whose attributes are shared. I'd recommend making WebGL handle it in about the same way, but we could also just add a z (fragment depth) value into the buffer so that things render the same even if drawn out of order, if you think that'd be better.
The only major issue after that, and the reason why I'm talking about collision detection for shapes with strokes, is this:
[Image: Batched Circles (strokes merged into one outline)]
[Image: Unbatched Circles (each circle stroked separately)]
Both of these were rendered with the Canvas API in this example: https://editor.p5js.org/PotatoBoy/sketches/bK7dW9odY

As you can see, by batching shapes within a single path, any shapes that have strokes get their strokes combined due to them sharing a path, which is why we need to make sure that we don't end up batching shapes like this and accidentally render something like the "Batched Circles" example shown above.
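A minimal version of that stroke-safety check could use axis-aligned bounding boxes (a conservative sketch; some shapes may need tighter tests):

```javascript
// Refuse to add a stroked shape to the batch if its bounding box touches
// any queued shape, so strokes never merge like in "Batched Circles".
function boxesOverlap(a, b) {
  return a.x < b.x + b.w && b.x < a.x + a.w &&
         a.y < b.y + b.h && b.y < a.y + a.h;
}

function canJoinStrokedBatch(batchBoxes, box) {
  return batchBoxes.every(other => !boxesOverlap(other, box));
}
```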
> If we have more than one transform in the batch, I think we have to multiply all points by the current matrix (possibly using a simpler method like you described to avoid actually doing a matrix multiply), but in that case, it mostly only matters how many points we multiply. (I suppose each time the matrix changes, we have to check if it's a simple matrix or not, and maybe that cost could add up? But I'm less concerned about that than about vertex counts.)
I was thinking about automatically avoiding matrices for transforms where they aren't necessary, rather than checking whether a matrix is simple or not, since it's often more efficient to not use matrices for those transformations at all. However, as a side note, I do think that rotations by 90°, 180°, and 270° are also worth handling in software, since they don't need a full matrix multiply.
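For instance (assuming 2D points about the origin), those quarter-turn rotations reduce to swaps and sign flips:

```javascript
// No sines, cosines, or matrix multiplies needed at runtime.
const rotate90 = (v) => ({ x: -v.y, y: v.x });
const rotate180 = (v) => ({ x: -v.x, y: -v.y });
const rotate270 = (v) => ({ x: v.y, y: -v.x });
```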
> Here's a demo that doesn't do matrix transforms, it just handles translation: https://editor.p5js.org/davepagurek/sketches/lPfCkf4vk
>
> You can play around with how many points per particle there are, and how many particles, to see how it affects the frame rate. For small numbers, batching is fine, but it starts to lag as the numbers get higher. On my computer, these all are handled fine using the current rendering method. A high number of points per particle isn't unusual, especially if shapes are drawn with curves. This is also with no collision detection yet.
That example uses a translation matrix, which wasn't really what I had intended. Here's an example that doesn't use a transformation matrix for the transform operation: https://editor.p5js.org/PotatoBoy/sketches/XFOO0ynfG

And here are the results with the example you sent (they're also in the comments below the sketch's code):
| Run | CPU transforms (ms/frame) | GPU transforms (ms/frame) |
| --- | --- | --- |
| 1 | 13.633110270584812 | 13.844476744186046 |
| 2 | 13.688285130002921 | 13.643759177679883 |
| 3 | 13.662770309760374 | 13.72785361603253 |
| Avg | 13.66138857 | 13.738696513 |

CPU outperforms GPU by 0.077307943 ms/frame
As you can see, with simpler transformations, like the single `translate` operation for each particle, the CPU performs close to the GPU, if it doesn't outperform it outright. These are the operations I'm proposing to include, multiple per batch, through software transformations instead of hardware-based matrix ones.
> Again, I think this is still worth trying, but it's not a clear cut across-the-board gain. I think when determining when to flush a batch, we'd need to consider:
>
> 1. At a certain number of vertices, don't try to do anything on the CPU
> 2. How much of an added cost it is to do the collision detection
> 3. How costly it is to check if a matrix is simple, and what counts as simple (is it just 2D matrices? do we special case non-rotation matrices? should we give up immediately whenever someone calls `applyMatrix` with a custom matrix?)
> 4. How do we handle custom shaders that might affect the positioning and therefore correctness of the collision detection? Do we give up immediately here too?
> 5. In real world sketches, are bounding boxes too loose and cause us to lose all performance gains by having boxes intersect and flush batches? If so, are there other collision checks that we can do that don't add too much overhead?
(I added numbering so that it's easier to see what's for what.)
`translate` and `scale` are going to be some of the best and easiest transformations to do in software, although I think that we can and should include some rotations too. As for `applyMatrix()`, I think we should flush and render the current buffer once it's called, and just start building a new one based on the new transformation given by `applyMatrix()`.
> Also, because I think this wouldn't be a breaking change, this doesn't have the same somewhat tight deadline that other 2.0 proposals have (we want to put breaking API changes into the 2.0 release to comply with semantic versioning), so although I feel that this needs R&D and isn't as straightforward as other proposals, I'm down to help guide any of that R&D if contributors are interested in trying it, and we're free to finish it in a 2.x release if we need to.
While I do agree that this can be paused (p5.js has been fine without batching for quite a while, after all), I'd still like to note that this will require major pipeline changes to facilitate, and that 2.0 is prioritizing performance in its new choices, like the math libraries; a rendering optimization like batching would fit perfectly into that.
> We can handle batching in the Canvas API by creating a batch buffer, adding shapes to the batch as the code executes, and, once one of the attributes changes, rendering the batch before creating another one in the same fashion.
Makes sense; I was still thinking of batches spanning color changes in my comment.
> The only major issue after that, and the reason why I'm talking about collision detection for shapes with strokes, is this:
> That example uses a translation matrix, which wasn't really what I had intended. Here's an example that doesn't use a transformation matrix for the transform operation: https://editor.p5js.org/PotatoBoy/sketches/XFOO0ynfG
> While I do agree that this can be paused (p5.js has been fine without batching for quite a while, after all), I'd still like to note that this will require major pipeline changes to facilitate, and that 2.0 is prioritizing performance in its new choices, like the math libraries; a rendering optimization like batching would fit perfectly into that.
So I'm mostly looking at this from a project management perspective. Like you said, there are a number of major changes required to make all this work, and so far it's all being discussed as one change. The danger is that, because this proposal doesn't exist in a vacuum, it has to compete for time resources with other proposals, and also involves modifying the same code as other proposals. To put this on as good footing as possible, I think it's best to divide this up into different chunks that provide incremental benefit on their own so that we aren't having to choose between doing all of this or doing none of it, and provide a clear path so that we can still easily continue down this road even if we aren't able to get to all of it in 2.0.
Here's an example of what I'm thinking. Each of these is a fairly large chunk of engineering, also ensuring it's well tested and correct, but still providing a tangible benefit:
1. A specialized `p5.Matrix` whose multiply implementation only modifies e.g. position values and a scale value internally
2. Initially, it can use the full `p5.Matrix` interface to make sure everything is always correct, leaving us the option of replacing those methods with more efficient versions as necessary

How do you feel about this structure and ordering?
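As a rough sketch of that first chunk (class and method names invented here, not actual p5 internals):

```javascript
// A restricted transform that only tracks translation and uniform scale,
// so applying it per vertex is two multiplies and two adds rather than a
// full matrix multiply. Anything it can't represent falls back to p5.Matrix.
class SimpleTransform {
  constructor() {
    this.tx = 0;
    this.ty = 0;
    this.s = 1;
  }

  translate(x, y) {
    this.tx += x * this.s;
    this.ty += y * this.s;
    return this;
  }

  scale(k) {
    this.s *= k;
    return this;
  }

  applyTo(x, y) {
    return [x * this.s + this.tx, y * this.s + this.ty];
  }
}

// new SimpleTransform().translate(10, 5).scale(2).applyTo(1, 1) -> [12, 7]
```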
All of this sounds pretty good, although I do think that this shouldn't just be an optional included module, since it's meant to impact the average sketch, not just performance-conscious coders who already know how to properly optimize their code.
**Increasing access**

This would increase the speed of projects.

**Which types of changes would be made?**

**Most appropriate sub-area of p5.js?**

**What's the problem?**

Currently, projects that draw many shapes can be very slow. This is largely due to the number of draw calls.

**What's the solution?**

This can be partially fixed with instancing, as seen in https://github.com/processing/p5.js/pull/6276 and in issue https://github.com/processing/p5.js/issues/6275, but another effective strategy, especially for those who don't want to deal with more complex techniques like manually combining meshes or instancing, would be to batch the rendering of objects together so that they can be drawn in fewer draw calls. This would, however, be quite the undertaking and require a fairly substantial change to p5.js, so if it's not possible or worth it to implement, I'd understand that too.

**Pros (updated based on community comments)**

**Cons (updated based on community comments)**

**Proposal status**

Under review
Note: The original non-RFC 2.0 issue is #6279