[p5.js 2.0 RFC Proposal]: Batching for the rendering of objects.

processing / p5.js

p5.js is a client-side JS platform that empowers artists, designers, students, and anyone to learn to code and express themselves creatively on the web. It is based on the core principles of Processing. http://twitter.com/p5xjs —

http://p5js.org/

GNU Lesser General Public License v2.1

21.7k stars 3.33k forks source link

[p5.js 2.0 RFC Proposal]: Batching for the rendering of objects. #6805

Open RandomGamingDev opened 9 months ago

RandomGamingDev commented 9 months ago

Increasing access

This would increase the speed of projects.

Which types of changes would be made?

[ ] Breaking change (Add-on libraries or sketches will work differently even if their code stays the same.)
[ ] Systemic change (Many features or contributor workflows will be affected.)
[X] Overdue change (Modifications will be made that have been desirable for a long time.)
[ ] Unsure (The community can help to determine the type of change.)

Most appropriate sub-area of p5.js?

[ ] Accessibility
[ ] Color
[X] Core/Environment/Rendering
[ ] Data
[ ] DOM
[ ] Events
[ ] Image
[ ] IO
[ ] Math
[ ] Typography
[ ] Utilities
[X] WebGL
[ ] Build process
[ ] Unit testing
[ ] Internationalization
[ ] Friendly errors
[ ] Other (specify if possible)

What's the problem?

Currently projects with multiple shapes drawn can be very slow. This is largely due to draw calls.

What's the solution?

This can be partially fixed with instancing like seen here: https://github.com/processing/p5.js/pull/6276 and in issue https://github.com/processing/p5.js/issues/6275, but another effective strategy, especially for those who don't want to deal with more complex things like combining meshes manually or instancing would be to batch the rendering of objects together so that they may be able to be drawn within less drawn calls. This would however be quite the undertaking and require a decently substantial change to p5.js, so if it's not possible or worth it to implement this, I'd understand that too.

Pros (updated based on community comments)

Speed: The performance gets increased drastically from this as can be seen here: https://github.com/processing/p5.js/issues/6527#issuecomment-1937994662, which, with p5.js's own tools reaches a ~200ms increase in speed and can be pushed much much farther.

Cons (updated based on community comments)

There aren't really any, it just changes the speed things are rendered at. The only one I can see is it being hard to implement.

Proposal status

Under review

Note: The original non-RFC 2.0 issue is #6279

davepagurek commented 9 months ago

Can you elaborate a bit on what exactly gets batched? In your other proposal you mentioned begin/endShape, would this mean combining adjacent draw calls into one shape if possible?

Some follow-ups:

Is this for both 2D and WebGL?
Is this applicable to things that change colors, materials, transforms, etc between shapes?

RandomGamingDev commented 9 months ago

Can you elaborate a bit on what exactly gets batched? In your other proposal you mentioned begin/endShape, would this mean combining adjacent draw calls into one shape if possible?

For instance, in the canvas API beginPath() and closePath() are used for batching so that you can render things with the same strokes and fill all at the same time, and you can do something similar with batching the vertices together into a single VAO so that they all get sent to the GPU at the same time. What I'm suggesting is that we keep track of the shapes so that we can batch them automatically.

Some follow-ups:
* Is this for both 2D and WebGL?

This is for both 2D and WebGL, since both support batching and aren't done in p5.js

* Is this applicable to things that change colors, materials, transforms, etc between shapes?

For the Canvas API, probably not for the colors and materials part, but for WebGL maybe. Otherwise for things like just drawing different shapes, yes those would be batched.

davepagurek commented 9 months ago

I think when evaluating this, we'd probably be weighing it against maybe less automatic solutions, like documenting manual batching using existing p5 tools as a performance optimization in the new set of tutorials we're currently writing for the in-progress new p5js.org site. Possibly also expanding the use of begin/endContour to make that easier too, as those function like moveTo in the native canvas API.

So two questions to be answered, I think, are (1) what the added engineering complexity of batching would be, and (2) can we handle enough cases to make that complexity worth it. For (1), I gather we'd have to store info about the draw call and avoid actually drawing it until we hit a new draw call that can't be grouped, or if we reach the end of draw()? And for (2), I think it wouldn't be able to support changes in blend modes, textures, line thickness, and probably color (definitely in 2D, and even for WebGL where we have vertex colors, if we have multiple colors per batch but they are all drawn at once, can we deterministically ensure the last one the user asked to draw ends up on top?) We'd probably also have to see whether transforms are worth being flattened into one shape, or if doing the transforms to each vertex on the CPU slows down complex shapes enough that it removes the benefit of batching. That tradeoff is clearly worth it when making a shape you'll continue to reuse, like in buildGeometry, but if the transforms change each frame and we have to keep incurring the cost, it may not be.

I think it's definitely worth surfacing manual batching in our docs and tutorials, and if investigation shows that we can get some good across-the-board improvements without needing a big overhaul of the rendering pipeline, that would be great too.

RandomGamingDev commented 9 months ago

So two questions to be answered, I think, are (1) what the added engineering complexity of batching would be, and (2) can we handle enough cases to make that complexity worth it.

For (1), I gather we'd have to store info about the draw call and avoid actually drawing it until we hit a new draw call that can't be grouped, or if we reach the end of draw()?

The added engineering complexity would mostly just be in adding to a buffer and rendering at the end of the draw() call with all of the draw calls which is what I'm thinking of.

And for (2), I think it wouldn't be able to support changes in blend modes, textures, line thickness, and probably color (definitely in 2D, and even for WebGL where we have vertex colors, if we have multiple colors per batch but they are all drawn at once, can we deterministically ensure the last one the user asked to draw ends up on top?)

Definitely with the CanvasAPI, but with WebGL we should be able to handle them, especially since WebGL has depth and depth buffers that can be used for ensuring that the last thing drawn ends up at the top even if rendered out of order.

We'd probably also have to see whether transforms are worth being flattened into one shape, or if doing the transforms to each vertex on the CPU slows down complex shapes enough that it removes the benefit of batching. That tradeoff is clearly worth it when making a shape you'll continue to reuse, like in buildGeometry, but if the transforms change each frame and we have to keep incurring the cost, it may not be.

It's definitely worth it to flatten it to all go into one buffer. We're already calculating the vertices so compressing them all into one shape shouldn't pose an issue especially since the transforms like 3d to 2d projection should for the most part still be done on the GPU for WebGL and doing the same thing, but using paths and moving around beginPath() and endPath() to draw things of the same stroke and color in a single draw call. Also, the thing is that this would divide on the cost. Instead of the multiple GPU calls currently slowing down programs, we're substituting them with adding to a CPU buffer and then rendering them all at the same time in a few or a single draw call at the end of draw(). Even the cost incurred by adding to a buffer instead is negligible to the possible performance gains and when especially considering the global nature of p5.js operations like fill() which already encourage users to write their code so that shapes of the same color are drawn together at the same time, which means that batching will come easily to many sketches.

I think it's definitely worth surfacing manual batching in our docs and tutorials, and if investigation shows that we can get some good across-the-board improvements without needing a big overhaul of the rendering pipeline, that would be great too.

It depends on what you consider large, but I think that the overhaul should be short of that and should be an above medium pipeline change.

My only issue with batching is how we will detect drawing multiple shapes with strokes that touch one another with the Canvas API since it'll affect the strokes if we don't separate those draw calls. For this, I see 2 main paths:

Calculate the radius of the circle containing the entire shape (aka distance to the farthest part of the shape) and then separate the draw calls if the circles are inside of one another (aka if the distance between the 2 shapes is less than the added distance to the farthest part of the shape of both)
Detect whether the shapes are touching using some collision detection algorithm like SAT in order to handle multiple shapes.

davepagurek commented 9 months ago

since WebGL has depth and depth buffers that can be used for ensuring that the last thing drawn ends up at the top even if rendered out of order.

This isn't true for blending transparency: some blend modes like ADD are commutative, but the default BLEND mode isn't. Also, when things are drawn at the same Z value, we use the painter's algorithm, so things the user asks to be drawn last end up on top. If everything ends up in one draw call, and everything is at the same z, it's unclear that we're able to preserve this property.

It's definitely worth it to flatten it to all go into one buffer. We're already calculating the vertices so compressing them all into one shape shouldn't pose an issue especially since the transforms like 3d to 2d projection should for the most part still be done on the GPU for WebGL and doing the same thing, but using paths and moving around beginPath() and endPath() to draw things of the same stroke and color in a single draw call.

If we want everything to be one buffer that we draw as one call, in WebGL, we need to deal with transforms. Currently, they're a uniform, so they don't change within a batch of vertices being drawn. I think we need to either:

apply the transform to all the vertices before sending it to the GPU
Add a matrix along with each vertex (similar to normals)
Split batches on transforms

(1) can be very costly. In this demo, I draw a bunch of bouncing balls, drawn as separate shapes, and drawn by flattening the transforms into one big vertex array using buildGeometry (since it already does this): https://editor.p5js.org/davepagurek/sketches/FAqbP5k8i Try toggling the TEST_CPU_TRANSFORMS constant to true to see how much the frame rate drops--for me, it drops from 60 to 20, and the majority of that time is doing the transform flattening: Screenshot 2024-02-12 at 11 01 38 AM

(2) is also costly as it sends a lot more data to the GPU by duplicating a transformation matrix per vertex. We did some similar tests when trying to optimize line rendering with instancing. On some computers, the duplication was fast and worth it overall, but on others, it made things much slower. Also, I don't think we can use WebGL stride and such to avoid the duplication because the shapes might not have the same number of vertices.

(3) would work around this issue, but puts some big limits on the usefulness of batching, since translate() and other transforms are very common in p5 code.

RandomGamingDev commented 9 months ago

This isn't true for blending transparency: some blend modes like ADD are commutative, but the default BLEND mode isn't. Also, when things are drawn at the same Z value, we use the painter's algorithm, so things the user asks to be drawn last end up on top. If everything ends up in one draw call, and everything is at the same z, it's unclear that we're able to preserve this property.

If the rendering depends on the values of the pixels being rendered over and has to be done in order, we can avoid batching those, although other types of blending would still work with batching. Otherwise, we can rely on the depth value in order to calculate what to do.

If we want everything to be one buffer that we draw as one call, in WebGL, we need to deal with transforms. Currently, they're a uniform, so they don't change within a batch of vertices being drawn. I think we need to either:

apply the transform to all the vertices before sending it to the GPU

Add a matrix along with each vertex (similar to normals)

Split batches on transforms

The main transformations can be easily done on the CPU since they don't rely on complex operations which means that they can still be batched in order to shorten the amount of draw calls. However, when dealing with more complex operations like 3d rotations, especially ones utilizing matrices (like the example you gave), which could better utilize the GPU's resources, we can just batch the ones with say, the same rotation, still reducing the number of draw calls while making sure that the GPU still does the transform operations, which should still provide better performance, or at least a performance deficit that's nearly negligible in the worst case scenario.

davepagurek commented 9 months ago

I think it's worth trying, although I'm anticipating there being a point at which the batching process itself starts to have more overhead than the separate batch calls + GPU transforms would, and I'd like to keep it from introducing slowness in those cases. Especially as there are many shapes, collision detection becomes more and more of a problem (games use spatial partitioning algorithms to help with that, but even coming up with partitions involves going over all the points in the shape, and some shapes can get big.) Setting a reasonable limit on the size of a batch could maybe help with that.

I think it's worth considering that it might be an easier and more effective change to let users do batching themselves:

It's easy to turn off the batching if you find it doesn't speed things up
People might be able to use it in a scenario where we can't if we're doing it automatically -- e.g. maybe we'd think shapes collide based on convex hulls, but actually the shapes are not convex, and the sketch author knows they won't intersect
We could be OK with blending, z order, etc breaking in those cases: it would be up to the user to batch the things where they know it's safe to batch or don't care about which one draws on top (and then we don't need lots of code to check for those cases)

RandomGamingDev commented 9 months ago

I think it's worth trying, although I'm anticipating there being a point at which the batching process itself starts to have more overhead than the separate batch calls + GPU transforms would, and I'd like to keep it from introducing slowness in those cases.

Because we're only talking about doing the main and simple transformations on the CPU, while more complex transformations that rely on things like matrix multiplications are done on the GPU, with the different GPU transformations having their own batches, it'd require an extreme and unusual condition like hundreds of translate and rotation statements getting called in an alternating order for there to be a significant enough deficit in performance to make the batching approach less performant compared to the non-batching one, which is unlikely in at least 99% of the real-world usage of p5.js.

Especially as there are many shapes, collision detection becomes more and more of a problem (games use spatial partitioning algorithms to help with that, but even coming up with partitions involves going over all the points in the shape, and some shapes can get big.) Setting a reasonable limit on the size of a batch could maybe help with that.

Even with collision detection, batching should still provide more than enough of a significant advantage to performance. We'll want to set a limit on the size of a batch, but it'll be massive and something that a lot of sketches won't reach, as realistically, if the detection is well optimized with say, quad-trees for collision, the time spent trying to detect collisions would still be less than the cost of separating everything into more draw calls. Plus, collision detection will only have to be done on shapes with strokes and that use the CanvasAPI, with the cost in performance being nearly insignificant unless the shapes are always in the same sector of the quadtree and require a more expensive check or just forgoing it and rendering the batch before moving on to another. All in all, collision detection is only needed for certain shapes, only has significant cost for some of them in certain conditions, and still provides a performance increase compared to the alternative, making it absolutely still worth it in this case. Although, we likely won't be doing collision detection for some of the custom shapes created with things like beginShape() and endShape(), especially if they're convex.

I think it's worth considering that it might be an easier and more effective change to let users do batching themselves:

It's easy to turn off the batching if you find it doesn't speed things up

People might be able to use it in a scenario where we can't if we're doing it automatically -- e.g. maybe we'd think shapes collide based on convex hulls, but actually the shapes are not convex, and the sketch author knows they won't intersect

We could be OK with blending, z order, etc breaking in those cases: it would be up to the user to batch the things where they know it's safe to batch or don't care about which one draws on top (and then we don't need lots of code to check for those cases)

While letting the user batch everything themselves would work for users that really want to optimize their performance, it'd be unnecessarily confusing for them and be an unnecessary loss in performance in nearly all cases for anyone not using these features from not batching like many other rendering libraries do. I think that, while manual batching options should be offered (especially better ones than just beginShape() & endShape()) like instancing in order to allow for hand-tuned performance especially when dealing with things like things that rely on the pixels before them (like blending), I think that we should try to automatically handle batching in order to maximize performance for sketches without that since it shouldn't make any significant changes to them other than the performance increase. Also, we should give the ability to disable and enable batching if wanted even if we use the automatic mode, not only for easier bug testing, but edge cases as well where the user might want to say, write an extension for p5.js without batching. I think that an enableBatching() and disableBatching() as well as a flushBatch()/renderBatch() function for forcing p5.js to render the batch early and start a new batch would be great options.

davepagurek commented 9 months ago

Plus, collision detection will only have to be done on shapes with strokes and that use the CanvasAPI

I'm not sure I'm following why it only applies to those. since the last drawn shape at the same z value should go on top, shouldn't all shapes need to know if they'll collide with something else in the batch?

it'd require an extreme and unusual condition like hundreds of translate and rotation statements getting called in an alternating order for there to be a significant enough deficit in performance to make the batching approach less performant compared to the non-batching one

If we have more than one transform in the batch, I think we have to multiply all points by the current matrix (possibly using a simpler method like you described to avoid actually doing a matrix multiply), but in that case, it mostly only matters how many points we multiply. (I suppose each time the matrix changes, we have to check if it's a simple matrix or not, and maybe that cost could add up? But I'm less concerned about that than about vertex counts.)

Because we're only talking about doing the main and simple transformations on the CPU, while more complex transformations that rely on things like matrix multiplications are done on the GPU, with different transformations having their own batches, it'd require an extreme and unusual condition like hundreds of translate and rotation statements getting called in an alternating order for there to be a significant enough deficit in performance to make the batching approach less performant compared to the non-batching one, which is unlikely in at least 99% of the real-world usage of p5.js.

Here's a demo that doesn't do matrix transforms, it just handles translation: https://editor.p5js.org/davepagurek/sketches/lPfCkf4vk You can play around with how many points per particle there are, and how many particles, to see how it affects the frame rate. For small numbers, batching is fine, but it starts to lag as it gets higher. On my computer, these all are handled fine using the current rendering method. A high number of points per particle isn't unusual, especially if shapes are drawn with curves. This is also with no collision detection yet.

Again, I think this is still worth trying, but it's not a clear cut across-the-board gain. I think when determining when to flush a batch, we'd need to consider:

At a certain number of vertices, don't try to do anything on the CPU
How much of an added cost it is to do the collision detection
How costly it is to check if a matrix is simple, and what counts as simple (is it just 2D matrices? do we special case non-rotation matrices? should we give up immediately whenever someone calls applyMatrix with a custom matrix?)
How do we handle custom shaders that might affect the positioning and therefore correctness of the collision detection? Do we give up immediately here too?
In real world sketches, are bounding boxes too loose and cause us to lose all performance gains by having boxes intersect and flush batches?
- If so, are there other collision checks that we can do that don't add too much overhead?

davepagurek commented 9 months ago

Also, because I think this wouldn't be a breaking change, this doesn't have the same somewhat tight deadline that other 2.0 proposals have (we want to put breaking api changes into the 2.0 release to comply with semantic versioning), so although I feel that this needs R&D and isn't as straightforward as other proposals, I'm down to help guide any of that R&D if contributors are interested in trying it, and we're free to finish it in a 2.x release if we need to.

RandomGamingDev commented 9 months ago

I'm not sure I'm following why it only applies to those. since the last drawn shape at the same z value should go on top, shouldn't all shapes need to know if they'll collide with something else in the batch?

For both the canvas API and WebGL, no (I'm assuming that you don't mean an actual separate "z value" for the CanvasAPI and are including it in this comment). An important piece of information is that batches should share certain attributes including fill color (we're assuming that color's opaque since we likely won't batch transparent objects). We can deal with batching on the Canvas API by creating a batch buffer, adding shapes to the batch as the code executes, and when once one of the attributes change, rendering the batch before going on to create another one in the same fashion. This ensures that everything that depends on the z (aka order of rendering for the Canvas API) due a change in attribute gets rendered correctly while still allowing for batching of shapes, whose shared attributes allows for batched rendering. I'd recommend making WebGL deal with it in about the same way, but we could also just add a z (fragment depth) value into the buffer so that things are rendered the same even if it's out of order if you think that'd be better.

The only major issue after that, and the reason why I'm talking about collision detection for shapes with strokes is because of this:

Batched Circles

Unbatched Circles

Both of these were rendered with the Canvas API in this example: https://editor.p5js.org/PotatoBoy/sketches/bK7dW9odY

As you can see, by batching shapes within a single path, any shapes that have strokes get their strokes combined due to them sharing a path, which is why we need to make sure that we don't end up batching shapes like this and accidentally render something like the Batched Circles example shown above.

If we have more than one transform in the batch, I think we have to multiply all points by the current matrix (possibly using a simpler method like you described to avoid actually doing a matrix multiply), but in that case, it mostly only matters how many points we multiply. (I suppose each time the matrix changes, we have to check if it's a simple matrix or not, and maybe that cost could add up? But I'm less concerned about that than about vertex counts.)

I was thinking about avoiding using matrices for transforms where they aren't necessary automatically instead of checking whether it's simple or not since it's often more efficient to not use them for transformations. However, as a side note, I do think that rotations with values of:

0°
90°
180°
270° when mapped to be within the 0-360 range (inclusive, exclusive) should be simplified down to swaps and negations.

Here's a demo that doesn't do matrix transforms, it just handles translation: https://editor.p5js.org/davepagurek/sketches/lPfCkf4vk You can play around with how many points per particle there are, and how many particles, to see how it affects the frame rate. For small numbers, batching is fine, but it starts to lag as it gets higher. On my computer, these all are handled fine using the current rendering method. A high number of points per particle isn't unusual, especially if shapes are drawn with curves. This is also with no collision detection yet.

That example uses a translation matrix which wasn't really what I had intended. Here's an example that doesn't utilize a transformation matrix for the transform operation https://editor.p5js.org/PotatoBoy/sketches/XFOO0ynfG and here's the results with the example you sent (they're also in the comments below the sketch's code):

CPU transforms
1. 13.633110270584812
2. 13.688285130002921
3. 13.662770309760374
Avg: 13.66138857

GPU transforms
1. 13.844476744186046
2. 13.643759177679883
3. 13.72785361603253
Avg: 13.738696513

CPU outperforms GPU by 0.077307943 ms/frame

As you can see, with simpler transformations, like the single translate operation for each particle, the CPU performs close to, if it doesn't outperform the GPU's operations. These are the operations I'm talking about including multiple of within a single batch through software transformations instead of hardware based matrix ones.

Again, I think this is still worth trying, but it's not a clear cut across-the-board gain. I think when determining when to flush a batch, we'd need to consider:

At a certain number of vertices, don't try to do anything on the CPU

How much of an added cost it is to do the collision detection

How costly it is to check if a matrix is simple, and what counts as simple (is it just 2D matrices? do we special case non-rotation matrices? should we give up immediately whenever someone calls applyMatrix with a custom matrix?)

How do we handle custom shaders that might affect the positioning and therefore correctness of the collision detection? Do we give up immediately here too?

In real world sketches, are bounding boxes too loose and cause us to lose all performance gains by having boxes intersect and flush batches? If so, are there other collision checks that we can do that don't add too much overhead?

(I added numbering so that it's more easy to see what's for what)

I think that the certain number of vertices part is pretty reasonable, although I do still think that the bound will be high enough for most sketches to never reach it.
The collision detection will be pretty expensive. I think what we go for is a quadtree where a shape is stored in a node based on its position and the amount of space it occupies, with it being higher up in the tree the larger it is. Whether or not we do any collision detection beyond that tree like with SAT (Separating Axis-Theorem) is up to the contributor and which they believe will most greatly impact the most sketches. (Note: We should try to get this in alongside the benchmark system or at least heavily benchmark the PR before merging it in order to make sure that whatever's chosen works well.)
I think we can go a bit beyond 2D for the operations. I think that non-rotation matrices like translate and scale are going to be some of the best and easiest transformations to do in software, although I think that we can and should include some rotations. As for applyMatrix(), I think that we should flush & render the current buffer once it's called, and just start building a new one based on the new transformation given by applyMatrix().
If the user is already using custom shaders, that means that they're choosing to optimize for performance by utilizing the GPU and have some understanding of the GPU and rendering as well as a willingness to dive into it, not to mention all the things that batching can mess with if a custom shader is active, which is why yes, we should give up on batching as soon as a custom shader is being used.
In the real world, yes, the collision detection could possibly be too lose and accidentally flush a batch, however that would only mean that we have an extra GPU render, which isn't that bad if we can get a performant and accurate enough collision detector. As for other collision checks, there's the previously mentioned quad-tree which I think would be quite effective.

RandomGamingDev commented 9 months ago

Also, because I think this wouldn't be a breaking change, this doesn't have the same somewhat tight deadline that other 2.0 proposals have (we want to put breaking api changes into the 2.0 release to comply with semantic versioning), so although I feel that this needs R&D and isn't as straightforward as other proposals, I'm down to help guide any of that R&D if contributors are interested in trying it, and we're free to finish it in a 2.x release if we need to.

While I do agree that this can be paused (p5.js has been fine without batching for quite a while after all), I'd still like to note that this will still require major pipeline changes in order to facilitate this and that 2.0 is prioritizing performance for its new choices like math libraries, which a rendering optimization like the addition of batching would fit perfectly into.

davepagurek commented 9 months ago

We can deal with batching on the Canvas API by creating a batch buffer, adding shapes to the batch as the code executes, and when once one of the attributes change, rendering the batch before going on to create another one in the same fashion.

Makes sense, I was still thinking of batching color in my comment.

The only major issue after that, and the reason why I'm talking about collision detection for shapes with strokes is because of this:

That example uses a translation matrix which wasn't really what I had intended. Here's an example that doesn't utilize a transformation matrix for the transform operation https://editor.p5js.org/PotatoBoy/sketches/XFOO0ynfG and here's the results with the example you sent (they're also in the comments below the sketch's code):

While I do agree that this can be paused (p5.js has been fine without batching for quite a while after all), I'd still like to note that this will still require major pipeline changes in order to facilitate this and that 2.0 is prioritizing performance for its new choices like math libraries, which a rendering optimization like the addition of batching would fit perfectly into.

So I'm mostly looking at this from a project management perspective. Like you said, there are a number of major changes required to make all this work, and so far it's all being discussed as one change. The danger is that, because this proposal doesn't exist in a vacuum, it has to compete for time resources with other proposals, and also involves modifying the same code as other proposals. To put this on as good footing as possible, I think it's best to divide this up into different chunks that provide incremental benefit on their own so that we aren't having to choose between doing all of this or doing none of it, and provide a clear path so that we can still easily continue down this road even if we aren't able to get to all of it in 2.0.

Here's an example of what I'm thinking. Each of these is a fairly large chunk of engineering, also ensuring it's well tested and correct, but still providing a tangible benefit:

Batching of small shapes with no strokes and the same color:
- Avoids all situations that would need collision detection (different colors, different textures, different stroke thickness, simultaneous strokes+fills, and 2D mode transparency since that doesn't blend in a batch. Time needs to be allocated to finding other cases that need to be avoided)
- Unbatches when the number of points is above a fairly small threshold to avoid slowing anything down
- Does not modify the transformation system yet
- Needs to modify the new renderer API to include a way to defer rendering to a batch
- Needs to modify the new renderer API to include a per-renderer check to see if a new shape can be batched
- Adds an API to manually avoid batching as a fallback
- Ideally, make this an optional included module so that it can fall back on no batching for users who don't feel they need it and don't want the larger library size
Efficiency improvements for simple transformations:
- Uses a more efficient implementation instead of a matrix for simple transforms
  - Since matrices are used everywhere in the codebase, I think it's probably not maintainable to not use them entirely, as it just isn't feasible to safely test that we've made modifications in all the necessary places for all cases
  - Instead, maybe we can make a different class that conforms to the same interface as p5.Matrix whose multiply implementation only modifies e.g. position values and a scale value internally
  - It would be able to convert itself to a normal matrix to comply with everything else in the p5.Matrix interface to make sure everything is always correct, leaving us the option of replacing those methods with more efficient versions as necessary
- We'd be able to increase the vertex count threshold accordingly
Simple collision detection to allow more properties to be in the same batch:
- Needs to modify the new shape APIs to be able to get bounds information out of all types of segments that one might put into shapes
- Needs to implement a data structure (maybe quadtree, maybe an occupancy grid, also a common choice in games)
- Would now enable simultaneous stroke + fill support, WebGL different colors, and 2D mode transparency
- Would possibly reduce the vertex count threshold to accommodate the extra work. Need to test to find the right tradeoff of cost per shape vs more shapes batched
More advanced collision detection
- Explore bounding circles or ellipses (O(number of vertices in the new shape + circles to compare against))
- Explore using SAT (O(number of vertices in the new shape + sum of vertices in all shapes being compared against))
- Possibly explore shape decomposition for better bounds
  - convex decomposition is an option (O(number of vertices * number of reflex angles))
  - Something simpler like splitting shapes into chunks of segments (picking an arbitrary chunk size) might be an option too
- For all options, this also involves testing to find a good cutoff threshold when the number of checks is too high to be worth it and we unbatch

How do you feel about this structure and ordering?

RandomGamingDev commented 9 months ago

All of this sounds pretty good, although I do think that this shouldn't just be an optional included module since it's meant to impact the average sketch, not just performance conscious coders who already know how to properly optimize their code.