meerk40t / svgelements

SVG Parsing for Elements, Paths, and other SVG Objects.
MIT License
133 stars 29 forks source link

Parallelisation of `point()` #61

Closed abey79 closed 3 years ago

abey79 commented 3 years ago

As discussed in #52, here is a first draft of the parallelisation of point() for Line, Close, Arc, QuadraticBezier and CubicBezier. The parallel version is currently named _point_numpy() and not automatically called by point().

This draft PR also cleans up code duplication between Line and Close, through a new common super-class LinearPathSegment.

Fixes #52

Discussion on several points follows.

Path

This PR draft does not address Path.point(). In my project, I iterate over segments in order to avoid unnecessarily splitting straight elements (Line, Close). A properly linearised version of Path.point() would be somewhat complicated and would have a limited usefulness (because of the possible presence of linear segments).

A more useful addition could be a Path.linearize(max_len) where max_len is the maximum length of line sub-segments used to approximate curved elements. That's basically what I am looking for and I essentially have this function in my project.

Shapes

Some shape do have a point() member, although nothing about it has currently been done in this PR. In particular, _RoundShape has a lot of code duplication with Arc that would likely benefit from deduplication.

Would an approch such as the following be feasible?

# class Shape
def point(self, pos):
   p = Path(self)
   p.reify()
   return p.point(pos)

The parallelisation of this does bring back the Path.point() issue, although in this particular case paths would in principle contain a single segment (Arc, for _RoundShape).

Tests

I'm not sure what the previous situation was, but with this PR (and #58), tests now require scipy and numpy to be installed in order to pass. This is hardly an issue for me, but it's probably worth noting.

Final API

Pending feedback and required iteration on this PR, one way of having the parallel version of point() in the API could be the following (Arc.point() as example):

    def point(self, position):
        try:
            if _has_numpy and isinstance(position, np.ndarray):
                return self._point_numpy(position)    
        except ImportError:
            pass

        if self.start == self.end and self.sweep == 0:
            # This is equivalent of omitting the segment
            return self.start

        if position == 0:
            return self.start
        elif position == 1:
            return self.end
        else:
            t = self.get_start_t() + self.sweep * position
            return self.point_at_t(t)

_has_numpy would be globally declared if global NumPy's import succeeded.

abey79 commented 3 years ago

Some quick and dirty benchmarks:

from svgelements import *
import numpy as np

t10 = np.linspace(0, 1, 10)
t100 = np.linspace(0, 1, 100)
t1000 = np.linspace(0, 1, 1000)

arc = Arc(Point(13.152548373912, 38.873772319489), Point(14.324014604836, 24.436855715076), Point(-14.750000067599, 25.169681093411), Point(-43.558410063178, 28.706909065029), Point(-19.42967575562, -12.943218880396), 5.89788464227)

%timeit for t in t10: arc.point(t)
180 µs ± 4.79 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit for t in t100: arc.point(t)
2.1 ms ± 47.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit for t in t1000: arc.point(t)
21.3 ms ± 522 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit arc._point_numpy(t10)
38.7 µs ± 857 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit arc._point_numpy(t100)
40.2 µs ± 589 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit arc._point_numpy(t1000)
66.1 µs ± 295 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

q = QuadraticBezier(Point(123, 234), Point(923, 123), Point(543, 123))

%timeit for t in t10: q.point(t)
69.6 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit for t in t100: q.point(t)
705 µs ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit for t in t1000: q.point(t)
7.02 ms ± 97.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit q._point_numpy(t10)
20.2 µs ± 459 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit q._point_numpy(t100)
21.4 µs ± 914 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit q._point_numpy(t1000)
32.9 µs ± 492 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

c = CubicBezier(Point(234, 755), Point(123, 234), Point(923, 123), Point(543, 123))

%timeit for t in t10: c.point(t)
112 µs ± 958 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit for t in t100: c.point(t)
1.13 ms ± 17.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit for t in t1000: c.point(t)
11.5 ms ± 80.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit c._point_numpy(t10)
34.8 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit c._point_numpy(t100)
35.5 µs ± 347 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit c._point_numpy(t1000)
57.5 µs ± 754 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
tatarize commented 3 years ago

I would rename LinearPathSegment to be Linear while this doesn't seem to be a very useful change it makes this base class for Line and Close parallel the modern version in regbro/svg.path project. I have several times suggested changes to his project and made changes to mine to try to keep some interoperability. I didn't think it was worth de-duplicating but if you feel that it is paralleling the class name might be helpful going forward.

tatarize commented 3 years ago

as_complex() should obviously be a __complex__ as a dunder method, I added it already in 1.2.9's current master branch, since it was clearly an oversight on my part.

tatarize commented 3 years ago

The testing tended to already need scipy to pass. Since I tested the hypergeometric function for path length etc.

You'd edit the .travis.yml to make sure it could pass with test code testing the soft requirements. https://github.com/meerk40t/svgelements/blob/master/.travis.yml

abey79 commented 3 years ago

Noted. Tomorrow I will change the class name and rebase to use complex dunder instead of as_complex().

tatarize commented 3 years ago

Shapes don't strictly have a decomposition. They do because SVG 2.0 defined them. And I use their definitions pretty exclusively but saying something is .4 through a rectangle is a bit of a weird thing, but it would make sense to do this.

I would however not suggest using the code you have there. Except as maybe break-glass fallback code. As those are all kinda giant operations:

def point(self, pos):
   p = Path(self)
   p.reify()
   return p.point(pos)

You create a Path out of the decomposed elements then you reified that Path and got the point then threw everything out. For fallback code this obviously works. But, it's going to be better to actually do the same sort of calculations done on the Path() segments on the Shape segments here, caching the same value there, then find the non-reified Point and multiply just that point by the .transform.

Though I see why in your usecase this would be quite nice. Namely you could just apply a one size fits all .point() to anything that is of a Shape type including the Path.

Really some of the Path code here might be better off moved into the Shape code. But, clearly you're right and all of these should do their own .point() functions for any shape.

abey79 commented 3 years ago

This is likely past the intent of the present PR, but on way to go about it is to split Path/Shape class which would model the content of a SVG in all its diversity, and geometrical representation class (ArcGeom, PolyLineGeom, etc.) which would implement the geometry calculation and transformation. The Path/Shape class would each own a geometrical representation, to which they would delegate all geometrical computation. This way, Circle, Ellipse, and Arc would all use a ArcGeom representation.

This could also be a nice step towards a fuller Numpy optimisation as we could have two versions for the *Geom classes.

tatarize commented 3 years ago

The real fundamental difference between Arc and Circle is that ultimately Arcs are PathSegments and Circles are Shapes. These are defined in two radically different sections of the SVG Spec. The even worse offender here is between Line and SimpleLine. Literally they are both lines. They define a line. Technically they should both be called Line but I changed the stupid Shape line to be called SimpleLine. Also, regebro/svg.path#64 has a kind of elegant idea with the PathSegment of basing the Linear segments with Linear and the the curves segments with Curve. Which has a kind of nice parallelism to it. Also, it might give some opportunities to make solid fallback code depending on the type of segment it is.

tatarize commented 3 years ago

Okay, I've done some review and some of these functions aren't really needed. The issue with point is that it suffers a type error when trying to print the point. Since the data type is unknown by Length.str(), the code goes through a couple tests and then tries to cast it as a float but it can't do that so it crashes. But, the rest of the code works for this.

s = '%.12f' % (s)

Gets replaced with:

        try:
            s = '%.12f' % (s)
        except TypeError:
            return str(s)

When that happens we get Line/Close, CubicBezier, QuadraticBezier all working out of the gate. Since Point gets returned but the .x and .y are arrays. It does its job correctly and is basically just a class that stores a pair of values. Arcs will move differently depending on both variables. Unlike the other segments, you can't solve them independently in each dimension. This would absolutely require np.cos() and np.sin(), I can't a really reasonable way around this.

tatarize commented 3 years ago

Now, I'm finding this np compatible code for the segments. Arc.point() would need a soft requirement. Specifically if it suffers a type error at position=0, and we'd send try sending it to _numpy_point() and likely move _numpy_point_at_t() inline with that bit of code.

I'm not seeing anything here for Path.point(), this is certainly the end goal right?

Now, when these functions return Point classes with arrays for .x and .y these can be run through transform_point without issue.

>>> p = np.linspace(0,1,10)
>>> c = CubicBezier((0,0), (10,10), (20,-10), (30,0))
>>> q = c.point(p)
>>> q
Point([ 0.          3.33333333  6.66666667 10.         13.33333333 16.66666667
 20.         23.33333333 26.66666667 30.        ],[ 0.          2.30452675  2.88065844  2.22222222  0.82304527 -0.82304527
 -2.22222222 -2.88065844 -2.30452675  0.        ])
>>> Matrix("scale(5)").transform_point(q)
Point([  0.          16.66666667  33.33333333  50.          66.66666667
  83.33333333 100.         116.66666667 133.33333333 150.        ],[  0.          11.52263374  14.40329218  11.11111111   4.11522634
  -4.11522634 -11.11111111 -14.40329218 -11.52263374   0.        ])

The np code to do .point() on any shape would be to call .segments(transform=False). Apply the _calc_lengths() code or some equivalent (since this would be part of Shape) Then do the np code to access any indexes fitting the criteria of being within those length parameters, scaling that up to [0,1] for that segment and replacing each chunk within with the given points.

tatarize commented 3 years ago

So it looks like the correct path forward here is:

Path Forward

1) Correct Length.str() so it stops objecting to Point() of .x and .y arrays. 2) Add in a soft requirement at Arc.point() to defer to the _point_numpy() code on error. 3) Write .point() for all Shape classes and defer to ._point_numpy() code on objection. 4) The Shape._point_numpy() should utilize .segments(). Then loop through all the segment lengths and fill in the export values in the ranges that match for that segment range in each of them.

Might be unneeded.

tatarize commented 3 years ago

65 should do 3a (1 is already committed). This moves the Path.point() to Shape.point()

The reification failed to clear ._length and .lengths. Also, I thought you could get away with not reifying the shapes for a bit, but you really the shapes only work reified. You could have a square that is scaled non-uniformly and it would make evenly spaced points not evenly spaced if you applied the matrix afterwards. And .point() code is does not respect the transformation.

This should mean import the Arc._point_numpy() code, and I assume you either have or need numpy accelerated Shape.point() code. If not, I'll get to it tomorrow.

Most of the trying to use complex numbers isn't really needed. A 2xN array is fine for this stuff to move them around.

abey79 commented 3 years ago

A quick couple of comments:

I'm not seeing anything here for Path.point(), this is certainly the end goal right?

Yeah I hadn't done that yet in order to discuss the best way to do it, but yeah the soft requirement should be added to .point()

Edit: I initially mis-interpreted your comment. So yeah I guess Path.point() is still on the TODO list, although I'm unsure how to deal with embedded Move segment (ei compound paths).

Most of the trying to use complex numbers isn't really needed. A 2xN array is fine for this stuff to move them around.

That's true. As a matter of fact, if the order is good, you can get a Nx1 complex view of a Nx2 float array without incurring any copy. I've got a function for the opposite in my project:

def as_vector(a: np.ndarray):
    """Return a view of a complex line array that behaves as an Nx2 real array"""
    return a.view(dtype=float).reshape(len(a), 2)

The real advantage of complex is that it often simplifies the syntax because you don't have to worry about NP's broadcasting rules, etc.

I'll merge and adjust the PR to use Point with NP arrays!

abey79 commented 3 years ago

I did the following in 2f3724d2b9db918eba082ed694aabd1e13fd6771:

Here is the remaining TODO list for this PR (feedback obviously welcome!):

tatarize commented 3 years ago

I added a pull request on your pull request, this should add a Shape.point() and have the soft reference to np, when the position fails. The code there should work, but might be inefficient. Seems like there should have been a way to get the index values and I just rechecked the boolean.

There is some benefit to exposing Linear, if it's going to exist it should likely be exposed. Mostly the class would very clearly parallel the svg.path classes ( https://github.com/regebro/svg.path/blob/5b80c0f98b7be637fc1d201b75d2ce99b2cfabc2/src/svg/path/path.py#L38 ). Which has some benefit itself. And would maybe be better even with a dummy Curve for all non-linear PathSegments.

In theory all the parts are there, it's down to making it pretty and in the case of Shape._point_numpy() maybe needs a bit more TLC from somebody better at numpy than I currently am.

And I also think there's something to idea of having point() actually allow lists even without np. There'd be some considerable benefit to some of this stuff anyway. Since even without np there's some major speedups that could certainly be done. It's a bit outside of the current question at hand but I could absolutely see a .points() function that returns a PointList type (https://www.w3.org/TR/2015/WD-SVG2-20150709/shapes.html#InterfaceSVGPointList) It's not completely outside the svg spec.

abey79 commented 3 years ago

I merged your PR and the last changes on master. There was some conflicts I hope I didn't mess up, tests appear to pass though.

abey79 commented 3 years ago

I renamed _Linear to Linear. Tomorrow morning I will give a good look at Shape.point() and run some testing.

And would maybe be better even with a dummy Curve for all non-linear PathSegments.

I can definitely do that too if you feel it'd be better.

abey79 commented 3 years ago

And I also think there's something to idea of having point() actually allow lists even without np. There'd be some considerable benefit to some of this stuff anyway. Since even without np there's some major speedups that could certainly be done. It's a bit outside of the current question at hand but I could absolutely see a .points() function that returns a PointList type (https://www.w3.org/TR/2015/WD-SVG2-20150709/shapes.html#InterfaceSVGPointList) It's not completely outside the svg spec.

Although I see the point of this (pun not intended), I'm really wondering why someone interested in fast array operations would be reluctant to depend on NumPy. Given its incredible quality and widespread support, Numpy probably is the safest dependancy that you could possible add to a project. (If it wasn't for the constraints on the release process, I'm pretty sure it would be part of the standard library by now.)

Loosely related note: if you haven't seen it already, the official numpy paper published 2 months ago is a good read.

tatarize commented 3 years ago

There's some philosophy of not including any dependencies for stand-alone libraries. The best libraries stand alone. If somebody was interested in fast array operations they wouldn't be reluctant to depend on NumPy. But, if somebody was concerned with just robustly and correctly parsing SVG they might be reluctant to to depend on NumPy for that.

The use of a PointList on would be on my end would allow a consistent methodology here for getting a bunch of results at once, and also to do things like underpin the _PolyShape. class. I could likely rig it up so the soft requirements have less impact. There wouldn't be a hybrid array-like Point class where x is either a float or large array of numbers. But, PointList would have xs being either a list or an array of numbers. The class itself could either say whether it's numpy array back or whether it's list backed, or simply emulate a numpy array for most of the operations performed. Mostly it gets away from kind of weird result of a Point object with more than one point. But, also could provide speed boosts to those who don't care to include NumPy or query a Points() routine with a proper array. Also, svgelements tends to want useful things and a pointlist is a kind of useful structure. I could also use to back the LineSegments since they are all shapes of point-lists. Which has the kind of useful effect of making them able to conceal their backend structure and use numpy seamlessly if I needed by simply having the numpy include create a different bit of hidden backing data structure behind the implementation if numpy exists. Though I still don't see any easy way around the Arc point() code without a weird sort of delegation.

Either way, you should be good enough along to have really robust svg rendering and really fast linear interpolations. Happy publishing.

abey79 commented 3 years ago

I see your point. And yes, there is something to say about our current hybrid Point thing. Returning a PointList instead of a Point for a numpy input to .point() (or, ideally, any sequence input) would be a bit more elegant.

tatarize commented 3 years ago

Yeah, I'm thinking making a Pointlist class that when you call points() rather than point() it basically does the np.array code to calculate the typical answer. But, pointlist would cheat and work a bit like an array with regard to a overloading those main routines. The arc code would need to cheat a little bit and would do a for-loop over the data if it's a list and not an np.array. But, the points() could would assume you used a Pointlist or a np.array.

Basically rather than a import fail alternative method routine, it would assume everything is a normalish np.array and do the normal code return the answer. Pointlist would just be the assumed np.arrayless method to call that code if you wanted to do that. This would basically give us a clean bit of code without wrapping the answer in a Point at all. It would just return the answer, or call with a routine. Pointlist wouldn't even be needed to add right away. We could just have parallel code for numpy under points() rather than point.

I'm not sure if I should allow this code to settle out, or accept it as is, and just fix the bits that I view as rougher edges, before updating the svgelements on pypi.

abey79 commented 3 years ago

The points data member of Polygon and Polyline name-clashes with Shape.points(). One or the other has to be renamed. What would you suggest?

abey79 commented 3 years ago

In the last couple of commits I did the following:

Also, as noted above, Shape.points() is hidden by Polygon/Polyline points data member.

tatarize commented 3 years ago

Attribute definitions:

Name Value Initial value Animatable
points (none) yes

The points that make up the polygon. All coordinate values are in the user coordinate system.

Yeah, points is going to interfere. Lemme check the spec, or around there and see if there's a name that'll work. There's a couple functions that are almost similar SVGPathElement.getPointAtLength() and getPathSegAtLength() with .pathLength that would be more in line with the spec found elsewhere. But, that's too far from the beaten path.

Matplotlib has a function:

interpolated(self, steps)[source] Returns a new path resampled to length N x steps. Does not currently handle interpolating curves.

But not really doing the same thing.


I think the best name would be npoint() it conveys the sort of you want N points along this path, understanding. And includes np in the function name. npoints() would also seem logical but all the references there are tend to imply that function is very often used to mean len(polygon.points) for number points, like count them. I'm thinking the best name since points() is off the table is npoint().

tatarize commented 3 years ago

Yeah, my worst criticism is basically old docstrings. That's some damn fine code. I'll finish up a bit more testing and release the 1.3.0 version shortly. I think the utility outweighs the slight dishomogeneity of returning a list of points if numpy is or isn't installed. In theory if you didn't have numpy installed and you got that list of points and called some Point functions. This code would work fine on your machine but would crash on somebody else's machine because they had numpy installed. The alternative is to check the import and tell them to get bent without numpy. But so long as they access the stuff like [[],[]] it will all be consistent, and I'll just note it in the documentation. Worst case I'd remove the preamble include, and just add it to the npoint() code directly, it wouldn't change any of the API stuff so it's not much of an issue.