Open peterstace opened 3 years ago
A thought/idea:
Geometry
an interface.Polygon
, etc.) implement this interface.g.Is...()
and g.As...()
functions would be replaced with type assertions:
if v, ok := g.(...); ok {
// g.As...() casting functions would be made redundant.
Benefits:
type_geometry.go
file as well as when adding future enhancements.g.Is...()
then cast with g.As...()
. Using type assertions, I feel, more elegantly does this as a single line v, ok := g.(...)
.panic("unknown geometry: " + g.gtype.String())
.AsGeometry()
since every concrete geometry implements the Geometry
interface.Geometry
to hold a reference to the concrete geometry and the concrete geometry to create a new Geometry
in order to cast back to its "parent" reference.Drawbacks:
type_geometry.go
and the concrete geometries, but also others such as alg_simplify.go
and alg_convex_hull.go
; let alone the _test.go
files. I'd be willing to do the work though, if we decide to go with this proposal. 😄 g.Is...
and g.As...
idioms, and so we can certainly keep them; or another option is to introduce a g.To...
that returns the casted type and a boolean/error.I don't have a thorough understanding of the codebase, so there may be other drawbacks/benefits that I haven't considered. Would be interested in people's thoughts on this proposal.
Warning: big wall of text coming 😅
Thanks for the detailed suggestion!
I agree with most of the benefits that you outlined. In particular, you proposal solves the following nicely:
A common pattern adopted in callers using simplefeatures is to check
g.Is...()
then cast withg.As...()
. Using type assertions, I feel, more elegantly does this as a single linev, ok := g.(...)
.
This is a really good point, and I think it goes further than elegance. I've seen people make the mistake (and I've made it myself) where g.IsMultiPoint()
(for example) is checked, but then g.AsMultiLineString()
is executed in the if statement. This causes a panic. I suspect this is due to people rushing (or being too willing to accept autocomplete suggestions without scrutinising them!). In either case, it's an easy mistake to make, and people will continue to make this mistake. The root cause is having to specify "MultiPoint" as part of the method name for both calls, so there will be the chance that they will differ accidentally. By using a type assertion, "MultiPoint" is only specified once, so there's no opportunity for a mismatch.
This is a nice advantage too:
Should remove the need for
AsGeometry()
since every concrete geometry implements the Geometry interface.
This simplifies the library API, and anything that simplifies the API (or makes it easier to use) is a definite win.
The other advantages are nice too, although I think they're really secondary advantages. Things that have to do with making internal parts of the library nicer are less important compared to making the API itself better. E.g. things like eliminating the switch statements and the panics.
For the drawbacks you outlined:
It's a breaking change, although being pre-v1.0, perhaps it's better now than later.
Agreed, this isn't ideal, but I don't think it's a show stopper.
I do like the discoverability of the g.Is... and g.As... idioms, and so we can certainly keep them; or another option is to introduce a g.To... that returns the casted type and a boolean/error.
This is a pretty cool idea actually... It's not something that I'd considered before. It solves the problem where the type name (such as "MultiPoint") can differ between Is
and As
causing a panic.
From initial investigations, it seems this proposal would involve a significant set of non-trivial changes.
This is true, but I don't think that should stop us. On the plus side, most of the changes would be relatively mechanical, so would be unlikely to be tricky.
There are a few other major disadvantage to using a Geometry
interface, and they are actually the reason we switched to the current approach of having Geometry
as a struct. Originally we had the Geometry
type as an interface
, fairly similar to your proposal here. We switched over to a struct in this PR, back in early 2020.
There are some comments in that PR that describe why we went with that way of doing things, but I'll recap here since I don't think I did a great job out outlining the reasons back then. I'll order the reasons from what I consider to be the most important to the least important:
My understanding of the intention behind Go interfaces (in the most general sense) is that so long as the implementing type matches the method set of the interface (both in method signature and agreed upon method semantics), it should be acceptable to store in the interface. E.g. with the io.Reader
interface, so long as a type implementing that interface has a Read(p []byte) (n int, err error)
method that matches the agreed upon semantics, any type should be able to be used as an io.Reader
. This would be a bit of a problem in simplefeatures in a few places. E.g. when doing set operations on geometries (i.e. union/difference/intersection), we need the input geometries to be one of the 7 known concrete primitive types, since they are all treated very differently (and we'd need to cast to the concrete types in order to treat each differently). A problem arises when a user decides to pass in their own implementation of a Geometry
interface. It won't cast to a known concrete geometry type (i.e. all attempted type assertions will fail). So we'd have to give an error in those cases (or panic). While that's not too bad, it can get a little bit more subtle than that. If we were to make Polygon
(for example) implement a Geometry
interface, then *Polygon
will also implement that interface automatically. So we'd actually have to try to do a type conversion to to both Polygon
and then *Polygon
if the first fails, in a case a user unexpectedly passed in a pointer value rather than the raw struct (the concrete geometry types are very cheap to copy and are immutable, so they're intended to be passed by value rather than by pointer). How far would we have to go though? Since **Polygon
would also implement the Geometry
interface. Maybe that could be solved with helper functions that use reflect to strip off point indirection though. In general though, my conception of Go interfaces is that they specify behaviour, and when a function accepts an interface, it is really saying "I can operate on this value no matter what it's concrete type is and without even knowing what its concrete type is, so long as its methods are implemented according to the semantics defined by the interface." That doesn't seem to me to be a good semantic fit for a generic Geometry
type. With a Geometry
type, we really want to say "a Geometry
value is exactly 1 out of the 7 concrete geometry values, and we should be able to determine which one it is and access it".
If Geometry
is an interface, we no longer have the ability to execute any code for it when it's in its "zero-value" state (which would be nil
if Geometry
is an interface). This is pretty useful in a lot of cases. E.g. if you have a struct containing a Geometry
, you can simply call json.Unmarshal(buf, &myStructContainingGeometry)
and the Geometry
's UnmarshalJSON
method will take care unmarshalling from GeoJSON for you. A similar thing happens with Value
and Scan
methods used for automated DB/SQL marshalling/unmarshalling. Before switching to making Geometry
a struct in 87, we had an additional AnyGeometry
type which helped with this, but it was pretty clunky and hard to use.
A Geometry
interface would be a very large interface (the bigger the interface, the weaker the abstraction). We'd be looking at somewhere between 1 and 2 dozen methods on the interface.
I'm fairly hesitant to move forward with the proposal due to the disadvantages outlined above, even though I do agree that some of the advantages that the proposal would bring are quite valuable.
In particular, I really like your idea about the ToFoo() (Foo, bool)
idiom, replacing a IsFoo() bool
check followed by a AsFoo() Foo
conversion. I think that's something that we could pursue independently of the rest of your proposal.
Warning: big wall of text coming 😅
Thanks for the warning ... and ditto 😅
The responses that follow are fairly abstract and on reflection, hard to understand, so in an effort to help explain what I mean, I've thrown together a quick runnable proof of concept to hopefully "illustrate" my responses:
https://play.golang.org/p/bnbR5Bk0T-D
we need the input geometries to be one of the 7 known concrete primitive types, since they are all treated very differently (and we'd need to cast to the concrete types in order to treat each differently)
IIUC, the problem we're facing with converting Geometry
to an interface
is that Geometry
could be one of many concrete implementations (*Polygon
, **Polygon
, MyCustomPolygon
, etc.) which would be near impossible to reliably cast; and I wholeheartedly agree. 😄
However, must we cast to concrete types? Could we instead cast to a Polygon
(for example) interface
type with a ToPolygon() (Polygon, bool)
?
I'm mindful returning an interface breaks the "accept interfaces, return structs" principle, but I would ask the question "why". The principle only applies in cases where we're returning a single concrete type; in our case, it could return many implementations (as before *Polygon
, **Polygon
, MyCustomPolygon
, etc.). This blog post puts it quite nicely:
"For example, if your function actually can return multiple types then obviously return an interface."
A problem arises when a user decides to pass in their own implementation of a Geometry interface.
Are the legitimate reasons to allow for this? If not, we can ensure that ToPolygon()
returns a sealed Polygon
interface, that prevents users from creating their own Polygon
.
If Geometry is an interface, we no longer have the ability to execute any code for it when it's in its "zero-value" state (which would be nil if Geometry is an interface).
This is an excellent counter-argument to using an interface that I wasn't aware was a requirement. It's a pretty cool feature actually, to dynamically unmarshal a generic geometry JSON field to its concrete underlying geometry.
I think we could solve this by having a concrete Geometry
struct that allows us to have a non-nil zero-value to help with unmarshalling, which itself embeds the generic Geometryer
(for want of a better interface name), which exposes the common Geometry
methods. This Geometryer
can be set with any underlying geometry that implements its interface.
A Geometry interface would be a very large interface (the bigger the interface, the weaker the abstraction). We'd be looking at somewhere between 1 and 2 dozen methods on the interface.
Yes, I agree with this, and why the io.Reader
is such a nice interface because it easy for many concrete types to implement.
In an effort to understand the reason behind this proverb, I went to the source itself :) https://www.youtube.com/watch?v=PAAkCSZUG1c&t=317s
My take-away from Rob Pike's talk is that the smaller the interface is, the more useful and re-usable it becomes.
A question in my mind is, is this a requirement for us? Do we want the Geometry
interface to be re-usable to thousands of possible implementations? It seems we have a well-defined set of 7 concrete geometries and so intentionally want to limit the possible implementations to these 7.
An extension to this is that a strong abstraction, to me, means all concrete types implementing that interface have a reason to implement it, and not just forced to return a sentinel or zero value because it's not relevant. For example, if the Geometry
interface were to have a NumRings() int
method, I feel this would be a weak abstraction since it's not relevant to a Point
.
From a cursory look at the methods defined in type_geometry.go
, it seems it's been quite well designed such that those methods with the switch
blocks appear to have all supported geometry types as case
s. For example, it makes sense for all geometries to have IsEmpty
, Dimension
, Centroid
, AsText
, etc. methods. As such, it appears the Geometry
interface can be a strong abstraction, even though it contains a large number of methods; assuming they're all used and necessary.
Having said that, I could be wrong and perhaps it makes sense break these methods out into more specialised sub-interfaces or removing methods altogether...I don't have enough domain knowledge right now to understand if this is necessary or not.
IIUC, the problem we're facing with converting Geometry to an interface is that Geometry could be one of many concrete implementations (
*Polygon
,**Polygon
,MyCustomPolygon
, etc.) which would be near impossible to reliably cast;
Yes, your understanding is correct.
However, must we cast to concrete types? Could we instead cast to a
Polygon
(for example)interface
type with aToPolygon() (Polygon, bool)
?
That's a good point. For the most part, geometric algorithms operate on concrete geometry types via their exposed methods and don't access their internals (even though the algorithms are in the same package as the concrete type definitions themselves). So it would be feasible from a technical perspective to do everything via interfaces corresponding to each concrete implementation.
The principle only applies in cases where we're returning a single concrete type; in our case, it could return many implementations (as before
*Polygon
,**Polygon
,MyCustomPolygon
, etc.).
To clarify, simplefeatures itself as a library would never return anything other than Polygon
to a user (i.e. it wouldn't return a *Polygon
, **Polygon
, or MyCustomPolygon
). I was more concerned that a user would supply one of those extra types as the value of a generic Geometry
interface (or a Polygon
interface, as you describe above).
Are the legitimate reasons to allow for this? If not, we can ensure that
ToPolygon()
returns a sealed Polygon interface, that prevents users from creating their own Polygon.
There's no legitimate reason to allow it, and it's a good idea to prevent it. Sealed interfaces are a good solution to that problem.
I think we could solve this by having a concrete Geometry struct that allows us to have a non-nil zero-value to help with unmarshalling, which itself embeds the generic
Geometryer
.
Ahh, yes that's very similar to the original solution we had here. The concrete geometry type used for unmarshalling was called AnyGeometry
and contained an interface named Geom
of type Geometry
(back then, Geometry
was an interface). That worked relatively well from a technical perspective, but we kept running into the problem where people were not using it correctly. Most typically, developers would keep trying to unmarshal into a Geometry
interface and then complain that it didn't work even though the AnyGeometry
was available for them to use. I think users generally knew about AnyGeometry
, but I think it was a bit hard to grok. To me as a user (and with the experience gained by being one of the library authors) it felt clunky, but I think to other users it just felt confusing.
My take-away from Rob Pike's talk is that the smaller the interface is, the more useful and re-usable it becomes.
A question in my mind is, is this a requirement for us?
I think I agree with you on that point. We don't want a generic geometry interface to be reused by any implementations other than our own, so the size of the interface doesn't matter much. We don't care that it would be hard for others to implement, since we don't want them to anyway.
As such, it appears the Geometry interface can be a strong abstraction, even though it contains a large number of methods; assuming they're all used and necessary.
Yes, agreed.
I gave my proposal a shot, and in summary, I believe it's not feasible.
The code changes (excuse the mess, the goal was just to get tests to pass) can be found in https://github.com/albertteoh/simplefeatures/tree/399-use-interface, and the relevant piece of code discussed starts around here.
I got to a stage where all unit tests pass; however, there were some significant performance regressions, particularly around MultiPolygons
, which averaged over 100%.
MultipolygonValidation/n=1-6 8.00 ± 0% 11.00 ± 0% +37.50% (p=0.000 n=15+15)
MultipolygonValidation/n=4-6 11.0 ± 0% 20.0 ± 0% +81.82% (p=0.000 n=15+15)
MultipolygonValidation/n=16-6 27.0 ± 0% 60.0 ± 0% +122.22% (p=0.000 n=15+15)
MultipolygonValidation/n=64-6 91.0 ± 0% 220.0 ± 0% +141.76% (p=0.000 n=15+15)
MultipolygonValidation/n=256-6 347 ± 0% 860 ± 0% +147.84% (p=0.000 n=15+15)
MultipolygonValidation/n=1024-6 1.37k ± 0% 3.42k ± 0% +149.45% (p=0.000 n=15+15)
MultiPolygonTwoCircles/n=10-6 29.0 ± 0% 44.0 ± 0% +51.72% (p=0.000 n=15+15)
Zooming in on this particular benchmark to understand why there was such a large regression, a difference that stood out most are the additional memory allocations.
Running the MultipolygonValidation
benchmark on its own, we can see my changes introduced more than double the memory allocations:
Before Benchmark
$ go test -bench MultipolygonValidation -benchmem -memprofile memprofile.out -cpuprofile profile.out -run=^$ ./geom
goos: linux
...
BenchmarkMultipolygonValidation/n=1024-6 2263 497963 ns/op 271285 B/op 1371 allocs/op
...
After Benchmark
$ go test -bench MultipolygonValidation -benchmem -memprofile memprofile.out -cpuprofile profile.out -run=^$ ./geom
goos: linux
...
BenchmarkMultipolygonValidation/n=1024-6 1980 585566 ns/op 304093 B/op 3420 allocs/op
...
Initially, I didn't quite understand why since I haven't introduced any new structs or attributes to existing geometries.
Digging a bit further into the problem, we can see a difference that stands out between the...
Before Memory Profile
and...
After Memory Profile
... in that there's an additional bit of memory allocated to the call to lineString.ForceCoordinatesType
from polygon.ForceCoordinatesType
.
This additional allocation has quite an impact as we see in the (please ignore the colours, they convey no meaning)...
Before Flame
and ...
After Flame
In the "After Flame", NewMultiPolygon
occupies 20% of CPU time compared to 10% CPU time in "Before Flame" which correlates with our ~100% regression in performance we noted earlier.
More importantly, of that 20%, about two-thirds of that time is spent on memory allocations, both from the polygon.ForceCoordinateType
and lineString.ForceCoordinateType
(along with the final third on a slice allocation)! Whereas, originally, it was just a single memory allocation for []LineString
.
So it looks like memory allocations are expensive (not to mention the GC later on).
Next question is, what exactly is causing the additional memory allocations?
Ah, we see that it's on line type_line_string.go L404
and type_polygon.go L540,544
where these allocations occurred.
(pprof) list ForceCoord
Total: 89872162
ROUTINE ======================== github.com/peterstace/simplefeatures/geom.lineString.ForceCoordinatesType in /vagrant/go/src/github.com/albertteoh/simplefeatures/geom/type_line_string.go
17924643 17924643 (flat, cum) 19.94% of Total
. . 399:}
. . 400:
. . 401:// ForceCoordinatesType returns a new LineString with a different CoordinatesType. If a
. . 402:// dimension is added, then new values are populated with 0.
. . 403:func (s lineString) ForceCoordinatesType(newCType CoordinatesType) LineString {
17924643 17924643 404: return &lineString{s.seq.ForceCoordinatesType(newCType)}
. . 405:}
. . 406:
. . 407:// Force2D returns a copy of the LineString with Z and M values removed.
. . 408:func (s lineString) Force2D() LineString {
. . 409: return s.ForceCoordinatesType(DimXY)
ROUTINE ======================== github.com/peterstace/simplefeatures/geom.polygon.ForceCoordinatesType in /vagrant/go/src/github.com/albertteoh/simplefeatures/geom/type_polygon.go
36422471 54347114 (flat, cum) 60.47% of Total
. . 535:}
. . 536:
. . 537:// ForceCoordinatesType returns a new Polygon with a different CoordinatesType. If a dimension
. . 538:// is added, then new values are populated with 0.
. . 539:func (p polygon) ForceCoordinatesType(newCType CoordinatesType) Polygon {
17826064 17826064 540: flatRings := make([]LineString, len(p.rings))
. . 541: for i := range p.rings {
. 17924643 542: flatRings[i] = p.rings[i].ForceCoordinatesType(newCType)
. . 543: }
18596407 18596407 544: return &polygon{flatRings, newCType}
. . 545:}
. . 546:
. . 547:// Force2D returns a copy of the Polygon with Z and M values removed.
. . 548:func (p polygon) Force2D() Polygon {
. . 549: return p.ForceCoordinatesType(DimXY)
... and we can confirm that these variables indeed escaped to the heap because we returned a pointer reference to those instances (along with the original slice memory alloc):
$ go build -gcflags '-m -l' ./geom 2>&1 | grep type_line_string | grep 404
geom/type_line_string.go:404:9: &lineString{...} escapes to heap
$ go build -gcflags '-m -l' ./geom 2>&1 | grep type_poly | grep 544
geom/type_polygon.go:544:9: &polygon{...} escapes to heap
# This memory allocation is also present in the original code, so we can "ignore" it.
$ go build -gcflags '-m -l' ./geom 2>&1 | grep type_poly | grep 540
geom/type_polygon.go:540:19: make([]LineString, len(p.rings)) escapes to heap
Why return a pointer and not just a copy so that it stays on the stack avoiding an expensive allocation?
It's because the existing API exposes a Scan
function that overwrites the receiver's value. As such, the receiver must be a pointer, otherwise we're just overwriting the copy.
Because the primary goal of this proposal was to abstract the geometry types behind an interface, as long as the Scan
requirement is present, we must always use the pointer of the underlying instance for it to satisfy the interface contract, hence why a pointer is returned in MultiPolygon
(and other geometries).
Given the significant change involved in this particular proposal and little benefit gained in performance or usability, I think it wouldn't warrant further effort to explore optimisations.
Thanks for the detailed writeup! Ahh, I didn't realise that the impact of all of the heap allocations would be that severe. Btw, I've found in the past that even if if pointers aren't used you will get a heap allocation anyway as soon as you store a value in an interface. So even with Scan
, it would be hard to avoid.
or another option is to introduce a
g.To...
that returns the casted type and a boolean/error.
I really like this idea, and I think it's worth pursuing. I'll created a new ticket for it, since it's a bit different from the issue here.
In
type_geometry.go
, there are lots of places where we perform a switch statement on the geometry type, grab out the concrete geometry, and then perform some action on it. Usually, the action is the same for each of the types. For example:Having this switch statement everywhere isn't ideal:
I'm not too sure what the ideal solution should be... Here's one idea though: We can create a helper that tries to wrap a single interface around all geometry types. The helper then does the switch for us. While the switch statement is still there, it would be shared across all instances of the switch pattern (so we'd only have a single switch, instead of over a dozen). The downside of this approach is that it's much less explicit, and a little bit more magical. It would look something like: