peterstace / simplefeatures

Simple Features is a pure Go Implementation of the OpenGIS Simple Feature Access Specification
MIT License
129 stars 19 forks source link

Remove switches in Geometry type dispatch #399

Open peterstace opened 3 years ago

peterstace commented 3 years ago

In type_geometry.go, there are lots of places where we perform a switch statement on the geometry type, grab out the concrete geometry, and then perform some action on it. Usually, the action is the same for each of the types. For example:

// AsText returns the WKT (Well Known Text) representation of this geometry.
func (g Geometry) AsText() string {
    switch g.gtype {
    case TypeGeometryCollection:
        return g.AsGeometryCollection().AsText()
    case TypePoint:
        return g.AsPoint().AsText()
    case TypeLineString:
        return g.AsLineString().AsText()
    case TypePolygon:
        return g.AsPolygon().AsText()
    case TypeMultiPoint:
        return g.AsMultiPoint().AsText()
    case TypeMultiLineString:
        return g.AsMultiLineString().AsText()
    case TypeMultiPolygon:
        return g.AsMultiPolygon().AsText()
    default:
        panic("unknown geometry: " + g.gtype.String())
    }
}

Having this switch statement everywhere isn't ideal:

I'm not too sure what the ideal solution should be... Here's one idea though: We can create a helper that tries to wrap a single interface around all geometry types. The helper then does the switch for us. While the switch statement is still there, it would be shared across all instances of the switch pattern (so we'd only have a single switch, instead of over a dozen). The downside of this approach is that it's much less explicit, and a little bit more magical. It would look something like:

// AsText returns the WKT (Well Known Text) representation of this geometry.
func (g Geometry) AsText() string {
    var iface interface {
        AsText() string
    }
    g.cast(&iface)
    return iface.AsText()
}

// cast tries to store the geometry into a particular interface. The iface
// parameter should be a pointer to the interface value that is attempting to
// be casted to. It panics if it's unable to assign the concrete geometry type
// to the interface.
func (g Geometry) cast(ptrToInterface interface{}) {
    ptrVal := reflect.ValueOf(ptrToInterface)
    interfaceVal := ptrVal.Elem()
    switch g.gtype {
    case TypeGeometryCollection:
        interfaceVal.Set(reflect.ValueOf(g.AsGeometryCollection()))
    case TypePoint:
        interfaceVal.Set(reflect.ValueOf(g.AsPoint()))
    case TypeLineString:
        interfaceVal.Set(reflect.ValueOf(g.AsLineString()))
    case TypePolygon:
        interfaceVal.Set(reflect.ValueOf(g.AsPolygon()))
    case TypeMultiPoint:
        interfaceVal.Set(reflect.ValueOf(g.AsMultiPoint()))
    case TypeMultiLineString:
        interfaceVal.Set(reflect.ValueOf(g.AsMultiLineString()))
    case TypeMultiPolygon:
        interfaceVal.Set(reflect.ValueOf(g.AsMultiPolygon()))
    default:
        panic("unknown geometry: " + g.gtype.String())
    }
}
albertteoh commented 3 years ago

A thought/idea:

Benefits:

Drawbacks:

I don't have a thorough understanding of the codebase, so there may be other drawbacks/benefits that I haven't considered. Would be interested in people's thoughts on this proposal.

peterstace commented 3 years ago

Warning: big wall of text coming 😅

Thanks for the detailed suggestion!

I agree with most of the benefits that you outlined. In particular, you proposal solves the following nicely:

A common pattern adopted in callers using simplefeatures is to check g.Is...() then cast with g.As...(). Using type assertions, I feel, more elegantly does this as a single line v, ok := g.(...).

This is a really good point, and I think it goes further than elegance. I've seen people make the mistake (and I've made it myself) where g.IsMultiPoint() (for example) is checked, but then g.AsMultiLineString() is executed in the if statement. This causes a panic. I suspect this is due to people rushing (or being too willing to accept autocomplete suggestions without scrutinising them!). In either case, it's an easy mistake to make, and people will continue to make this mistake. The root cause is having to specify "MultiPoint" as part of the method name for both calls, so there will be the chance that they will differ accidentally. By using a type assertion, "MultiPoint" is only specified once, so there's no opportunity for a mismatch.

This is a nice advantage too:

Should remove the need for AsGeometry() since every concrete geometry implements the Geometry interface.

This simplifies the library API, and anything that simplifies the API (or makes it easier to use) is a definite win.

The other advantages are nice too, although I think they're really secondary advantages. Things that have to do with making internal parts of the library nicer are less important compared to making the API itself better. E.g. things like eliminating the switch statements and the panics.

For the drawbacks you outlined:

It's a breaking change, although being pre-v1.0, perhaps it's better now than later.

Agreed, this isn't ideal, but I don't think it's a show stopper.

I do like the discoverability of the g.Is... and g.As... idioms, and so we can certainly keep them; or another option is to introduce a g.To... that returns the casted type and a boolean/error.

This is a pretty cool idea actually... It's not something that I'd considered before. It solves the problem where the type name (such as "MultiPoint") can differ between Is and As causing a panic.

From initial investigations, it seems this proposal would involve a significant set of non-trivial changes.

This is true, but I don't think that should stop us. On the plus side, most of the changes would be relatively mechanical, so would be unlikely to be tricky.


There are a few other major disadvantage to using a Geometry interface, and they are actually the reason we switched to the current approach of having Geometry as a struct. Originally we had the Geometry type as an interface, fairly similar to your proposal here. We switched over to a struct in this PR, back in early 2020.

There are some comments in that PR that describe why we went with that way of doing things, but I'll recap here since I don't think I did a great job out outlining the reasons back then. I'll order the reasons from what I consider to be the most important to the least important:


I'm fairly hesitant to move forward with the proposal due to the disadvantages outlined above, even though I do agree that some of the advantages that the proposal would bring are quite valuable.

In particular, I really like your idea about the ToFoo() (Foo, bool) idiom, replacing a IsFoo() bool check followed by a AsFoo() Foo conversion. I think that's something that we could pursue independently of the rest of your proposal.

albertteoh commented 3 years ago

Warning: big wall of text coming 😅

Thanks for the warning ... and ditto 😅

The responses that follow are fairly abstract and on reflection, hard to understand, so in an effort to help explain what I mean, I've thrown together a quick runnable proof of concept to hopefully "illustrate" my responses:

https://play.golang.org/p/bnbR5Bk0T-D


we need the input geometries to be one of the 7 known concrete primitive types, since they are all treated very differently (and we'd need to cast to the concrete types in order to treat each differently)

IIUC, the problem we're facing with converting Geometry to an interface is that Geometry could be one of many concrete implementations (*Polygon, **Polygon, MyCustomPolygon, etc.) which would be near impossible to reliably cast; and I wholeheartedly agree. 😄

However, must we cast to concrete types? Could we instead cast to a Polygon (for example) interface type with a ToPolygon() (Polygon, bool)?

I'm mindful returning an interface breaks the "accept interfaces, return structs" principle, but I would ask the question "why". The principle only applies in cases where we're returning a single concrete type; in our case, it could return many implementations (as before *Polygon, **Polygon, MyCustomPolygon, etc.). This blog post puts it quite nicely:

"For example, if your function actually can return multiple types then obviously return an interface."

A problem arises when a user decides to pass in their own implementation of a Geometry interface.

Are the legitimate reasons to allow for this? If not, we can ensure that ToPolygon() returns a sealed Polygon interface, that prevents users from creating their own Polygon.

If Geometry is an interface, we no longer have the ability to execute any code for it when it's in its "zero-value" state (which would be nil if Geometry is an interface).

This is an excellent counter-argument to using an interface that I wasn't aware was a requirement. It's a pretty cool feature actually, to dynamically unmarshal a generic geometry JSON field to its concrete underlying geometry.

I think we could solve this by having a concrete Geometry struct that allows us to have a non-nil zero-value to help with unmarshalling, which itself embeds the generic Geometryer (for want of a better interface name), which exposes the common Geometry methods. This Geometryer can be set with any underlying geometry that implements its interface.

A Geometry interface would be a very large interface (the bigger the interface, the weaker the abstraction). We'd be looking at somewhere between 1 and 2 dozen methods on the interface.

Yes, I agree with this, and why the io.Reader is such a nice interface because it easy for many concrete types to implement.

In an effort to understand the reason behind this proverb, I went to the source itself :) https://www.youtube.com/watch?v=PAAkCSZUG1c&t=317s

My take-away from Rob Pike's talk is that the smaller the interface is, the more useful and re-usable it becomes.

A question in my mind is, is this a requirement for us? Do we want the Geometry interface to be re-usable to thousands of possible implementations? It seems we have a well-defined set of 7 concrete geometries and so intentionally want to limit the possible implementations to these 7.

An extension to this is that a strong abstraction, to me, means all concrete types implementing that interface have a reason to implement it, and not just forced to return a sentinel or zero value because it's not relevant. For example, if the Geometry interface were to have a NumRings() int method, I feel this would be a weak abstraction since it's not relevant to a Point.

From a cursory look at the methods defined in type_geometry.go, it seems it's been quite well designed such that those methods with the switch blocks appear to have all supported geometry types as cases. For example, it makes sense for all geometries to have IsEmpty, Dimension, Centroid, AsText, etc. methods. As such, it appears the Geometry interface can be a strong abstraction, even though it contains a large number of methods; assuming they're all used and necessary.

Having said that, I could be wrong and perhaps it makes sense break these methods out into more specialised sub-interfaces or removing methods altogether...I don't have enough domain knowledge right now to understand if this is necessary or not.

peterstace commented 3 years ago

IIUC, the problem we're facing with converting Geometry to an interface is that Geometry could be one of many concrete implementations (*Polygon, **Polygon, MyCustomPolygon, etc.) which would be near impossible to reliably cast;

Yes, your understanding is correct.

However, must we cast to concrete types? Could we instead cast to a Polygon (for example) interface type with a ToPolygon() (Polygon, bool)?

That's a good point. For the most part, geometric algorithms operate on concrete geometry types via their exposed methods and don't access their internals (even though the algorithms are in the same package as the concrete type definitions themselves). So it would be feasible from a technical perspective to do everything via interfaces corresponding to each concrete implementation.

The principle only applies in cases where we're returning a single concrete type; in our case, it could return many implementations (as before *Polygon, **Polygon, MyCustomPolygon, etc.).

To clarify, simplefeatures itself as a library would never return anything other than Polygon to a user (i.e. it wouldn't return a *Polygon, **Polygon, or MyCustomPolygon). I was more concerned that a user would supply one of those extra types as the value of a generic Geometry interface (or a Polygon interface, as you describe above).

Are the legitimate reasons to allow for this? If not, we can ensure that ToPolygon() returns a sealed Polygon interface, that prevents users from creating their own Polygon.

There's no legitimate reason to allow it, and it's a good idea to prevent it. Sealed interfaces are a good solution to that problem.

I think we could solve this by having a concrete Geometry struct that allows us to have a non-nil zero-value to help with unmarshalling, which itself embeds the generic Geometryer.

Ahh, yes that's very similar to the original solution we had here. The concrete geometry type used for unmarshalling was called AnyGeometry and contained an interface named Geom of type Geometry (back then, Geometry was an interface). That worked relatively well from a technical perspective, but we kept running into the problem where people were not using it correctly. Most typically, developers would keep trying to unmarshal into a Geometry interface and then complain that it didn't work even though the AnyGeometry was available for them to use. I think users generally knew about AnyGeometry, but I think it was a bit hard to grok. To me as a user (and with the experience gained by being one of the library authors) it felt clunky, but I think to other users it just felt confusing.

My take-away from Rob Pike's talk is that the smaller the interface is, the more useful and re-usable it becomes.

A question in my mind is, is this a requirement for us?

I think I agree with you on that point. We don't want a generic geometry interface to be reused by any implementations other than our own, so the size of the interface doesn't matter much. We don't care that it would be hard for others to implement, since we don't want them to anyway.

As such, it appears the Geometry interface can be a strong abstraction, even though it contains a large number of methods; assuming they're all used and necessary.

Yes, agreed.

albertteoh commented 2 years ago

I gave my proposal a shot, and in summary, I believe it's not feasible.

The code changes (excuse the mess, the goal was just to get tests to pass) can be found in https://github.com/albertteoh/simplefeatures/tree/399-use-interface, and the relevant piece of code discussed starts around here.

I got to a stage where all unit tests pass; however, there were some significant performance regressions, particularly around MultiPolygons, which averaged over 100%.

MultipolygonValidation/n=1-6                                    8.00 ± 0%     11.00 ± 0%   +37.50%  (p=0.000 n=15+15)
MultipolygonValidation/n=4-6                                    11.0 ± 0%      20.0 ± 0%   +81.82%  (p=0.000 n=15+15)
MultipolygonValidation/n=16-6                                   27.0 ± 0%      60.0 ± 0%  +122.22%  (p=0.000 n=15+15)
MultipolygonValidation/n=64-6                                   91.0 ± 0%     220.0 ± 0%  +141.76%  (p=0.000 n=15+15)
MultipolygonValidation/n=256-6                                   347 ± 0%       860 ± 0%  +147.84%  (p=0.000 n=15+15)
MultipolygonValidation/n=1024-6                                1.37k ± 0%     3.42k ± 0%  +149.45%  (p=0.000 n=15+15)
MultiPolygonTwoCircles/n=10-6                                   29.0 ± 0%      44.0 ± 0%   +51.72%  (p=0.000 n=15+15)

Zooming in on this particular benchmark to understand why there was such a large regression, a difference that stood out most are the additional memory allocations.

Running the MultipolygonValidation benchmark on its own, we can see my changes introduced more than double the memory allocations:

Before Benchmark

$ go test -bench MultipolygonValidation -benchmem -memprofile memprofile.out -cpuprofile profile.out  -run=^$ ./geom
goos: linux
...
BenchmarkMultipolygonValidation/n=1024-6            2263        497963 ns/op      271285 B/op       1371 allocs/op
...

After Benchmark

$ go test -bench MultipolygonValidation -benchmem -memprofile memprofile.out -cpuprofile profile.out  -run=^$ ./geom
goos: linux
...
BenchmarkMultipolygonValidation/n=1024-6            1980        585566 ns/op      304093 B/op       3420 allocs/op
...

Initially, I didn't quite understand why since I haven't introduced any new structs or attributes to existing geometries.

Digging a bit further into the problem, we can see a difference that stands out between the...

Before Memory Profile

orig_mem

and...

After Memory Profile

399_mem

... in that there's an additional bit of memory allocated to the call to lineString.ForceCoordinatesType from polygon.ForceCoordinatesType.

This additional allocation has quite an impact as we see in the (please ignore the colours, they convey no meaning)...

Before Flame

orig_torch

and ...

After Flame

399_torch

In the "After Flame", NewMultiPolygon occupies 20% of CPU time compared to 10% CPU time in "Before Flame" which correlates with our ~100% regression in performance we noted earlier.

More importantly, of that 20%, about two-thirds of that time is spent on memory allocations, both from the polygon.ForceCoordinateType and lineString.ForceCoordinateType(along with the final third on a slice allocation)! Whereas, originally, it was just a single memory allocation for []LineString.

So it looks like memory allocations are expensive (not to mention the GC later on).

Next question is, what exactly is causing the additional memory allocations?

Ah, we see that it's on line type_line_string.go L404 and type_polygon.go L540,544 where these allocations occurred.

(pprof) list ForceCoord
Total: 89872162
ROUTINE ======================== github.com/peterstace/simplefeatures/geom.lineString.ForceCoordinatesType in /vagrant/go/src/github.com/albertteoh/simplefeatures/geom/type_line_string.go
  17924643   17924643 (flat, cum) 19.94% of Total
         .          .    399:}
         .          .    400:
         .          .    401:// ForceCoordinatesType returns a new LineString with a different CoordinatesType. If a
         .          .    402:// dimension is added, then new values are populated with 0.
         .          .    403:func (s lineString) ForceCoordinatesType(newCType CoordinatesType) LineString {
  17924643   17924643    404:   return &lineString{s.seq.ForceCoordinatesType(newCType)}
         .          .    405:}
         .          .    406:
         .          .    407:// Force2D returns a copy of the LineString with Z and M values removed.
         .          .    408:func (s lineString) Force2D() LineString {
         .          .    409:   return s.ForceCoordinatesType(DimXY)
ROUTINE ======================== github.com/peterstace/simplefeatures/geom.polygon.ForceCoordinatesType in /vagrant/go/src/github.com/albertteoh/simplefeatures/geom/type_polygon.go
  36422471   54347114 (flat, cum) 60.47% of Total
         .          .    535:}
         .          .    536:
         .          .    537:// ForceCoordinatesType returns a new Polygon with a different CoordinatesType. If a dimension
         .          .    538:// is added, then new values are populated with 0.
         .          .    539:func (p polygon) ForceCoordinatesType(newCType CoordinatesType) Polygon {
  17826064   17826064    540:   flatRings := make([]LineString, len(p.rings))
         .          .    541:   for i := range p.rings {
         .   17924643    542:       flatRings[i] = p.rings[i].ForceCoordinatesType(newCType)
         .          .    543:   }
  18596407   18596407    544:   return &polygon{flatRings, newCType}
         .          .    545:}
         .          .    546:
         .          .    547:// Force2D returns a copy of the Polygon with Z and M values removed.
         .          .    548:func (p polygon) Force2D() Polygon {
         .          .    549:   return p.ForceCoordinatesType(DimXY)

... and we can confirm that these variables indeed escaped to the heap because we returned a pointer reference to those instances (along with the original slice memory alloc):

$ go build -gcflags '-m -l' ./geom 2>&1 | grep type_line_string | grep 404
geom/type_line_string.go:404:9: &lineString{...} escapes to heap

$ go build -gcflags '-m -l' ./geom 2>&1 | grep type_poly | grep 544
geom/type_polygon.go:544:9: &polygon{...} escapes to heap

# This memory allocation is also present in the original code, so we can "ignore" it.
$ go build -gcflags '-m -l' ./geom 2>&1 | grep type_poly | grep 540
geom/type_polygon.go:540:19: make([]LineString, len(p.rings)) escapes to heap

Why return a pointer and not just a copy so that it stays on the stack avoiding an expensive allocation?

It's because the existing API exposes a Scan function that overwrites the receiver's value. As such, the receiver must be a pointer, otherwise we're just overwriting the copy.

Because the primary goal of this proposal was to abstract the geometry types behind an interface, as long as the Scan requirement is present, we must always use the pointer of the underlying instance for it to satisfy the interface contract, hence why a pointer is returned in MultiPolygon (and other geometries).

Given the significant change involved in this particular proposal and little benefit gained in performance or usability, I think it wouldn't warrant further effort to explore optimisations.

peterstace commented 2 years ago

Thanks for the detailed writeup! Ahh, I didn't realise that the impact of all of the heap allocations would be that severe. Btw, I've found in the past that even if if pointers aren't used you will get a heap allocation anyway as soon as you store a value in an interface. So even with Scan, it would be hard to avoid.

peterstace commented 2 years ago

or another option is to introduce a g.To... that returns the casted type and a boolean/error.

I really like this idea, and I think it's worth pursuing. I'll created a new ticket for it, since it's a bit different from the issue here.