libgeos / geos

Geometry Engine, Open Source
https://libgeos.org
GNU Lesser General Public License v2.1
1.21k stars 362 forks source link

Geos 3.10 documentation contradicts implementation #564

Closed Ipiano closed 2 years ago

Ipiano commented 2 years ago

I understand that the C++ API should be treated as unstable, and I accept the risks of using an unstable API. This is not a complaint that the API changed. This is a bug in the documentation, which does not match the implementation.

As Geos stands today, it explicitly breaks one of the use cases outlined in the documentation as one that it caters to. I would like to understand if this is intentional or not so that I can decide whether to cut and run, fork the project, or try to fix it.

The documentation of geos::geom::CoordinateSequence states:

There are some cases in which you might want Geometries to store their points using something other than the GEOS Coordinate class. For example, you may want to experiment with another implementation, such as an array of Xs and an array of Ys. or you might want to use your own coordinate class, one that supports extra attributes like M-values.

You can do this by implementing the CoordinateSequence and CoordinateSequenceFactory interfaces. ...

Given this description, and the description of CoordinateSequence::getAt, which states

Whether or not the Coordinate returned is the actual underlying Coordinate or merely a copy depends on the implementation.

I believe it is reasonable to conclude that a custom CoordinateSequence implementation which does not store the underlying data as Coordinate objects should be allowed to construct temporary Coordinates any time it needs to interface with the rest of libGeos. Already, this is problematic because the getAt method is defined to return a reference, meaning at the very least the custom CoordinateSequence would have to hold a persistent Coordinate as a member variable that it could populate and return the reference to.

This issue becomes more obvious working with the filter functions CoordinateSequence::apply_ro and CoordinateSequence::apply_rw. The CoordinateFilter interface takes pointers to Coordinate objects, and in some cases these filters make assumptions about the underlying structure of the data or persistence of the pointers. The two specific issues I have come across while attempting to update libGeos are the implementations of geos::operation::valid::RepeatedPointFilter and geos::algorithm::locate::IndexedPointInAreaLocator::SegmentView. In the former case, there is an assumption made that the pointers passed in continue to be valid at least until the next pointer is passed in; in the latter case, there's an assumption that the two pointers passed in are adjacent to each other such that the first can be incremented to reach the second.

I see that a lot of the recent changes internally have been performance-oriented, and I laud that effort; however, they've come at the expense of an immensely useful piece of the API that has the potential to provide a bigger benefit than reducing the number of virtual calls in a hot loop. Allowing users to define their own CoordinateSequence type which integrates with the rest of geos and does not require using Coordinate as an underlying storage type allows users to utilize the power of this library without having to be constantly copying the data from their custom data storage format into and out of the Coordinate type.

I would like to know if this is an intentional break and using CoordinateSequence as I've described isn't actually supported. If so, the documentation should be changed to indicate that this use case is no longer supported. If not, then I believe the API of CoordinateSequence should be modified such that it doesn't return references or pass pointers to filters, and that filters should make no assumptions about the layout or persistence of data and should prefer to copy coordinates instead.

dbaston commented 2 years ago

I would say that the documentation has never been accurate, rather than recent changes breaking something, but please feel free to set me straight. As you point out, the getAt methods, which are only way to access the elements of a CoordinateSequence, return a reference to a Coordinate. Therefore, any implementation of CoordinateSequence must already have, or be able to produce, a reference to a Coordinate that is usable for the life of the CoordinateSequence.

You can't do much in GEOS without accessing Coordinates. A CoordinateSequence implementation that does not own a copy of all Coordinates (for example, a CoordinateSequence backed by three arrays of doubles) is going to have to create and persist a copy of each Coordinate the first time it's accessed. If you're going to have to copy each one of them on access, you might as well copy them all up front and stick them in a contiguous block of memory, no? That cost is miniscule compared to operations you might perform with the library (predicates/overlay/etc)

The flaw I see in CoordinateSequence is that it is an abstract type, so key methods like getAt are virtual and slow. But there's essentially no viable implementation other than CoordinateArraySequence. So we are paying a performance penalty for flexibility that is not needed.

pramsey commented 2 years ago

Time to remove the inheritance and have CoordinateSequence just be the implementation class? (Maybe also time to think about how we really want to handle M (and Z))

dbaston commented 2 years ago

Time to remove the inheritance and have CoordinateSequence just be the implementation class?

I think so. I'm toying with a single class that can store (via a union) any of (a) std::vector of Coordinates (b) a pointer to an external buffer of Coordinates (such as a 3D POINTARRAY) or (c) a single Coordinate as a value. That's a fun experiment, but I suspect CoordinateArraySequence is going to be good enough for all cases.

Ipiano commented 2 years ago

the documentation has never been accurate, rather than recent changes breaking something

As we've both noted, the current getAt makes it impossible to store data without using Coordinate as the underlying type, but I think until recently it would have been possible to back it with std::deque<Coordinate> (in theory - I'm not sure why you'd want to). But that's now definitely not possible because deque doesn't guarantee contiguous storage and there's assumptions about pointer math validity. That doesn't answer all the questions of how this could work, but it does demonstrate that other implementations could have been possible.

A CoordinateSequence implementation that does not own a copy of all Coordinates ... is going to have to create and persist a copy of each Coordinate

The fact that persisting the coordinate is required was not immediately apparent to me. Clearly it's required now, but the way that CoordinateSequence has been used previously did not require persisting the coordinate forever, as it would more often be copied or used immediately instead of the pointer stored. Not in all cases, but in enough that it worked for my use.

Regardless, it sounds like you're leaning toward removing the in inheritable nature of CoordinateSequence, which is a fine solution because it clarifies the expectations of the API.

Thank you for answering the question so quickly!

dr-jts commented 2 years ago

Time to remove the inheritance and have CoordinateSequence just be the implementation class? (Maybe also time to think about how we really want to handle M (and Z))

+1. Time to open an RFC or start a Discussion?

pramsey commented 2 years ago

Is there an action to be taken here? The documentation could be removed, but it's not wrong in theory, just a little blind to the limitations of a sub-classing approach in practice.

dbaston commented 2 years ago

I would say the quoted section is wrong and should just be removed:

There are some cases in which you might want Geometries to store their points using something other than the GEOS Coordinate class. For example, you may want to experiment with another implementation, such as an array of Xs and an array of Ys. or you might want to use your own coordinate class, one that supports extra attributes like M-values.

Yes, you might want this, but no, GEOS provides no means for you to do so.