Servers with many collections

cportele commented 6 years ago

Discussed during the WFS 3.0 Hackathon:

The Core is currently designed for a small number of collections. A large number of collections raises two issues:

The feature collections metadata response (/collections) may become large. This could be addressed by supporting paging and filtering in this resource. This could be done in an extension.
The API definition can become very large, too, and computationally intensive to compile, if each collection is listed as a separate operation. This could be addressed by using a parameter for the collection name, i.e. having only generic, parameterised /collections/{name} and /collections/{name}/items/{id} operations. Information about the available collections would then have to be obtained from the /collections operation. Another option would be to support filtering of collections in the api operation.

The paths used above are based on #64.
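
To illustrate the paging and filtering idea for /collections, here is a minimal sketch. The parameter names `limit`, `offset` and `q`, the `next` link relation, and the response layout are assumptions made for this example; none of them is defined by the Core or by this discussion.

```python
from urllib.parse import urlencode

def list_collections(collections, limit=10, offset=0, q=None):
    """Return one page of collection metadata plus a 'next' link if more remain."""
    # Hypothetical filter: keep collections whose name contains the search text.
    if q is not None:
        collections = [c for c in collections if q.lower() in c["name"].lower()]
    page = collections[offset:offset + limit]
    links = []
    if offset + limit < len(collections):
        params = {"limit": limit, "offset": offset + limit}
        if q is not None:
            params["q"] = q
        links.append({"rel": "next", "href": "/collections?" + urlencode(params)})
    return {"collections": page, "links": links}

# 25 made-up collections, served 10 at a time.
all_collections = [{"name": f"layer_{i}"} for i in range(25)]
first_page = list_collections(all_collections, limit=10)
```

A client would keep following the `next` link until a page arrives without one, so the size of any single response stays bounded regardless of how many collections the server holds.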

thorsten-reitz commented 6 years ago

Did you discuss an optional package construct for servers with many collections? In this way, collections could be organised in a hierarchical structure.

cportele commented 6 years ago

Discussion in web-meeting 2018-03-12: Not a pressing problem, do not address in the Core, but address as needed in an extension. Chuck will look into the issue and what can be described in the Guide.

lieberjosh commented 6 years ago

Sorry I was unable to attend the Hackathon, but it might be helpful to consider WFS 3.0 as various utility operations around a set of links to features. More complex or functional feature relationships may be better expressed in linked data entities and associated APIs, such as SPARQL endpoints, which then use WFS links to point at appropriate feature data including geometry.

—Josh


cmheazel commented 6 years ago

OpenAPI allows you to define a path as a template. The variables in the template can be defined as coming from an enumerated list. The URL template and enumerated lists are defined by the Server Object. See issue #90.
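
As a rough sketch of the Server Object approach, the fragment below is written as a Python dict that mirrors the OpenAPI 3.0 JSON structure (`servers`, `url`, `variables`, `enum`, `default`). The host name, the variable name `dataset` and its values are invented for illustration, and `expand_server_urls` is a hypothetical helper, not part of any OpenAPI tooling.

```python
from itertools import product

openapi_fragment = {
    "openapi": "3.0.1",
    "servers": [
        {
            # Hypothetical URL template with one enumerated variable.
            "url": "https://example.org/{dataset}",
            "variables": {
                "dataset": {
                    "enum": ["natural-earth", "buildings"],
                    "default": "natural-earth",
                }
            },
        }
    ],
}

def expand_server_urls(server):
    """List every concrete URL a Server Object template can expand to."""
    variables = server.get("variables", {})
    names = list(variables)
    # A variable without an enum can only be expanded to its default value.
    choices = [variables[n].get("enum", [variables[n]["default"]]) for n in names]
    urls = []
    for combo in product(*choices):
        url = server["url"]
        for name, value in zip(names, combo):
            url = url.replace("{" + name + "}", value)
        urls.append(url)
    return urls

server_urls = expand_server_urls(openapi_fragment["servers"][0])
```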

It is also possible to include a parameter in a path. The name of a Paths Object can be a URL template. Variables in that template are described using Parameter Objects where the "in" property = "path". Parameters can also be used in the query, header, or cookie. There are lots of options.
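
A path parameter of this kind might look as follows, again as a Python dict mirroring the OpenAPI 3.0 JSON structure. The response description is made up, and `undeclared_path_variables` is a hypothetical consistency check written for this example only.

```python
import re

paths_fragment = {
    # One generic operation instead of one operation per collection.
    "/collections/{name}": {
        "get": {
            "parameters": [
                {
                    "name": "name",
                    "in": "path",
                    "required": True,  # path parameters must be required
                    "schema": {"type": "string"},
                }
            ],
            "responses": {"200": {"description": "Metadata about one collection"}},
        }
    }
}

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}

def undeclared_path_variables(paths):
    """Find template variables that lack a matching 'in: path' Parameter Object."""
    missing = {}
    for template, item in paths.items():
        variables = set(re.findall(r"\{([^}]+)\}", template))
        for method, operation in item.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys of the Path Item Object
            declared = {
                p["name"] for p in operation.get("parameters", [])
                if p.get("in") == "path"
            }
            if variables - declared:
                missing[(template, method)] = variables - declared
    return missing

problems = undeclared_path_variables(paths_fragment)
```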

Lots of options means that we can create a big mess very quickly. In issue #90 I propose a delimiter-based approach for path templates. The intent is to provide flexibility while retaining semantics. We should be able to use this approach to address the multiple collections issue.

Take a look

cmheazel commented 6 years ago

Another perspective - OpenAPI describes an interface, not a server (or service). A single OpenAPI document can describe the offerings of dozens of servers. So it is reasonable to have multiple /collections paths as long as each one is rooted on a different URL. As I read the current draft, this would be an implementation decision.

jerstlouis commented 6 years ago

How about defining a hierarchy?

e.g. NaturalEarth/Cultural/ne_10m_admin_0_countries/

Filtering makes sense as an extension, but I feel that basic 'browsing' of a tree-like structure is a critically important piece of functionality.

cmheazel commented 6 years ago

@jerstlouis Are we looking for a way to include qualified names in a URI? For example, given the path /collections/{name} we could allow {name} to be "NaturalEarth:Cultural" or "NaturalEarth:Physical", etc. If so, then we just need to identify a delimiter for the qualified names which is legal to use in a URI template.
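
A minimal sketch of the qualified-name idea, assuming ":" as the delimiter (RFC 3986 permits a colon inside a non-initial path segment). The helper names are hypothetical. Each part is percent-encoded in full so that a literal ":" or "/" inside a part cannot be confused with the delimiter or a path separator.

```python
from urllib.parse import quote, unquote

DELIMITER = ":"  # assumed delimiter; the thread has not settled on one

def collection_path(*parts):
    """Build /collections/{name} where {name} is a delimiter-joined qualified name."""
    name = DELIMITER.join(quote(p, safe="") for p in parts)
    return "/collections/" + name

def split_qualified_name(path_segment):
    """Split the {name} segment back into its decoded parts."""
    return [unquote(p) for p in path_segment.split(DELIMITER)]

path = collection_path("NaturalEarth", "Cultural")
parts = split_qualified_name("NaturalEarth:Cultural")
```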

jerstlouis commented 6 years ago

Well I would certainly prefer the colon to the underscore, because the colon is a forbidden character in file names on Windows platforms and less likely to be used in a folder or layer name.

However, the main point is to be able to list the different hierarchy levels without the entire contents of all sub-directories, when considering cases with millions of layers (serving an entire mapping agency's data sets and cascading services from a single end-point). A delimiter alone doesn't solve that.

The nice thing about that combined with filtering too is that your service can also act as your catalog without needing a specialized service for that.

I argue that the list of collections (or layers) an end-point serves is not a description of the interface (or capabilities) and belongs separately. If you connect to the SEDAC WMTS service for example ( http://sedac.ciesin.columbia.edu/geoserver/gwc/service/wmts?service=wmts&request=GetCapabilities ), you get a 5.6 MB XML file which I find a ridiculous amount of data to do an initial service handshake, and then all your layers show up in your client in a very long list where you cannot find what you're looking for. Listing data layers served should be a separate operation, and that should support simple hierarchies as well as optional filtering capabilities (per geospatial or temporal extent, scale/resolution, data type, keywords, meta data fields, etc.).
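
The kind of level-by-level browsing described above could be sketched as follows. The function name, the response layout, and all collection identifiers other than the NaturalEarth/Cultural example quoted earlier in the thread are made up for illustration.

```python
def list_level(collection_ids, prefix=""):
    """Return only the immediate children of prefix: sub-folders and leaf collections."""
    base = [p for p in prefix.strip("/").split("/") if p]
    folders, leaves = set(), set()
    for cid in collection_ids:
        parts = cid.strip("/").split("/")
        if parts[:len(base)] != base:
            continue  # not under the requested level
        rest = parts[len(base):]
        if len(rest) == 1:
            leaves.add(rest[0])      # a collection directly at this level
        elif len(rest) > 1:
            folders.add(rest[0])     # something deeper: report only the folder
    return {"folders": sorted(folders), "collections": sorted(leaves)}

ids = [
    "NaturalEarth/Cultural/ne_10m_admin_0_countries",
    "NaturalEarth/Cultural/places",   # hypothetical
    "NaturalEarth/Physical/rivers",   # hypothetical
    "Imagery/scene_1",                # hypothetical
]
top = list_level(ids)
cultural = list_level(ids, "NaturalEarth/Cultural")
```

The response for any one level stays small even if the full tree holds millions of layers, because deeper content is summarised by folder name only.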

jerstlouis commented 6 years ago

Quoting you earlier @cmheazel , this is kind of my whole point:

"Another perspective - OpenAPI describes an interface, not a server (or service). A single OpenAPI document can describe the offerings of dozens of servers."

Shouldn't the same OpenAPI description apply to ANY service?

Is it possible to leave the actual collections listing outside? And does/could OpenAPI support resource paths with variable depths?

Just found these OpenAPI issues which discuss this:

https://github.com/OAI/OpenAPI-Specification/issues/892 https://github.com/OAI/OpenAPI-Specification/issues/1459

They are proposing this:

If a “+” suffix modifier is present, e.g. "/items/{itemId+}", the path parameter can match zero or more URL path segments. We call these “multisegment” path parameters.

This would work perfectly.
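
To make the proposed semantics concrete, here is a small matcher for templates that use the "+" suffix. This is only a sketch of the proposal quoted above; OpenAPI itself does not define this syntax, and the function names are hypothetical.

```python
import re

def template_to_regex(template):
    """Compile a path template into a regex with one named group per variable."""
    pattern, pos = "", 0
    for m in re.finditer(r"\{(\w+)(\+?)\}", template):
        pattern += re.escape(template[pos:m.start()])
        name, multi = m.group(1), m.group(2)
        # "{x+}" may span "/" (zero or more segments); "{x}" is exactly one segment.
        pattern += f"(?P<{name}>.*)" if multi else f"(?P<{name}>[^/]+)"
        pos = m.end()
    pattern += re.escape(template[pos:])
    return re.compile(pattern)

def match_path(template, path):
    """Return the extracted variables, or None if the path does not match."""
    m = template_to_regex(template).fullmatch(path)
    return m.groupdict() if m else None

deep = "/collections/NaturalEarth/Cultural/ne_10m_admin_0_countries"
multi = match_path("/collections/{collectionId+}", deep)
single = match_path("/collections/{name}", deep)
```

With the multisegment template the whole hierarchy is captured as one parameter value, whereas the ordinary single-segment template rejects the same path.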

jerstlouis commented 6 years ago

So it seems that OpenAPI allows you to list possible values by doing /api/{collectionID} for example. Shouldn't that be how it's done? Rather than stuffing all the layers inside /api which is the initial handshake?

And together with the multisegment path parameters it would support the use case.

OpenAPI doesn't currently support enumerating possible values for a parameter based on other parameters earlier in the path. In my opinion this is a major limitation and I filed an issue:

https://github.com/OAI/OpenAPI-Specification/issues/1693

This would make it possible to list the valid zoom levels for a given tiling scheme, the valid tiling schemes for a given collection, etc.
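
The dependency being asked for can be pictured as a nested lookup, where the allowed values of each parameter depend on the values chosen before it. The data and the helper name below are invented; this is not something OpenAPI can express today, which is the point of the issue filed above.

```python
# Hypothetical data: collection -> tiling scheme -> valid zoom levels.
VALID = {
    "countries": {"WebMercator": [0, 1, 2, 3], "WGS84": [0, 1]},
    "rivers": {"WebMercator": [0, 1]},
}

def allowed_values(tree, *earlier):
    """Enumerate valid values for the next parameter, given the earlier ones."""
    node = tree
    for value in earlier:
        node = node[value]  # raises KeyError for an invalid earlier value
    return sorted(node) if isinstance(node, dict) else list(node)

collections = allowed_values(VALID)
schemes = allowed_values(VALID, "countries")
zooms = allowed_values(VALID, "countries", "WGS84")
```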

cmheazel commented 6 years ago

I prefer to distinguish between services and APIs. A service is an implementation of the SOA pattern where processing is performed through service-specific operations. An API is an implementation of the Resource Oriented pattern where resources are accessed using HTTP verbs and paths. I can't say that everyone buys into this, but it helps me to keep things straight.

cmheazel commented 6 years ago

@jerstlouis Parameter dependencies, an interesting concept.
Would support for qualified names help? A namespace coupled with a value? Based on the discussions you listed above, I think this would be acceptable (even legal under version 3.0.1) if we choose the correct delimiter.

cmheazel commented 6 years ago

@jerstlouis Another option could be to switch to the HATEOAS pattern at some point. We have added support for alternative schema to the response media type schema. This frees you from the requirement that a response is specified in JSON schema. OpenAPI would then take you to the top-level metadata definition, which provides links to the next level, and so on. Similar to the WFS 3 approach for Collections and Collection. (just brain-storming here).
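
A sketch of what a link-following client could look like under that pattern. The in-memory `RESOURCES` table stands in for HTTP responses, and the link relation names (`data`, `item`, `items`) and paths are assumptions chosen for this example.

```python
# Stand-in for a server: each resource only knows links to the next level.
RESOURCES = {
    "/": {"links": [{"rel": "data", "href": "/collections"}]},
    "/collections": {
        "links": [
            {"rel": "item", "href": "/collections/buildings"},
            {"rel": "item", "href": "/collections/roads"},
        ]
    },
    "/collections/buildings": {
        "links": [{"rel": "items", "href": "/collections/buildings/items"}]
    },
}

def fetch(url):
    """Pretend HTTP GET; a real client would issue a request here."""
    return RESOURCES[url]

def follow(url, rel):
    """Return the targets of all links with the given relation type."""
    return [link["href"] for link in fetch(url)["links"] if link["rel"] == rel]

# Walk from the landing page down to one collection's features,
# never constructing a URL by hand.
collections_url = follow("/", "data")[0]
collection_urls = follow(collections_url, "item")
items_url = follow(collection_urls[0], "items")[0]
```

The client depends only on link relations, so the server is free to arrange its paths, including deeper hierarchies, without changing the API definition.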

cmheazel commented 6 years ago

@jerstlouis Now let's think about /api/{collectionId}. What you are asking for is separate OpenAPI documents based on the collection id. That's perfectly legal under OpenAPI. However, I would be worried about URI confusion. Across all of the multiple OpenAPI documents, is it possible for one extracted URL to point to two (or more) different resources?

pvretano commented 6 years ago

@cmheazel F.Y.I. WFS 2.5 took the HATEOAS approach. At every level there were hypermedia controls that would take you to the next resource(s). I prefer this approach.

akuckartz commented 6 years ago

"Another option could be to switch to the HATEOAS pattern at some point."

:+1: One reason for #167

cportele commented 6 years ago

A few thoughts:

a. This discussion is starting to look like a duplicate of #64 and #90. Maybe someone should make a concrete proposal for an extension, with an approach that works both in OpenAPI and with the HATEOAS pattern and that would continue to support the current path pattern for the simpler cases, i.e. the Core.

b. As discussed in #64 I think there is value in having consistent patterns in the URIs (in addition to hypermedia controls in the responses), at least in the Core.

c. The discussion does not consider an important resource, the dataset. If we do not take this into account, we are making a mistake. In schema.org/DCAT (key taxonomies for publishing data on the Web) datasets are important resources and we need to represent this in our resource architecture. Only this will get our datasets properly indexed by search engines, etc.

Which is why the Core discusses datasets and distributions in a way that is consistent with schema.org/DCAT. At least for the Core the rule is that the part of an API that conforms to the spec (and has paths .../api, .../collections etc.) is for one dataset. I.e., that part of the API represents a distribution of the dataset.

So, if you have multiple datasets that should be published via a single API, the approach consistent with the Core would be something like .../{datasetId}/api, not .../api/{datasetId}. Same with .../{datasetId}/collections/{collectionId}/....
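
The dataset-first layout can be sketched as a small path builder. The base URL, identifiers and helper name are hypothetical; the source leaves the leading part of the path elided ("..."), so it is passed in as a parameter here.

```python
from urllib.parse import quote

def dataset_scoped_paths(base, dataset_id, collection_id):
    """Build paths where the dataset comes first, then the Core resources."""
    root = f"{base.rstrip('/')}/{quote(dataset_id, safe='')}"
    return {
        "api": f"{root}/api",
        "collections": f"{root}/collections",
        "collection": f"{root}/collections/{quote(collection_id, safe='')}",
    }

# Hypothetical base URL and identifiers.
paths = dataset_scoped_paths("https://example.org/data", "naturalearth", "countries")
```

Everything below the dataset segment then looks exactly like a Core API for a single dataset.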

Any proposal for hierarchical collections should specify clearly how datasets and distributions are represented as resources in the proposal.

jerstlouis commented 6 years ago

@cportele my mention of /api/{collectionID} was referring to the OpenAPI functionality of enumerating the possible values for {collectionID}. With this, the /api itself could potentially be the same for different services serving different datasets.

I am in fundamental disagreement with the idea that a service end-point should represent a single dataset. I think of the service as directly mapping to an organization's SDI server, serving all datasets available within it. This makes it possible to use the end-point to implement catalog queries and the like. I have a single piece of software serving all these data sets, so why would I want more than one end-point? It doesn't make any sense to me to have multiple /api for this.

My proposal for hierarchical collections would depend on support for multi-segment paths, which OpenAPI currently does not support ( /collections/{collectionID+} ). At some level within your multi-segment collectionID, you would have a 'dataset', where the metadata would reside. Some datasets already have a hierarchical structure, and the current 'the whole end-point is a single data set' approach only accommodates a single level of 'collections' within that data set.

If we really wanted to make clear the dataset resource distinction I guess it would have to be /{dataSetId+}/collections/{collectionID+} to support multi-segment path in both the datasets as well as within the collections (for single data sets that have a more hierarchical structure). And then the /api could not be at the same level as collections without making it dataset specific...

cportele commented 6 years ago

@jerstlouis - I think you are mixing things. There is no question that it should be possible to use a single piece of software for serving multiple datasets (ours supports this, too). At the same time, it should also be possible to use a microservices architecture. There are multiple ways to extend the Core to allow that.

A boundary condition is that whatever we specify should be consistent with the Data on the Web Best Practices and identifying dataset and distribution resources plus providing metadata for them is an important part of it.

Whether it makes sense to support OpenAPI definitions for each dataset or not (i.e., whether to support modular APIs) is a separate discussion, and I am not sure if there is one answer for all cases. It could be an option to make the /api in the Core optional; the landing page for the dataset would then simply be required to point to the OpenAPI definition for the whole API that includes the paths for that distribution (or .../api could redirect, or be an alternate convenience URI for the canonical URI of the OpenAPI definition for the whole API).

By the way, in the discussion that led to the current path structure, we also discussed that it should be possible (for an API) to publish API definitions for each collection separately (see the /collections/buildings/api resource in the whiteboard image in #64).
