project-open-data / project-open-data.github.io

Open Data Policy — Managing Information as an Asset
https://project-open-data.cio.gov/

Improve metadata structure for listing web APIs #291

Closed philipashlock closed 9 years ago

philipashlock commented 10 years ago

This is a new issue for discussion from #261 and #224 that transitioned into a broader topic around the purpose of webService and the metadata around it. Some of the earlier discussion around webService occurred in #37

philipashlock commented 10 years ago

So you don't have to re-read the previous thread I'll copy over the last few posts on #261

From @philipashlock

my preference for a future version would be to put webService in the distribution array and pair it with something analogous to the format field: a URL or identifier that conveys what people should expect at the endpoint URL.

If we needed a new term for this, perhaps it would be called endpointType or serviceDefinition. An example would be a URI or identifier to denote that the endpoint URL specified is a description of the API represented as Swagger, RAML, or API Blueprint or even a specific kind of standardized endpoint like those based on Atompub (eg GData/OData), or WFS and WMS endpoints, or an Open311 API, or even generic database APIs like those associated with CouchDB and MongoDB. Some data catalogs are already set up to recognize special endpoints like this, eg with extensions CKAN can provide additional features for WMS endpoints or data stores supported by Recline.js like the CKAN Datastore.

The current POD schema differentiates file downloads from queryable/interactive endpoints with accessURL + format for file downloads and just webService for endpoints.

The analogous approach in DCAT is that file downloads are represented with downloadURL + mediaType while endpoints use accessURL but accessURL is meant to be inclusive so it could also be used for file downloads or even a landing page.

The distinction between format and mediaType is covered on #272

For reference, in DCAT this is how things are defined:

dcat:accessURL A landing page, feed, SPARQL endpoint or other type of resource that gives access to the distribution of the dataset

  • Use accessURL, and not downloadURL, when it is definitely not a download or when you are not sure whether it is.
  • If the distribution(s) are accessible only through a landing page (i.e. direct download URLs are not known), then the landing page link SHOULD be duplicated as accessURL on a distribution.

source: http://www.w3.org/TR/vocab-dcat/#Property:distribution_accessurl

dcat:downloadURL A file that contains the distribution of the dataset in a given format

  • dcat:downloadURL is a specific form of dcat:accessURL. Nevertheless, DCAT does not define dcat:downloadURL as a subproperty of dcat:accessURL not to enforce this entailment as DCAT profiles may wish to impose a stronger separation where they only use accessURL for non-download locations.

source: http://www.w3.org/TR/vocab-dcat/#Property:distribution_downloadurl

From @smrazgs

I think this reflects a lack of clarity about the scope and purpose of the metadata that's being constructed. In the webby world of html pages and web applications, having a URL associated with a resource is pretty straightforward. In the world of data, you have to think a little more deeply about what it's for. Here's a perspective: The user is looking for data-- they want information about something. The first concern is finding the data at a fairly abstract level, something like 'water quality information in my county', 'particulate concentrations near power plants', 'average income of people with pink hair', 'how many gallons of milk were produced in Wisconsin last year'. Once they find something that looks like what they need, they have to figure out how to get it in a form they can use, and whether or not they trust the source of information. Data can be distributed in many ways--and describing those 'ways' so that machines can use them is a tricky problem. Sure, downloading CSV files is cool and easy, but what do those pesky column headings like 'avginc', 'mmmgpp','daysToMarket','24yly' mean? And as someone pointed out earlier, so you give me a URL for an API endpoint, how do I automate a client to use that? More realistically, the data is probably available through multiple distributions, and the client software ideally would be able to inspect a collection of links in the metadata (DCAT, ATOM, ISO19139...) and figure out which one the software works with. Delve into Cat-interop for more discussion and links...

The bottom line is that 1) many resources are available via multiple distributions, and these should be describable in the metadata in such a way that automated clients can use them (HATEOAS if you like REST), so the distribution needs to allow multiple values; and 2) description of the links to make them machine-actionable requires associating properties with the links.

philipashlock commented 10 years ago

@smrazgs I think we're in full agreement here. There are a few different aspects to this problem that don't have widely agreed upon conventions though. For example:

  1. Machine readable metadata for describing a RESTful API - There's no widely adopted standard here, but a number that are gaining traction and are certainly worth using (eg Swagger, RAML, API Blueprint, IODocs).
  2. Common identifiers to refer to RESTful API metadata types - There's no widely adopted standard here, but some have proposed media types, eg application/swagger+json for Swagger
  3. Simple metadata definitions for data within common encapsulation formats (eg CSV) - While sophisticated machine readable schemas are the norm in the semantic web community, there isn't a widely adopted standard for simple data formats like CSV. I think proposals like the Simple Data Format have potential here though. As for JSON, we're already using JSON Schema (application/schema+json) right here on Project Open Data to define this metadata.

With these standards in place this metadata can be referenced in just the same way that any other media type would be and then the data can be interrogated programmatically. Using a media type to identify these specifications is just one possibility though, they could just as easily use a URI as is done with XML namespaces.

There are a number of other issues associated with APIs that aren't covered by these specs though, like where to go to get an API key, links to a staging or production version of the API, etc. For a lightweight spec that attempted to address things like that across many APIs, see the Service Discovery spec that has been implemented by most governments using Open311.

Another issue with APIs in the context of this metadata is that many APIs combine multiple datasets.

I think there are probably a number of scenarios where it would make more sense to create a whole new catalog entry for an API rather than just list it as another resource as part of an existing entry.

smrgeoinfo commented 10 years ago

@philipashlock good idea to start a new thread here. The back trail on this is long and has a low information density.

as for the issues you point out above:

  1. The issue is how to communicate which convention for the self-description document is being used (add WSDL, HAL, WADL, GetCapabilities, OpenSearchDescription to your list). From my survey of practice, most well-behaved services provide a standard request that returns a self-description document; the trick is you have to let the client know up front which of the many possible conventions the endpoint you're giving them conforms to.
  2. I think you're referring to the problem we're trying to address at github/cat-interop @tomkralidis
  3. In the XML world these are XML schemas, if I understand you correctly; in the JSON world we have JSON Schema, in the RDF world RDFS... The issue is that there is a service protocol (http being the most common, but lots of implementations tunnel requests through http), an encoding syntax (xml, json, csv, NetCDF...), and an application scheme that communicates the semantics of the particular encoding practice (GeoSciML, WaterML, VOID, DCAT...). For a client to automate interaction with a data service, all of these factors need to be known up front; otherwise you play a giant guessing game, which might work...
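
A rough sketch of the three factors above, assuming a link carried each of them explicitly so that a client could check candidates against what it implements. The field names (`protocol`, `encoding`, `application_scheme`), the example URLs, and the capability tuples are all hypothetical and not taken from any existing schema.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ServiceLink:
    """A distribution link annotated with the three factors discussed above.

    All field names here are hypothetical; no existing schema defines them.
    """
    url: str
    protocol: str            # service protocol, e.g. "http" or "OGC:WFS"
    encoding: str            # encoding syntax, e.g. "xml", "json", "csv"
    application_scheme: str  # semantics of the encoding, e.g. "WaterML"


def first_usable(links: Iterable[ServiceLink], supported: set) -> Optional[ServiceLink]:
    """Return the first link whose (protocol, encoding, scheme) the client supports."""
    for link in links:
        if (link.protocol, link.encoding, link.application_scheme) in supported:
            return link
    return None  # nothing matched: the client is back to guessing


links = [
    ServiceLink("http://example.com/wfs", "OGC:WFS", "xml", "GeoSciML"),
    ServiceLink("http://example.com/data.json", "http", "json", "DCAT"),
]
# Capabilities of an imaginary client that only understands DCAT as JSON over HTTP.
client_supports = {("http", "json", "DCAT")}

chosen = first_usable(links, client_supports)
print(chosen.url if chosen else "no usable link")
```

If any one of the three values is missing from the link, the lookup cannot succeed without trial requests, which is the guessing game described above.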

philipashlock commented 10 years ago

the trick is you have to let the client know up front which of the many possible conventions the endpoint you're giving them conforms to.

This is what I was addressing by the second point. For example, application/swagger+json for metadata describing an API using the Swagger syntax.

The issue is that there is a service protocol, http being the most common,

I think we can almost always assume HTTP in the context of a web based catalog

but lots of implementation tunnel request on top of http

This is where we need better identifiers, but again new media types like application/swagger+json could probably be good enough. Otherwise, agree on namespace URIs

an encoding syntax (xml, json, csv, NetCDF...),

Most of these are well established standards with defined MIME types

and an application scheme that communicates the semantics of the particular encoding practice

JSON Schema (application/schema+json), XML, and RDF already define a way of referencing their schemas, but I do think we should also have something for simple tabular CSV data. That could be the Simple Data Format, with its own unique media type to identify it as well.

philipashlock commented 10 years ago

I think my preferred approach to addressing this problem would be to simply align with the current version of DCAT which is to say:

tomkralidis commented 10 years ago

An API/service can provide multiple media types, so perhaps defining the resource type may be useful, and then using mediaType if required. I'm not sure about using a media type to identify an API.

For example an OGC:WMS can provide an addressable URL to something that would be a thumbnail. Which, in this case, I wouldn't define the URL as a WMS, but a simple thumbnail/browse image (which, in this case, just happens to be realized via OGC:WMS).

At the same time I could specify an OGC:WMS base URL and, perhaps, the resource name (layer name in WMS speak) as a means to provide minimal information, which then the client could bind to.
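
A minimal sketch of that second case, assuming the metadata supplies only a WMS base URL and a layer name. The base URL, layer name, bounding box, and image size are invented for illustration, and the base URL is assumed to carry no query string of its own. The query parameters are the standard WMS 1.3.0 ones.

```python
from urllib.parse import urlencode


def wms_capabilities_url(base_url: str) -> str:
    """Build a WMS 1.3.0 GetCapabilities request from a bare base URL."""
    params = {"SERVICE": "WMS", "VERSION": "1.3.0", "REQUEST": "GetCapabilities"}
    return f"{base_url}?{urlencode(params)}"


def wms_getmap_url(base_url, layer, bbox, width=256, height=256):
    """Build a GetMap request for one layer (for example a thumbnail image)."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",       # empty value requests the default style
        "CRS": "CRS:84",    # longitude/latitude axis order
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and layer name, standing in for values from the metadata.
base = "http://example.com/wms"
print(wms_capabilities_url(base))
print(wms_getmap_url(base, "vegetation", (-180, -90, 180, 90)))
```

The same base URL yields both a service description and a plain image, which is why the resource type and the media type of a particular response are worth recording separately.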

Food for thought.

mhogeweg commented 10 years ago

:+1: separate the type of the resource from the response formats it can produce.

smrgeoinfo commented 10 years ago

@philipashlock I think perhaps you're trying to solve a narrower set of problems than I'm thinking about. Please give this discussion paper a read and then we can continue the conversation if we're actually working on the same problem (which I think we should be... given the scope of what data.gov is supposed to do).

philipashlock commented 10 years ago

Sorry if I wasn't clear before, but I think the confusion may be that what I'm describing introduces another level of abstraction by simply referring to standardized API metadata documents (eg Swagger, RAML, API Blueprint, IODocs) that describe the API rather than try to describe the API directly. You do still need a way to identify what type of API metadata documents those are and it seems like a media type would be a fine way to do that. Then the API metadata documents themselves are what specify the media types available from the API itself.

@smrazgs I didn't read the paper too thoroughly, but I think we're still mostly on the same page. As an aside, I had to clone the repo to view the file since it doesn't look like github supports downloading the .doc file for some reason - probably just karma for using that format rather than a web friendly one to talk about hypermedia ;)

However, from what I can tell that proposal still depends on there being a separately known definition for whatever is specified by overlayAPI rather than allow another level of abstraction by pointing to metadata that describes the API in full. If I'm correct in understanding that, then I think the scope of that proposal is actually more narrow in this regard.

smrgeoinfo commented 10 years ago

Yes, if the distribution is a non-http ROA type endpoint, then you need to tell the user what kind of overlayAPI is being used, and provide a link to the service selfDescription document. The client software has to be able to recognize the identifier for the overlayAPI (that's what a lot of the Cat-interop discussion is about) to know if it's one that the client can work with.

I'm interested in other kinds of distribution-related links--templates, example direct data requests, distributions through services that offer lots of datasets so the metadata has to provide some kind of parameter (layer name, feature type...) for the client to know how to construct a request, different information models and profiles on the same media type. It's certainly subject to debate how much to tell the client up front in a list of DCAT:distribution, gmd:CI_onlineResource, atom:links etc., and how much to put in more specific service description docs that the client has to get and process. My tendency is to try and get more information upfront.

For now the open data project should provide lots of guidance and examples for lots of different kinds of distributions (OGC services, OpenDAP, HDF, OData, ugly-ole WS... as well as simple file-download) on conventions for accessURL, mediaType, format. The beauty of RDF is that it's not hard to extend the content.

mhogeweg commented 10 years ago

While I like the idea of auto-discovery of APIs (SOAP WSDLs do this for the ESB world and OGC GetCapabilities do this for OGC specs to some extent), there would be a lot of work to be done to make descriptions like these for the various APIs. OGC services use various specs to define what requests you can send and what you can expect back.

It is one thing to know there is a call http://www.example.com/myservice/thing/25 and that it responds with JSON; it's another to know what you can do with this.

It also appears that those documentation systems for RESTful web APIs could use some harmonization, to avoid having to implement all of them just in case a client app only understands RAML. Is there a role for Data.gov to facilitate or at least drive the developers of those documentation systems toward that standardization?

mhogeweg commented 10 years ago

@justgrimes expanded the discussion on recognizing links in #293. I suggest including his suggestions here.

edsu commented 10 years ago

I'd like to see some discussion of what you are trying to achieve with the proposed changes. What would more information about web services make possible? Why is the additional complexity important?

smrgeoinfo commented 10 years ago

If you want a machine agent to be able to process the metadata and get the user connected to the data through a service (not a file download or web page with instructions on how to get the data), then the link has to have more information than just a URL. Machine actionable links enable workflow composition (that's REST).

gbinal commented 10 years ago

The work around the APIs.json standard is definitely relevant here - http://apisjson.org/format.html

@kinlane

gbinal commented 10 years ago

Note that https://github.com/project-open-data/project-open-data.github.io/commit/1c93d7fd43e3bb4117ee7d3677a7cdf1495c82ce addresses part of this issue by deprecating webService and refining the role of accessURL.

philipashlock commented 9 years ago

I think this has been addressed by several changes that were not specific to APIs, but should be sufficient in addressing the issues discussed here.

As Gray mentioned, we've made some changes (#217, #330, #335) so that accessURL is used exclusively within a distribution object and also redefined so that it's only used for URLs that provide indirect access to a dataset whereas downloadURL should be used for providing a direct download URL to the machine readable representation of the dataset. This means accessURL is now defined in a way that makes sense for a URL that provides documentation and other information on accessing and using an API. These changes were made in part to align with the way they've been defined in DCAT (see #350). Other changes that have been part of that effort include renaming format to mediaType and then allowing format to be used as a human readable description of the format and mediaType. However, an accessURL is assumed to be a human readable webpage and a mediaType isn't required the way it is for downloadURL. With this in mind, I'd suggest that the format field be set as "API" for an accessURL that points to information about using an API.
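
To make the distinction concrete, here is a small sketch of how a consumer might classify a distribution under these definitions. The function name and the returned labels are made up for illustration; only the field names (`accessURL`, `downloadURL`, `mediaType`, `format`) come from the schema.

```python
def classify_distribution(dist: dict) -> str:
    """Label a distribution as a direct download, an API, or other indirect access."""
    if "downloadURL" in dist:
        # Direct downloads need a mediaType so a client knows what it will receive.
        if "mediaType" not in dist:
            raise ValueError("downloadURL requires mediaType")
        return "download"
    if "accessURL" in dist:
        # accessURL is indirect access; format "API" is the convention suggested above.
        return "api" if dist.get("format") == "API" else "indirect"
    raise ValueError("distribution has neither downloadURL nor accessURL")


# Hypothetical distributions used only to exercise the function.
print(classify_distribution({"downloadURL": "http://example.com/veg.csv", "mediaType": "text/csv"}))
print(classify_distribution({"accessURL": "http://example.com/api/", "format": "API"}))
print(classify_distribution({"accessURL": "http://example.com/about-the-data"}))
```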

For APIs that also have machine readable documentation (like Swagger, RAML, API Blueprint, etc) the approach described in #332 is just as applicable for APIs accessible via accessURL as it is for files downloadable via downloadURL. If there's machine readable documentation for an API it can be specified with describedBy and describedByType. The URL for the machine readable documentation would be specified by describedBy and then describedByType would be a media type that identifies the format of the machine readable documentation.

Here's an example of these fields used in a distribution:

"distribution": [
    {
        "accessURL": "http://www.agency.gov/api/vegetables/",
        "description": "A fully queryable REST API with JSON and XML output",
        "describedBy": "http://www.agency.gov/api/vegetables/swagger.json",
        "describedByType": "application/swagger+json",
        "format": "API",
        "title": "Vegetables REST API"
    }
]

For API specs that are more likely to be understood as conforming to an existing standard rather than interrogated via the describedBy URL, the conformsTo field (see #362) could be used to identify a well known standardized specification (eg, WMS, WFS, Open311)
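
Putting the three fields together, a client might decide how to handle an API distribution roughly as follows. This is only a sketch: the sets of recognized standards and description types, the placeholder `conformsTo` URI, and the function name are assumptions, not part of the schema.

```python
# Both sets are assumptions about what one particular client happens to support.
KNOWN_STANDARDS = {"http://example.org/spec/open311-georeport-v2"}  # placeholder URI
KNOWN_DESCRIPTION_TYPES = {"application/swagger+json"}


def plan_api_access(dist: dict) -> str:
    """Decide how a client would learn to use the API behind accessURL."""
    if dist.get("conformsTo") in KNOWN_STANDARDS:
        # A well-known standard: no need to interrogate a description document.
        return "use built-in client for " + dist["conformsTo"]
    if dist.get("describedByType") in KNOWN_DESCRIPTION_TYPES and "describedBy" in dist:
        return "fetch machine-readable description at " + dist["describedBy"]
    # Fall back to the human-readable page that accessURL is assumed to be.
    return "send a human to " + dist["accessURL"]


vegetables = {
    "accessURL": "http://www.agency.gov/api/vegetables/",
    "describedBy": "http://www.agency.gov/api/vegetables/swagger.json",
    "describedByType": "application/swagger+json",
    "format": "API",
    "title": "Vegetables REST API",
}
print(plan_api_access(vegetables))
```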

gbinal commented 9 years ago

Thank you for driving the conversation around this issue and helping to assemble the v1.1 metadata update.

There appears to be strong consensus around this issue, which has been accepted in the v1.1 update and merged into Project Open Data.

However, we know that more can be done to improve how APIs are addressed within the schema.
Please continue any conversations around how the schema can be improved with new issues and pull requests!

It's important for government staff as well as the public to continue to collaborate to make the Open Data Policy ever better. Though the v1.1 update is a substantial update, future iterations do not have to be, so whatever your ideas - big or small - please continue to work with this community to improve how government manages and opens its data.