openstate / CKAN-Link-Checker

Checks a CKAN API for the number of broken links

WFS/WMS services are not addressed correctly #1

Closed ndkv closed 9 years ago

ndkv commented 9 years ago

WMS/WFS services require additional parameters to function correctly. Omitting these results in unpredictable behavior. Some servers return an HTTP 400 error, others return an XML-encoded error message, etc.

The entry point to WMS/WFS services is the so-called Capabilities document. This document describes the capabilities of the service: which layers it contains, which coordinate systems it supports, what types of analysis are supported, etc. The Capabilities document is retrieved through a GetCapabilities request as

http://example.geo.service.nl?
service=WMS / WFS&
request=GetCapabilities

One of the services that the link checker reports as broken is the zeer kwetsbare gebieden (highly vulnerable areas) service. Calling the service's URL directly results in a 400 error.

Calling the service correctly, i.e. with correct values for the service and request parameters, returns a valid Capabilities document:

http://ags101.prvgld.nl/arcgis/services/INSPIRE_ov/MapServer/WFSServer?
service=WFS&
request=GetCapabilities

To test WMS services change the service parameter to WMS:

http://ags101.prvgld.nl/arcgis/services/INSPIRE_ov/MapServer/WMSServer?
service=WMS&
request=GetCapabilities

Note that some services serve WMS and WFS from the same URL; compare

geoservices.rijkswaterstaat.nl/noord_brabant_brabantse_en_midden_limburgse_kanalen?
service=WMS&
request=GetCapabilities

to

geoservices.rijkswaterstaat.nl/noord_brabant_brabantse_en_midden_limburgse_kanalen?
service=WFS&
request=GetCapabilities
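The requests above could be built programmatically along these lines. This is only a minimal Python sketch: `build_capabilities_url` is a hypothetical helper and not part of the link checker, and it assumes the stored link is the bare service endpoint, possibly with other query parameters already attached.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_capabilities_url(base_url, service):
    """Return base_url with service=<service>&request=GetCapabilities appended.

    Hypothetical helper for illustration; `service` is "WMS" or "WFS".
    """
    parts = urlsplit(base_url)
    # Keep existing query parameters, but drop any service/request values
    # that are already present (OGC parameter names are case-insensitive).
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ("service", "request")
    ]
    kept += [("service", service.upper()), ("request", "GetCapabilities")]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Calling it with the WFSServer endpoint above and `"WFS"` yields the same GetCapabilities URL as the one written out by hand.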

See GeoServer's documentation for a primer on WMS/WFS.

breyten commented 9 years ago

Ah, thanks! This actually explains a lot. But I fail to see how this is an issue for the link checker, since it's impossible for us to determine if any given URL is a WMS/WFS service. So the correct way to fix this would be to change the links in the data portal to the call that shows the capabilities, especially since the actual endpoint is explicitly specified in the capabilities response.

ndkv commented 9 years ago

Yes, fixing the links in one of the catalogs (data.overheid.nl / NGR) is the correct solution. That would, however, require a tremendous effort as all entries have to be fixed by hand (right?). For the purpose at hand (check which services are broken and take action to fix them) the easiest solution is to extend the link checker.

Meanwhile we have to figure out how to communicate the above to developers and users. One option is to include a (link to a) short tutorial on WMS/WFS on data.overheid.nl. I am working on such documentation here. This is the source of the 'Voor ontwikkelaars' page on NGR. The page looks quite horrid right now; we've requested Kadaster to fix it by (at the very least) providing a link to the repo and ReadTheDocs.

ndkv commented 9 years ago

The service type is stored in the Formaat property: https://data.overheid.nl/data/dataset/zeer-kwetsbare-gebieden/resource/0f0bf8b7-e48b-4e26-9b7c-423d496e9e1e

Can you read those?

siccovansas commented 9 years ago

Hi @ndkv, thanks for the info! It is really useful to know. I have to agree with @breyten though. The purpose of this CKAN Link Checker is purely to check whether a link works as our focus is on the user of the data portal and not on the maintainer. The user simply wants a working link and must not be expected to read instructions or a document in order to get the link to work. So we do expect that data.overheid.nl or the maintainers of the datasets will correct the links.

Furthermore, this CKAN Link Checker is generic and can be used on other CKAN data portals besides data.overheid.nl, so we don't want to add code which is specific for some datasets from data.overheid.nl.

I definitely agree that it makes sense to make it easy to take action on broken links, but this is something the portal owner should do. CKAN portal owners can already use plugins like ckanext-deadoralive to check for dead links and easily act on them. This CKAN Link Checker puts the link checking power also in the hands of people other than the portal owner :D.

ndkv commented 9 years ago

tl;dr Please fix the checker. :)

I completely agree with you that the links must be stored correctly in the registers. It is extremely frustrating to click on a link and get an obscure error message. Users should, at the very least, get the Capabilities document.

To fix the links, however, we need to know why they are failing. Is the link checker sending them bad requests or are they really broken? Testing the geo-services is somewhat more involved since they are, in essence, APIs. Currently there are three failure modes:

  1. a functioning service is addressed incorrectly -> returns 500, 402 but also 200 with error in XML document (which you treat as valid)
  2. a broken service is addressed correctly -> returns 404, 500
  3. a broken service is addressed correctly -> returns 200 with error message in XML (checker says 'yes')
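The "200 with error in XML" cases in this list could, in principle, be told apart from a valid response by looking at the body rather than only at the status code. The following is a rough Python sketch, not the checker's actual code: `classify_response` is a hypothetical name, and it assumes the server follows the usual OGC convention of returning an error document whose root element is `ServiceExceptionReport` or `ExceptionReport`. Servers that report errors in some other shape would slip through.

```python
import xml.etree.ElementTree as ET

# Root elements that OGC services conventionally use for error documents
# (assumption: the server sticks to these conventions).
EXCEPTION_ROOTS = {"ServiceExceptionReport", "ExceptionReport"}


def classify_response(status_code, body):
    """Classify a response as 'http_error', 'service_exception' or 'ok'."""
    if status_code >= 400:
        return "http_error"
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not XML at all (e.g. HTML or an image); nothing more to inspect.
        return "ok"
    # Drop a '{namespace}' prefix from the root tag, if present.
    local_name = root.tag.rsplit("}", 1)[-1]
    return "service_exception" if local_name in EXCEPTION_ROOTS else "ok"
```

With something like this, a 200 response that carries an exception report would no longer be counted as a working link.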

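In the first case, the link checker produces both false negatives (fn) and false positives (fp): a 500 or 402 response makes it report a functioning service as broken (fn), while a 200 response carrying an XML error makes it report a link that only shows users an error as working (fp).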

In the third case the checker is producing false positives: the service is broken but returns a 200 status and the checker labels it as functioning. In both cases users are suffering: they either can't get data that is available or they are presented with broken links.

Addressing these differences is crucial when communicating the results to the folks who have to fix the links: data owners and/or register maintainers. If you report (as you currently do) that their service is broken according to 1, they will disagree since, from their perspective, everything works as it should. Chances are they'll ignore your report and carry on serving unusable services. The false positives are not addressed either since no one knows that they are broken. Result: nothing will change.

Resolving this issue by showing the Capabilities doc to users requires that the links in data.overheid.nl are corrected, i.e. that the request and service parameters are appended to the URLs. How to achieve this technically is an open question: should data owners amend their links in NGR, should the maintainers of data.overheid.nl fix them on their end (if so, how?), should the NGR -> data.overheid.nl mapping be modified, should the fix be visual (since you'd like to keep the core URL to request data through a request=GetFeature request)?

Whatever the solution, it will require a collective effort due to the sheer number of links, the two separate registers and multitude of stakeholders. To address this issue effectively we need to get the numbers straight, drill down to the core of each failure and implement robust fixes.

Please note that the top 20 domains in your list of failures are filled with geo-services. Addressing the shortcomings I describe here will, probably, reduce the number of broken services considerably. In other words, most of the numbers you report are, very strictly speaking (i.e. from the data owners' perspective), wrong and thus irrelevant.

To be clear, I'm not defending the current state of affairs. My goal is to get the links fixed so that they work for expert and novice users under these suboptimal conditions (why the heck do these WMS/WFS services fail when invoked without parameters, right? They should at least return that Capabilities doc... And why do they return a rainbow of error codes...).

ndkv commented 9 years ago

Here's an example of a service that, from the users' perspective, returns an error but the checker flags it as functioning: http://geoservices.rijkswaterstaat.nl/noordzee_natura2000_zee_en_delta

Adding the correct parameters returns the Capabilities doc: http://geoservices.rijkswaterstaat.nl/noordzee_natura2000_zee_en_delta?request=getcapabilities&service=wms

The service is, from the owner's perspective, functioning correctly.

This particular data record contains 10 links which the checker has, from the user's perspective, labelled incorrectly as 'working'. Consequence: the user suffers and your tallies are wrong.

All of this stems from a difference in the definition of a working service. As I said earlier, I agree that the services should Just Work for users when approached from search interfaces such as data.overheid.nl. The discussion therefore should not be about how many services are broken; rather, we should figure out how to make them work for users while respecting the owner's view on the matter (see the above post for a possible solution).

pvgenuchten commented 9 years ago

Hi, good discussion. A note on @breyten's comment "since it's impossible for us to determine if any given URL is a WMS/WFS service": the document in CKAN is converted from an ISO 19139 document. The originating ISO 19139 document has, for each link, a "protocol" attribute that contains a value for WMS or WFS (apparently this is mapped to a field "formaat" in data.overheid.nl, but I would suggest creating an official CKAN field for this). See also this discussion on improving interoperability between catalogues (which defines a codelist for protocols): https://github.com/OSGeo/Cat-Interop/issues/1

ndkv commented 9 years ago

It seems there is one already, called format; see

https://github.com/ndkv/CKAN-Link-Checker/blob/master/check_ckan_links.py#L170
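If the service type is indeed available in that field, the checker could use it to decide which links need the extra parameters. A minimal Python sketch, under stated assumptions: `detect_ogc_service` is a hypothetical helper, the resource is a CKAN resource dict with a `format` key, and the stored values contain the string "WMS" or "WFS" in some form (for example "WMS" or "OGC:WMS"). The actual values in data.overheid.nl would have to be checked.

```python
def detect_ogc_service(resource):
    """Return 'WMS', 'WFS' or None for a CKAN resource dict.

    Hypothetical helper; assumes the service type appears somewhere in
    the resource's 'format' value, in any letter case.
    """
    fmt = (resource.get("format") or "").upper()
    for service in ("WMS", "WFS"):
        if service in fmt:
            return service
    # Anything else (CSV, PDF, missing format) is treated as a plain link.
    return None
```

Resources for which this returns None would be checked exactly as they are now, so the checker would stay generic for other CKAN portals.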

pvgenuchten commented 9 years ago

Note (as stated in https://twitter.com/thijsbrentjens/status/600195260967432192) that this use case is a useful contribution to https://www.w3.org/2015/spatial