Add code to process linked data authorities via configuration

elrayle commented 8 years ago

Assumptions…

Authority defines an HTTP API that accepts a query string and returns one or more results in a supported linked data format.
Authority defines an HTTP API that accepts an ID or URI and returns linked data for the identified term in a supported format.
Supported formats (at this time): application/rdf+xml, application/json (json-ld)
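
As a rough illustration of the two assumed API shapes (search by query string, retrieval by ID or URI), here is a minimal sketch. The `AUTHORITY_CONFIG` hash, its key names, the `example.org` URLs, and the `build_url` and `build_request` helpers are all hypothetical and are not the configuration format used by this PR; the README describes the real one.

```ruby
require "net/http"
require "uri"

# Hypothetical configuration: key names and example.org URLs are illustrative only.
AUTHORITY_CONFIG = {
  search: { url_template: "http://example.org/search?q={query}" },
  term:   { url_template: "http://example.org/term/{id}" },
  accept: "application/rdf+xml, application/json"
}.freeze

# Substitute each {placeholder} in the template with its URL-encoded value.
def build_url(template, values)
  values.reduce(template) do |url, (key, value)|
    url.gsub("{#{key}}", URI.encode_www_form_component(value.to_s))
  end
end

# Build (but do not send) a GET request that asks for a supported format.
def build_request(url)
  Net::HTTP::Get.new(URI(url), { "Accept" => AUTHORITY_CONFIG[:accept] })
end

search_url = build_url(AUTHORITY_CONFIG[:search][:url_template], { query: "corn field" })
term_url   = build_url(AUTHORITY_CONFIG[:term][:url_template], { id: "c_12332" })
request    = build_request(search_url)
```

Keeping the URL templates in configuration is what would let a new authority be added without new code, provided it fits the single-URL HTTP pattern assumed above.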

Includes configurations for...

OCLC FAST Linked Data (supports search query and term retrieval)
FAO AGROVOC Linked Data (supports search query and term retrieval)
Library of Congress Linked Data (supports term retrieval only)

See README for details on configuration, results format for queries and terms, and other usage information.

elrayle commented 8 years ago

Please review design and code for this PR which adds the ability to specify linked data authorities. Information on configuration, results format for queries and terms, and other usage information is in the README.

NOTE: As I tried to extend this beyond the included authority configurations, I ran into several challenges.

- authority does not have an API for search, for term retrieval, or for either
- authority requires setup at the site before the API can be used
  - payment required to use the API
  - registration of username required
- authority has a different approach that is more extensive than a simple single URL HTTP API
  - query is defined in an XML file that is submitted via POST
  - only a SPARQL endpoint is supported

elrayle commented 8 years ago

All tests pass on my local machine. Some of the tests are failing in Travis because the arrays are not returning items in a predictable order. I am fixing those. Then I'll see what's left failing via Travis.

Please review the design in spite of the failing tests. The tests are expected to change, but not the code itself.

elrayle commented 8 years ago

Further review of the errors reveals that I had RDF 1.99.0 on my local machine and Travis was using RDF 2.0.2. When I ran bundle update on my local machine, I saw the same errors.

The order errors are going to be challenging. Order in this case does matter. For the authorities I am querying, results are returned in RDF in a specific order with the best match appearing first. I know that graphs are inherently unordered, but it was convenient to load the RDF into a graph and then process the graph.

@no-reply Is the change in the way RDF 2.0.2 handles ordering permanent?

no-reply commented 8 years ago

> @no-reply Is the change in the way RDF 2.0.2 handles ordering permanent?

Short answer: yes.

Longer version: It's theoretically possible to implement an in-memory repository that retains order. You could then force use of this as the default repository to support guaranteed stable order. This has a number of problems: runtime cost, ongoing maintenance of the custom repository, and potential conflicts with other libraries relying on the main RDF::Repository::Implementation.

> For the authorities I am querying, results are returned in RDF in a specific order with the best match appearing first.

The largest issue is that there's no particular guarantee that any of the steps in this process will return in any given order. Particularly: unless there's some off-spec guarantee of order on the data provider side, the source data order may change without notice; the parser may also output statements in any order.

If there's a point where you can guarantee the order you are looking for, my advice would be to load the statements (or just the values) directly into an Array.

Is there a reason you can't rely on skos:prefLabel and mads:authoritativeLabel in your example data?

elrayle commented 8 years ago

I agree with your comment about 'off-spec guarantee of order'. Depending on authorities to maintain their current implementation is fragile.

For a single term (i.e. same subject URI), statement order probably doesn't matter. The configuration identifies the predicate to use for the label (e.g. skos:prefLabel). Any tests that fail because of ordering within a single term can be updated to ignore order.
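
To make the "ignore order" idea concrete, here is a minimal sketch of an order-insensitive comparison. The label values and the `same_items?` helper are made up for illustration and are not taken from the PR's specs.

```ruby
# Hypothetical values for one term; not taken from the actual test suite.
expected_alt_labels = ["maize", "corn", "Zea mays"]
actual_alt_labels   = ["Zea mays", "maize", "corn"]

# True when both arrays hold the same items, regardless of order.
# Sorting (rather than converting to a Set) keeps duplicates significant.
def same_items?(expected, actual)
  expected.sort == actual.sort
end

order_sensitive_match   = (expected_alt_labels == actual_alt_labels)
order_insensitive_match = same_items?(expected_alt_labels, actual_alt_labels)
```

In an RSpec suite the same check is usually written with the built-in `match_array` or `contain_exactly` matchers, e.g. `expect(actual_alt_labels).to match_array(expected_alt_labels)`.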

For queries, multiple terms (i.e. many subject URIs) are returned. For OCLC, they return in rdf+xml sorted in order of usage with the highest usage term appearing first in the RDF results. Queries can return hundreds of results. Unordered results at this scale are useless to the end user who desires a reasonably short selection list for choosing the best match. A site can ameliorate this by limiting the number of returned results, for example, to 20 matches. But unless that number is really low, the best match results could be far down the list and still difficult for the end user to locate, especially if its location in the list of 20 is different every time the user types the same query.

no-reply commented 8 years ago

> For OCLC, they return in rdf+xml sorted in order of usage with the highest usage term appearing first in the RDF results.

My advice would be to parse these as XML and into an Array.
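
A minimal sketch of that suggestion, assuming REXML (bundled with standard Ruby installs) and a simplified rdf+xml response in which each result is a top-level rdf:Description carrying a skos:prefLabel. The sample document, the `example.org` URIs, and the `ordered_results` helper are hypothetical; real authority responses may use other predicates or typed nodes and would need more handling.

```ruby
require "rexml/document"

RDF_NS  = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical response; assumed to arrive already sorted, best match first.
SAMPLE_XML = <<~XML
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:skos="http://www.w3.org/2004/02/skos/core#">
    <rdf:Description rdf:about="http://example.org/term/1">
      <skos:prefLabel>Corn</skos:prefLabel>
    </rdf:Description>
    <rdf:Description rdf:about="http://example.org/term/2">
      <skos:prefLabel>Corn belt</skos:prefLabel>
    </rdf:Description>
    <rdf:Description rdf:about="http://example.org/term/3">
      <skos:prefLabel>Corn oil</skos:prefLabel>
    </rdf:Description>
  </rdf:RDF>
XML

# Walk the top-level elements in document order and collect them into an
# Array, so the provider's ordering survives (unlike loading into a graph).
def ordered_results(xml, limit: nil)
  doc = REXML::Document.new(xml)
  results = []
  doc.root.each_element do |element|
    next unless element.namespace == RDF_NS && element.name == "Description"

    label = element.elements.find do |child|
      child.namespace == SKOS_NS && child.name == "prefLabel"
    end
    results << { uri: element.attributes["rdf:about"], label: label&.text }
  end
  limit ? results.first(limit) : results
end

top_two = ordered_results(SAMPLE_XML, limit: 2)
```

Because the Array is built in document order, applying a limit keeps the first (best) matches in a stable position for repeated queries, as long as the provider keeps returning them in that order.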

elrayle commented 7 years ago

This PR is too old to merge cleanly with the changes made in master since it was created. A new PR will be created from a new branch that is rebased off master.

samvera / questioning_authority

Add code to process linked data authorities via configuration #109