reconciliation-api / specs

Specifications of the reconciliation API
https://reconciliation-api.github.io/specs/draft/

Using identifierSpace when the entity set uses a mixture of namespaces #139

Closed osma closed 1 year ago

osma commented 1 year ago

There has been some discussion on identifierSpace and schemaSpace before, e.g. in issue #3 and PR #76. The definitions of these have shifted over time. The current definition, in both the latest draft spec and version 0.2, of identifierSpace is:

identifier space The URI namespace (i.e. prefix) for the identifiers of an entity returned by the reconciliation service, for example http://www.wikidata.org/entity/ or https://d-nb.info/gnd/. This URI MAY resolve to a page describing these entities and their identifiers;

We are currently implementing reconciliation API support for Annif (see https://github.com/NatLibFi/Annif/pull/734) and providing the identifierSpace information has caused some headaches. Returning the service manifest is mandatory, and the identifierSpace information is also mandatory within the manifest: "A reconciliation service MUST define two URIs [...] identifierSpace ... schemaSpace"

Service manifest Example 1 given in the spec uses this identifierSpace:

"identifierSpace": "http://vocab.getty.edu/doc/#GVP_URLs_and_Prefixes",

(FWIW, I would like to point out that this doesn't seem to match well with the definition - IIRC this is not the URI namespace prefix for any Getty vocabulary, but a URI/URL of a web page explaining them. But that is a separate problem, maybe the example is just outdated.)

Annif uses SKOS vocabularies internally and often those vocabularies use a specific URI namespace; in my understanding, this would be the natural value for identifierSpace. But Annif is currently unaware of this namespace, and there is nothing in principle preventing a vocabulary from using a mixture of namespaces. For example, a vocabulary could consist of a mixture of Wikidata and GND entities. A perhaps more realistic example would be a mixture of YSO concepts and those of a domain-specific extension vocabulary such as KAUNO (fiction literature), JUHO (public administration) or TERO (health and welfare), all of which are extensions of YSO - you can think of them naively as additional concepts to add on top of YSO - that use their own URI namespace which is different from YSO.

So what should Annif return in the service manifest for a project that uses a vocabulary whose URI namespace it isn't aware of? Should it look at all the concept URIs and try to infer what is the longest common prefix? What if the URIs are a mixture of namespaces and there is nothing in common - say, a mixture of http and https URIs?
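To make the prefix-inference idea concrete, here is a minimal sketch of what "longest common prefix" could look like. The function name `infer_identifier_space` and the sample concept IDs are made up for illustration; this is not how Annif works today.

```python
import os.path


def infer_identifier_space(uris):
    """Guess a namespace as the longest common string prefix of the URIs,
    cut back to the last '/' or '#' so it does not end inside a local name.
    Returns "" when no such prefix exists."""
    prefix = os.path.commonprefix(list(uris))
    cut = max(prefix.rfind("/"), prefix.rfind("#"))
    return prefix[: cut + 1] if cut >= 0 else ""


# Illustrative URIs only; the local IDs are not meant to be real concepts.
yso_only = ["http://www.yso.fi/onto/yso/p1", "http://www.yso.fi/onto/yso/p2"]
yso_plus_extension = yso_only + ["http://www.yso.fi/onto/kauno/p3"]
http_and_https = [
    "http://www.wikidata.org/entity/Q42",
    "https://d-nb.info/gnd/118540238",
]

print(infer_identifier_space(yso_only))
print(infer_identifier_space(yso_plus_extension))
print(infer_identifier_space(http_and_https))
```

With a single namespace the guess is sensible. With YSO plus an extension vocabulary the result is a shared parent path that is not the namespace of either vocabulary, and with a mixture of http and https URIs nothing usable is left at all. Two unrelated http hosts would even yield a bare "http://".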

Or should the value be something more custom (somewhat like the Getty document in the example) that isn't really a URI namespace at all, but is unique to the vocabulary / entity set? For example, the reconciliation service at /rest/v1/projects/myproject/reconcile could return an identifierSpace of /rest/v1/vocabs/myvocab (i.e. the vocabulary used by myproject). That doesn't seem to match the current definition of identifierSpace, as it talks specifically about URI namespace prefixes, but would at least be a shared identifier that could also be referenced by other endpoints at the same Annif instance which use the same underlying vocabulary.

Or is it OK to return an identifierSpace of "" (the current quick-and-dirty solution in the Annif draft PR)? It seems to work fine with OpenRefine - apparently this information is not used at all. Maybe providing identifierSpace shouldn't be a MUST in the spec, if it's actually not used by the main client tool that this API is targeting.

fsteeg commented 1 year ago

That idea of using full URIs as IDs and an empty identifierSpace came up before, I commented on that here: https://github.com/reconciliation-api/specs/issues/39#issuecomment-825570330. Not sure how valid my remarks actually are, in particular whether data extension would really be a problem. If this actually works, I agree it would seem like the identifierSpace should actually be optional.

thadguidry commented 1 year ago

OpenRefine is not the main client tool being targeted. OpenRefine is just one client that follows the reconciliation spec and OpenRefine will continue to evolve as the reconciliation spec evolves as well through versioning. We hope that other client tools and services might adapt the spec and provide feedback which helps standardize concepts of record linkage, linked data, and data augmentation and knowledge merging.

I hope the spec and any documentation through sites we control within W3C and GitHub are very clear about those facts where OpenRefine is not the target but merely a client using the spec. If you read otherwise, please let us know where so we can make that clear.

Regarding your specific namespace questions, I'll let others chime in with their thoughts.

osma commented 1 year ago

Thanks for the clarification, @thadguidry. Indeed I know that OR is nowadays just one client among others and my phrasing was not ideal. But I think it has a certain status above others (more equal than other clients?) because of the history of the API.

Rephrasing what I was trying to say above: I don't quite understand the use case for identifierSpace in its current form. It didn't seem to matter to OpenRefine at least. Should it really be a MUST in the spec?

thadguidry commented 1 year ago

@osma I personally think that namespaces should be optional. Reconciling as a process has only a few core concepts (entity, property, etc.), and a simple reconciliation service about dog breeds or house roof architecture styles might not need to say anything about its entities being classified in a formalized schema or namespace, i.e. it might simply provide an id, entity name and nothing further in a result, with the thought being that its id MIGHT be used or referenced in another service or throughout the linked data ecosystem, ... but we should not push a service to provide a namespace. It would be a recommendation for best practices if they want to be a good citizen in the RDF or lightly in the linked data world. So I think we can change our phrasing to mention some of that. Namespaces are optional, but encouraged if you intend to participate in the linked data ecosystem.

Interested to hear how others think of my views here.

osma commented 1 year ago

@fsteeg wrote:

That idea of using full URIs as IDs and an empty identifierSpace came up before, I commented on that here: https://github.com/reconciliation-api/specs/issues/39#issuecomment-825570330. Not sure how valid my remarks actually are, in particular whether data extension would really be a problem. If this actually works, I agree it would seem like the identifierSpace should actually be optional.

Thank you for the pointer! This aligns pretty well with my thinking. In my view, the problem is not the use of URIs as IDs as such, but the (apparent) requirement in the spec to put the base URI in identifierSpace and only the local part as the ID in reconciliation results. It's doable when the set of entities comes from a single known namespace, but that's not always the case in some real world settings such as a mixture of complementary vocabularies (see OP).

@thadguidry Thanks for your thoughts. Regarding this:

It would be a recommendation for best practices if they want to be a good citizen in the RDF or lightly in the linked data world. Namespaces are optional, but encouraged if you intend to participate in the linked data ecosystem.

To be clear, I didn't intend to propose scrapping linked data URIs entirely, just avoiding the need for an identifierSpace and instead putting the full URI in the ID field. In my understanding, one can be a good RDF / Linked Data citizen without having all entities in the same URI namespace!


Should I propose (as a PR) an amendment to the spec that makes identifierSpace optional, or at least make it clear that an empty string is a valid value for it?

tfmorris commented 1 year ago

OpenRefine is not the main client tool being targeted.

OpenRefine has the largest user base and longest history of experience with reconciliation services, so naturally has a prominent role in future definitions.

a simple reconciliation service ... might simply provide an id, entity name and nothing further in a result, with the thought being that its id MIGHT be used or referenced in another service or throughout the linked data ecosystem

Without a namespace to qualify it, how would one know where this unqualified ID could be used? External out-of-band communication / coordination? That seems HUGELY suboptimal to me.

I think that the concept of a single namespace for all identifiers served by a given service is obsolete (and probably was never ideal). Identifiers SHOULD resolve to fully qualified IRIs, whether that be through the declaration of a default namespace or perhaps, more flexibly, through declaring one or more namespace prefixes such as is done in XML & RDF and then using them in conjunction with the identifiers.
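As a rough illustration of the prefix idea, the sketch below expands identifiers to full IRIs using either declared prefixes or a default namespace. Everything here is hypothetical: the current spec has no prefix declarations, and the `PREFIXES` mapping, the `expand` function and the `prefix:local` identifier syntax are assumptions borrowed from how RDF serializations handle prefixed names.

```python
# Hypothetical prefix declarations a service might publish (not part of the spec).
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "gnd": "https://d-nb.info/gnd/",
}


def expand(identifier, prefixes, default_namespace=None):
    """Resolve an identifier to a full IRI.

    - identifiers that already contain '://' are treated as full IRIs
    - 'prefix:local' is expanded when the prefix has been declared
    - anything else is appended to the default namespace, if one is given
    """
    if "://" in identifier:
        return identifier
    prefix, sep, local = identifier.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    if default_namespace is not None:
        return default_namespace + identifier
    raise ValueError(f"cannot resolve {identifier!r} to a full IRI")


print(expand("wd:Q42", PREFIXES))
print(expand("118540238", PREFIXES, default_namespace="https://d-nb.info/gnd/"))
print(expand("http://www.yso.fi/onto/yso/p1", PREFIXES))
```

A single default namespace covers the case the current identifierSpace definition has in mind, while declared prefixes would let one service return entities from several namespaces without resorting to out-of-band coordination.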

wetneb commented 1 year ago

I think we could remove the identifierSpace and schemaSpace altogether, and instead use the view templates.

The existing view template already specifies how the entity ids are turned into URIs. We could add other view templates to turn the properties and types into URIs. If a service returns URIs from different providers, then they likely use the entire URIs as ids, which is likely reflected by their view template.
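A minimal sketch of that idea, assuming the view template is the `url` string under `view` in the manifest and that `{{id}}` is its placeholder for the entity id. The two manifests and the helper `entity_uri` are invented for illustration, not taken from any real service.

```python
def entity_uri(manifest, entity_id):
    """Turn an entity id into a URI by filling in the manifest's view template."""
    return manifest["view"]["url"].replace("{{id}}", entity_id)


# Service whose entities share one namespace: ids are local names.
single_namespace = {"view": {"url": "http://www.wikidata.org/entity/{{id}}"}}

# Service mixing several namespaces: ids are already full URIs,
# so the view template is just the id itself.
mixed_namespaces = {"view": {"url": "{{id}}"}}

print(entity_uri(single_namespace, "Q42"))
print(entity_uri(mixed_namespaces, "https://d-nb.info/gnd/118540238"))
```

In both cases the client ends up with a full URI without the manifest declaring any identifier space.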

So for me, the change would be:

osma commented 1 year ago

Thanks @wetneb, that sounds like a nice plan. In my view that feels like two similar but independent sets of changes:

  1. Remove identifierSpace and make the view template for entities mandatory.
  2. Remove schemaSpace and introduce new (perhaps optional) view templates for types and properties.

The first set deals with identifierSpace, the second one with schemaSpace. I'm mainly interested in the first one as it's the more urgent problem right now.

Shall I open a PR implementing the first one only, or a PR that does both, or two separate PRs one for each?

wetneb commented 1 year ago

Great! Feel free to open a PR in any form, I'd say :)

osma commented 1 year ago

I started editing the spec to prepare for a PR, but soon stumbled on this under 7.4 Data Extension Responses (emphasis mine):

The rows object contains, for each entity identifier in the data extension query, for each property identifier in the metadata, the property values of that property in that entity. If the property values are entities, their identifiers are expected to be in the service's identifier space. If that is not the case, the service MUST specify in the meta section the endpoint of another reconciliation service whose identifier space contains the returned entities. This endpoint is specified on a column-per-column basis.

So removing identifierSpace is not so simple, as the spec for Data Extension Responses also relies on it. Here it seems to be used not just as a mechanism for expanding local entity IDs into global URIs/IRIs, which can indeed be replaced with view templates, but as a means of directing clients to another relevant reconciliation endpoint. I am not very familiar with the use cases for Data Extension, so I'm not sure how important this mechanism is and whether it could be replaced with something else if we decide to drop the notion of identifier spaces from the spec.

wetneb commented 1 year ago

It does not seem to be a big blocker to me, as the notion of identifier space is just used as a proxy to say "those identifiers are valid for this service". How about something like this? (rough formulation, the language can certainly be improved)

If the property values are entities, they are expected to be valid entities for the service at hand. If that is not the case, the service MUST specify in the meta section the endpoint of another reconciliation service for which the entity ids are valid, i.e. inserting them in the entity view template of that other service yields valid URIs. This endpoint is specified on a column-per-column basis.

osma commented 1 year ago

Newbie contributor question: I see that there are many example files under draft/examples/ which more or less correspond to the current spec text. If I make a PR that changes the spec, should I also adjust the examples accordingly? And in that case, all the examples, or is it enough to change just the ones that have been included in the rendered spec HTML using the ReSpec data-include mechanism?

Asking because there are naturally quite a few examples that use identifierSpace and schemaSpace - after all, these are currently required fields in the service manifest. Some of the examples align better with the current definitions of these fields, others less so. In particular, schemaSpace seems to be used in many different ways in the examples; sometimes as a type URI (e.g. skos:Concept), other times as a URL pointing to an ontology, or the URL of some data model documentation.

osma commented 1 year ago

If I make a PR that changes the spec, should I also adjust the examples accordingly?

Never mind, I already did that. I opened PR #140 which drops both identifierSpace and schemaSpace. Let's see what people think about this somewhat drastic change :)