Allow publishers to distinguish between types of identifier

stevenday commented 5 years ago

So that:

the Register can publish useful extra info such as OpenCorporates identifiers, or un-verified identifiers declared by data submitters.

Currently the Register's BODS export only outputs an identifier when it comes directly from the source register (e.g. GB-COH ids for UK PSC data). However, we hold other identifiers too, such as OpenCorporates' jurisdiction/company number, identifiers declared for non-UK companies that own UK companies, and those in data from EITI's third-party research.

We'd like to publish these identifiers too, but don't want to release them as identifiers without being able to notify the user that they're not as 'official' as those we currently publish.

I'm not sure exactly what the solution is here - another type field and associated codelist, use an annotation, or something like the notion of embedding a 'source' object within other pieces of data...

I'm mainly creating this issue to start a discussion and see if our current practice is overly conservative.

timgdavies commented 5 years ago

I would suggest there are three possible elements of an approach at this point in time (distinct from the wider issue of whether the next iteration of the standard should include a standardised field to handle this). For the Open Ownership register bulk download, we should consider:

(1) Including in the download documentation a user-friendly note on which fields may have been added or updated by the register
(2) Using the annotation object to describe the manipulation that has taken place in-line within each statement

and/or

(3) Including some non-standard in-line 'comment' element that provides a more concise way of identifying manipulated identifiers.

(1) is important, as it allows data users to know in advance the issues they need to watch out for in the data. They should already be aware from https://register.openownership.org/download that they are not getting 'from source' data, but are getting matched and value-added data.

(2) may or may not be possible depending on how easy it is to generate (and use) ordered JSON pointers to particular array elements. I would potentially ignore this for now, and engage more with consumers of the data about their provenance data needs.
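For reference, a pointer to a particular array element can be generated mechanically from the element's index. The following is only a minimal Python sketch, assuming the annotation object carries `statementPointerTarget`, `motivation` and `description` fields and that `commenting` is an acceptable motivation value; the helper names `json_pointer` and `annotate_identifiers` and the `added_by_register` predicate are hypothetical and not part of the standard or the register.

```python
def json_pointer(*parts):
    """Build an RFC 6901 JSON Pointer from path parts (object keys or array indexes)."""
    # "~" and "/" are the only characters that need escaping in a pointer token.
    escaped = (str(p).replace("~", "~0").replace("/", "~1") for p in parts)
    return "".join("/" + p for p in escaped)


def annotate_identifiers(statement, added_by_register):
    """Return one annotation per identifier that the predicate flags.

    `added_by_register` is a hypothetical callable: identifier dict -> bool.
    """
    annotations = []
    for index, identifier in enumerate(statement.get("identifiers", [])):
        if added_by_register(identifier):
            annotations.append({
                # Field names assumed from the BODS annotation object.
                "statementPointerTarget": json_pointer("identifiers", index),
                "motivation": "commenting",
                "description": "Identifier added during OpenCorporates reconciliation.",
            })
    return annotations


statement = {
    "identifiers": [
        {"scheme": "GB-COH", "id": "07444723"},
        {"schemeName": "OpenCorporates", "uri": "https://opencorporates.com/cy/12345678"},
    ]
}
notes = annotate_identifiers(
    statement, lambda i: i.get("schemeName") == "OpenCorporates"
)
print(notes[0]["statementPointerTarget"])
```

The pointer depends on array order, so a consumer that re-sorts the identifiers array would need to recompute it, which is part of why this route may be harder to use than to generate.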

For (3), perhaps something like:

{
  "identifiers": [
    {
      "scheme": "GB-COH",
      "id": "07444723",
      "dataNote": "Identifier added during OpenCorporates reconciliation. Errors and omissions expected."
    }
  ]
}

Given the bulk download should be considered in beta (as is the standard) - and as long as we can communicate clearly with users in the future about any changes, we should avoid getting blocked too long on this.

stevenday commented 5 years ago

@timgdavies - thanks for thinking about this!

What I'm suggesting on our internal ticket is that we do something like:

"identifiers": [
    // When there's a reliable identifier direct from source (e.g. a UK company number from the UK PSC register)
    {
        "scheme": "GB-COH",
        "id": "07444723"
    },
    // When there's a less reliable identifier direct from source (e.g. a Cayman Islands company number from the UK PSC register)
    {
        "schemeName": "CY Jurisdiction, assumed national register",
        "id": "12345678"
    },
    // For every single entity (people and legal entities)
    {
        "schemeName": "OpenOwnership Register",
        "uri": "https://register.openownership.org/entities/123456abcdef"
    },
    // For every single legal entity we match to OC
    {
        "schemeName": "OpenCorporates",
        "uri": "https://opencorporates.com/cy/12345678"
    }
]

This seems to me to be within the letter of the standard, if not the spirit, and doesn't require any non-standard extensions. It totally would require documentation though.
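As a rough illustration of how the four cases above could be assembled in an export step, here is a minimal Python sketch. The shape of the `entity` record (`scheme`, `company_number`, `jurisdiction`, `register_id`, `opencorporates_uri`) and the function name `build_identifiers` are assumptions made for the example, not the register's actual data model.

```python
REGISTER_BASE = "https://register.openownership.org/entities/"


def build_identifiers(entity):
    """Assemble a BODS identifiers array for one entity.

    `entity` is a hypothetical internal record; every key read here is an
    assumption made for illustration.
    """
    identifiers = []
    number = entity.get("company_number")

    if number and entity.get("scheme"):
        # Reliable identifier direct from the source register.
        identifiers.append({"scheme": entity["scheme"], "id": number})
    elif number:
        # Less reliable identifier: no validated scheme, so use schemeName only.
        name = entity["jurisdiction"] + " Jurisdiction, assumed national register"
        identifiers.append({"schemeName": name, "id": number})

    # Every entity (people and legal entities) gets a register URI.
    identifiers.append({
        "schemeName": "OpenOwnership Register",
        "uri": REGISTER_BASE + entity["register_id"],
    })

    # Only legal entities matched to OpenCorporates get an OC URI.
    if entity.get("opencorporates_uri"):
        identifiers.append({
            "schemeName": "OpenCorporates",
            "uri": entity["opencorporates_uri"],
        })
    return identifiers


uk_company = build_identifiers({
    "scheme": "GB-COH",
    "company_number": "07444723",
    "register_id": "123456abcdef",
})
foreign_company = build_identifiers({
    "jurisdiction": "CY",
    "company_number": "12345678",
    "register_id": "123456abcdef",
    "opencorporates_uri": "https://opencorporates.com/cy/12345678",
})
```

Keeping `scheme` reserved for identifiers taken directly from the source register means consumers can filter on the presence of that one key.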

You are correct though that even our 'very reliable' identifiers may, in theory, have been modified by the OC reconciliation process. I'd prefer to just include notes in the documentation page (alongside notes about other fields that come from OC, that's a great suggestion) rather than include them in every single identifier though, purely because it seems redundant to have them there. I'll make a new ticket for us to add that though, because I think it's separate from this issue.

timgdavies commented 5 years ago

@stevenday Ok. This makes good sense to me - and re-reading https://standard.openownership.org/en/v0-2-0/schema/reference.html#schema-identifier looks like it is within spirit and letter, as schemeName is defined as:

"The name of this scheme, where the org-id code is unknown or only an unvalidated string is provided. scheme or schemeName (or both) MUST be included in an Identifier object."

From a user perspective, something with a scheme will generally be from an official register, and something with only a schemeName is either not, or is not robustly guaranteed as such (i.e. the CY case, where we assume no checking of the identifier's scheme takes place during input).
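A data user could apply that rule of thumb with a few lines of code. The sketch below is a hedged Python illustration only: the function name `identifier_status` and the labels "official" and "unverified" are hypothetical and are not defined by the standard.

```python
def identifier_status(identifier):
    """Classify an identifier by whether it carries a scheme code.

    The returned labels are hypothetical, chosen for this example.
    """
    has_scheme = bool(identifier.get("scheme"))
    has_scheme_name = bool(identifier.get("schemeName"))
    if not (has_scheme or has_scheme_name):
        # Mirrors the schema rule quoted above.
        raise ValueError("scheme or schemeName (or both) MUST be included")
    # A scheme code suggests an official register; schemeName alone does not.
    return "official" if has_scheme else "unverified"


examples = [
    {"scheme": "GB-COH", "id": "07444723"},
    {"schemeName": "CY Jurisdiction, assumed national register", "id": "12345678"},
    {"schemeName": "OpenCorporates", "uri": "https://opencorporates.com/cy/12345678"},
]
statuses = [identifier_status(i) for i in examples]
```

This only reflects how the register chooses to populate the fields; another publisher could use `schemeName` differently, so the check is a convention rather than a guarantee.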

Is there still a need for some extension to the standard, or can we close this issue - and just archive for future reference/documentation updates?

stevenday commented 5 years ago

> Is there still a need for some extension to the standard, or can we close this issue - and just archive for future reference/documentation updates?

I'm happy if you're happy :) This ticket was really about getting a second opinion on how to do it and whether our suggestion was acceptable, given the register's status as a bit of a flagship for BODS data. It sounds like it is, so I'm happy to close this ticket.

openownership / data-standard

Allow publishers to distinguish between types of identifier #244