Strengthening schema validation of statement identifiers

timgdavies commented 7 years ago

An issue related to the draft conceptual model.

Each statement published by an organisation MUST have a unique statement identifier.

These identifiers should be persistent.

There is an important question of whether these identifiers should be globally unique or only need to be unique to the given publisher.

Considerations:

- Globally unique identifiers aid integration of data from multiple sources.
- Publishers may have existing identifiers from their internal systems that they want to use in published data. Any method for generating globally unique identifiers must also ensure statement identifiers are persistent.
- Globally unique identifiers are trickier to produce via some data publication methods (e.g. spreadsheet data).
- URIs are a possible candidate for globally unique identifiers, but:
  - URIs are not very intuitive for users as identifiers when accessed in spreadsheets, databases etc, and do not work well as components of other URIs;
  - Many publishers will find it difficult to provide dereferenceable URIs;
  - Some publishers may not maintain persistent URIs, as web property locations are often affected by technical changes.

Questions:

What statement identifier requirements should the standard set out? Should it require a particular format of identifier?

How would consuming applications handle locally unique identifiers (i.e. the possibility that two different publishers have the same identifier for different statements)?
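
One way a consuming application might cope with locally unique identifiers is to key every statement on the pair (publisher, statement ID) rather than on the statement ID alone. A minimal Python sketch, assuming hypothetical publisher labels, field names and records that are not taken from the standard:

```python
def index_statements(statements):
    """Index statements by (publisher, statementID).

    Equal local IDs from different publishers get different keys, so they
    do not overwrite each other. A repeated ID within one publisher is
    treated as an error.
    """
    index = {}
    for statement in statements:
        key = (statement["publisher"], statement["statementID"])
        if key in index:
            raise ValueError(f"duplicate identifier within publisher: {key}")
        index[key] = statement
    return index


# Hypothetical data: two publishers reuse the local ID "1001".
statements = [
    {"publisher": "register-a", "statementID": "1001", "name": "Alpha Ltd"},
    {"publisher": "register-b", "statementID": "1001", "name": "Beta GmbH"},
]
index = index_statements(statements)
```

This only works if the consumer has a reliable label for each publisher, which is itself an identifier problem.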

sebbacon commented 7 years ago

Q: is the only reason not to use guids that they're hard to generate? On its own I don't consider that a particularly strong reason. Locally unique ids are arguably just as hard, or possibly harder, to generate. And we must assume that dirty data will be produced with conflicting ids whatever model we take.

jpmckinney commented 7 years ago

I thought a uuid was easy to generate... Can we elaborate how guids are hard? I think the challenges are surmountable.
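
For illustration, Python's standard `uuid` module produces a random (version 4) UUID in a single call. The variable names below are assumptions for the example only:

```python
import uuid

# Random (version 4) UUID in the usual hyphenated form, 36 characters long.
statement_id = str(uuid.uuid4())

# The same kind of UUID as bare hexadecimal, 32 characters long.
hex_id = uuid.uuid4().hex

print(statement_id)
print(hex_id)
```

Generation is the easy part; the harder part is storing the value so that the same statement keeps the same identifier on every publication.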

jpmckinney commented 7 years ago

(Hope I captured this correctly:) @CountCulture raises the issue on today's call that we don't want to end up in a situation where a publisher maintains locally unique IDs in their systems, then only creates globally unique IDs at the time of publishing. The IDs would preferably be native to the publishers' systems to avoid any drift between the locally and globally unique IDs.

timgdavies commented 7 years ago

I've provided some updated documentation at http://beneficial-ownership-data-standard.readthedocs.io/en/latest/identifiers.html#statement-ids on statement IDs.

Feedback welcome.

sebbacon commented 7 years ago

it enforces a minimum length of 32 characters (the length of a hexadecimal UUID) in order to avoid use of ids that are likely to fail a uniqueness test

Is that a necessary condition? If we want to support URIs on the basis that a publisher controls their own domain as a local namespace, the requirement for a 32 character identifier would make that a rather less useful option.

Perhaps it is simpler just to mandate uuids?

timgdavies commented 7 years ago

I did consider that it would be possible (but unlikely) to have a URI shorter than 32 characters that uniquely identifies a statement (the shortest I could envisage would be 15 characters: http://abc.co/1), but thought that the benefit of warning against genuinely bad IDs (e.g. '12345678' or 'statement-22') outweighed that. We could also validate IDs in another way, or add validation that doesn't place a length limit on URIs.
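
A rough sketch of what validation that exempts URIs from the length rule could look like, in Python. The function name and the choice to accept only http and https URIs are assumptions for illustration, not the schema's actual rule:

```python
from urllib.parse import urlparse

MIN_LENGTH = 32  # length of a hexadecimal UUID without hyphens


def is_acceptable_id(identifier):
    """Accept http(s) URIs of any length; otherwise require MIN_LENGTH characters."""
    parsed = urlparse(identifier)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return True
    return len(identifier) >= MIN_LENGTH


for candidate in ["http://abc.co/1", "12345678", "statement-22"]:
    print(candidate, is_acceptable_id(candidate))
```

Under these assumptions the 15-character URI passes, while '12345678' and 'statement-22' are rejected.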

As for mandating UUIDs on their own (and validating accordingly), the reason I see for not going that route is that it would require publishers to keep track of the UUID they have assigned to each statement. Allowing a UUID prefix followed by an internal identifier lets unique identifiers be constructed from data that already exists in systems which do not use UUIDs internally.
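
To make the prefix approach concrete, here is a minimal Python sketch. The prefix value, the separator and the function name are all hypothetical; the only assumption is that a publisher generates one UUID, keeps it fixed, and appends its own internal identifier:

```python
import uuid

# Hypothetical prefix: generated once by the publisher, then never changed.
PUBLISHER_PREFIX = "2f7bd8a1-6c1e-4f0a-9d3b-5e8c0a4b7d12"


def build_statement_id(internal_id):
    """Join the fixed publisher prefix and an existing internal identifier."""
    return f"{PUBLISHER_PREFIX}-{internal_id}"


# The prefix itself must be a well-formed UUID.
uuid.UUID(PUBLISHER_PREFIX)

print(build_statement_id("12345"))
```

Because the result is derived from values the publisher already holds, the same internal record always yields the same statement identifier without any extra lookup table.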

As for whether we should just drop support for URLs: we could do that, as I think (though I have not checked carefully) it would be possible for JSON-LD implementations to declare a root namespace against which IDs would resolve into URIs.

sebbacon commented 7 years ago

Picking a minimum key length while allowing arbitrary repetition is a poor guarantor of uniqueness, I suspect.
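
A small Python illustration of the concern, assuming a validator that checks nothing but length (the function name is made up for the example):

```python
MIN_LENGTH = 32


def passes_length_check(identifier):
    """Hypothetical validator that only enforces the minimum length."""
    return len(identifier) >= MIN_LENGTH


# Two different publishers each pad their local ID "1" with zeros.
publisher_a_id = "1".rjust(MIN_LENGTH, "0")
publisher_b_id = "1".rjust(MIN_LENGTH, "0")

print(passes_length_check(publisher_a_id))
print(publisher_a_id == publisher_b_id)
```

Both padded identifiers satisfy the length rule, yet they are identical to each other, so the check says nothing about uniqueness.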

Also, given that our model is append-only, whereas (probably?) most producers' models will be CRUD, then they're likely to have to maintain a mapping between our UIDs and their "versions" of a given record in any case.

An alternative implementation of uid could be a hash of the record?
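
As a sketch of what I mean, assuming the record can be serialised to JSON and that SHA-256 over a key-sorted serialisation is an acceptable choice (both are assumptions, as are the function and field names):

```python
import hashlib
import json


def record_hash_id(record):
    """Derive an identifier from the record's content.

    Keys are sorted and whitespace is fixed so that the same content
    always serialises to the same bytes.
    """
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical records: same content, different key order, then an edit.
original = {"name": "Alpha Ltd", "jurisdiction": "GB"}
reordered = {"jurisdiction": "GB", "name": "Alpha Ltd"}
edited = {"name": "Alpha Limited", "jurisdiction": "GB"}

print(record_hash_id(original) == record_hash_id(reordered))
print(record_hash_id(original) == record_hash_id(edited))
```

Any change to the record produces a new identifier, which fits an append-only model, but two publishers emitting byte-identical records would also get the same identifier.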

timgdavies commented 7 years ago

We've got some basic guidance into the beta for identifiers, but recognise more work needs to be done here.

I've updated the issue title and tagged it to ensure it is addressed in 1.0RC.

timgdavies commented 6 years ago

Related to #44

kathryn-ods commented 6 months ago

https://standard.openownership.org/en/latest/schema/guidance/statement-identifiers.html I think this guidance on statement identifiers answers this question

openownership / data-standard

Strengthening schema validation of statement identifiers #11