Unique IDs

kinlane commented 7 years ago

Opening up a discussion around a universal unique ID system for records that all vendors can participate in, allowing a single record to have a unique ID across systems. Need to conduct more research on best practices, and engage in conversation with vendors.

greggish commented 7 years ago

cc @klambacher who once described a v interesting approach that CIOC takes for this...

klambacher commented 7 years ago

Our IDs are a combination of a 3-letter code identifying the managing Agency (effort is made to avoid Agency Code re-use) and a 4- or 5-digit number, that we call a NUM. Like so: ABC0001. This numbering system itself has been in use for about 40 years, and we've used it for bidirectional data exchange in CIOC (including between several 3rd party systems) for 13 years. Internally, as an implementation detail, we also keep an unchangeable auto-number integer ID.
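A minimal sketch of checking and splitting an ID in that format. The uppercase-only rule and the names `NUM_PATTERN`, `is_num`, and `parse_num` are assumptions for illustration, not CIOC's actual implementation:

```python
import re

# Assumed shape: three uppercase letters (Agency code) followed by a
# 4- or 5-digit number, e.g. "ABC0001".
NUM_PATTERN = re.compile(r"([A-Z]{3})([0-9]{4,5})")


def is_num(record_id: str) -> bool:
    """Return True if the whole string is a NUM-format ID."""
    return NUM_PATTERN.fullmatch(record_id) is not None


def parse_num(record_id: str):
    """Split a NUM-format ID into (agency_code, number)."""
    match = NUM_PATTERN.fullmatch(record_id)
    if match is None:
        raise ValueError(f"not a NUM-format ID: {record_id!r}")
    return match.group(1), int(match.group(2))
```

For example, `parse_num("ABC0001")` splits the ID into the Agency code `"ABC"` and the number `1`.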

There are a few advantages to this system:

1) Portability. Record numbers are designed to be unique across all systems; if the same record number exists in another system it should be a copy of the original record.

2) Traceability. The Agency code part of the numbering system provides some traceability back to the Agency who created the resource (record owner Agency, not the service). This is kept separate from a field that includes the 3-letter owner code only, which can change as record provenance changes without having to change the record number. Records are allowed to be manually re-numbered (e.g. if re-assigned to a new owner) but this is very strongly discouraged.

3) Memorability. The record number is short and memorable (easier for those using the 4-digit version), meaning that staff often have naturally memorized numbers of important records without actually trying to do so, and it is easy to communicate about records using their NUMs between staff and also members of the public. It is also harder to make a manual transcription mistake vs a long integer ID.

4) Performance. This is admittedly an implementation detail, but through some very extensive testing we found surprising performance gains for NUMs vs integer IDs when dealing with large recordsets (to the point that we switched over some years ago to make this our primary key, and our integer IDs became the secondary ID in all tables).

We do import from systems that don't support the NUM. For those cases, we have an "external ID" field that holds untyped data as text (for flexibility, so we can accept any variety of ID). If a record comes in for import without a NUM-format ID, a new NUM is generated and the incoming ID is kept in the external ID field for future matching and updates. The 3-letter Agency code is still required for all records being imported, even for non-NUM ID records, and is used to generate the NUM. We also keep a source database field separate from the Agency code, meaning that ID, source database, and Agency owner code (maintainer) can all change independently as needed and still keep traceability of the record over time.
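The import flow described above can be sketched roughly as follows. Everything here is hypothetical: the `Importer` and `Record` names, the per-Agency counter used to generate NUMs, and the choice to match external IDs per source database are assumptions, not a description of the real system:

```python
import re
from dataclasses import dataclass
from itertools import count
from typing import Optional

NUM_RE = re.compile(r"[A-Z]{3}[0-9]{4,5}")  # assumed NUM shape


@dataclass
class Record:
    num: str                           # primary NUM-format ID
    agency_code: str                   # current owner; may change later
    source_db: str                     # database the record came from
    external_id: Optional[str] = None  # original non-NUM ID, kept as text


class Importer:
    def __init__(self):
        self.records = {}      # num -> Record
        self.by_external = {}  # (source_db, external_id) -> num
        self._counters = {}    # agency_code -> counter of numbers to try

    def _next_num(self, agency_code):
        # Assumed generation scheme: next unused number for the Agency.
        counter = self._counters.setdefault(agency_code, count(1))
        while True:
            candidate = f"{agency_code}{next(counter):04d}"
            if candidate not in self.records:
                return candidate

    def import_record(self, incoming_id, agency_code, source_db):
        if NUM_RE.fullmatch(incoming_id):
            # Already a NUM: keep it as the record number.
            num, external_id = incoming_id, None
        else:
            # Not a NUM: reuse the NUM from an earlier import of the same
            # external ID, or generate a new one from the Agency code.
            key = (source_db, incoming_id)
            num = self.by_external.get(key) or self._next_num(agency_code)
            self.by_external[key] = num
            external_id = incoming_id
        record = Record(num, agency_code, source_db, external_id)
        self.records[num] = record
        return record
```

Re-importing the same external ID from the same source database resolves to the NUM generated the first time, which is what makes later matching and updates possible.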

I would expect that, if this concept moved beyond Canada, there would be a need to a) extend the size of the Agency codes and therefore the available pool and b) maintain a master list of codes somewhere for exchange, where people could register their code, look up codes, etc. Historically people ping me to ask if anyone is using a code, but I couldn't do that for thousands of additional record owner agencies! lol

kinlane commented 7 years ago

@klambacher great detail. will be weaving into my research of other existing industry standards, and making a recommendation.

NeilMcKLogic commented 6 years ago

Having imported data now from many hundreds of external databases and systems, I can say that we need to support the possibility of long alphanumeric UIDs because there is so much variation in what people use.

If we want the UIDs to be universally unique then we either need some centralized arbiter handing out IDs (can't see that happening anytime soon) or each vendor and/or dataset instance needs its own UID to be prefixed before record IDs, or concatenated across those fields to create one UUID per record.
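As a rough sketch of the second option, a vendor ID, dataset ID, and record ID can be concatenated and then hashed into a deterministic UUID with Python's standard `uuid.uuid5`. The namespace value, the `:` separator, and the function names are assumptions made for illustration:

```python
import uuid

# Hypothetical shared namespace; any fixed UUID that all participants
# agree on would work equally well.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def prefixed_id(vendor_id: str, dataset_id: str, record_id: str) -> str:
    """Concatenate the three parts into one string key.

    Assumes none of the parts contains the ':' separator.
    """
    return f"{vendor_id}:{dataset_id}:{record_id}"


def derived_uuid(vendor_id: str, dataset_id: str, record_id: str) -> uuid.UUID:
    """Derive a name-based (version 5) UUID from the concatenated key."""
    return uuid.uuid5(NAMESPACE, prefixed_id(vendor_id, dataset_id, record_id))
```

Because the UUID is derived from the key rather than generated randomly, the same vendor, dataset, and record always map to the same value, so no system has to store a separate lookup table to reproduce it.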

Relatedly, not all software systems (including mine) give UIDs to the child record types in HSDS, like Accessibility and Eligibility, that the spec calls for. For now we've just been populating those fields with the UID of the parent record, but that is not a UUID.

And I'll resist the temptation to assert that best practice in choosing a UID type is to use a monotonically increasing integer. Let's save that for the happy hour session at the First Annual Open Referral conference that @greggish needs to organize in Hawaii someday.

greggish commented 6 years ago

roger that, adding to my tiki bar agenda wishlist

NeilMcKLogic commented 6 years ago

On another note, when one does an Insert, is the requester expected to provide a UID, or should the Response include one that is auto-assigned by the receiving system?

What if the request provides a UID that conflicts with one that already exists in that system? Should it yield an error message and also maybe suggest a new UID that would not conflict, rather than have the poor requestor systematically guess at alternative UIDs?

@kinlane I think the spec needs a position on this.
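One possible position, sketched with an in-memory store: accept a caller-supplied UID when it is free, assign one when none is given, and on conflict raise an error that carries an unused suggestion. The `Store` and `IdConflict` names and the use of random UUIDs for assigned IDs are assumptions, not anything the spec defines:

```python
import uuid


class IdConflict(Exception):
    """Raised when a supplied UID already exists; carries a free one."""

    def __init__(self, uid, suggestion):
        super().__init__(f"UID {uid!r} already exists; try {suggestion!r}")
        self.uid = uid
        self.suggestion = suggestion


class Store:
    def __init__(self):
        self.records = {}  # uid -> record data

    def _unused_uid(self):
        # Random UUIDs practically never collide, but check anyway.
        while True:
            candidate = str(uuid.uuid4())
            if candidate not in self.records:
                return candidate

    def insert(self, data, uid=None):
        """Insert a record and return the UID it was stored under."""
        if uid is None:
            uid = self._unused_uid()
        elif uid in self.records:
            raise IdConflict(uid, self._unused_uid())
        self.records[uid] = data
        return uid
```

In an HTTP API the conflict branch would more likely become an error response with the suggestion in the body, but the decision logic would be the same.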

timgdavies commented 6 years ago

I think the key question here is whether data providers or data consumers/aggregators/API backends are responsible for ensuring the uniqueness of identifiers.

If data providers are responsible, there are three broad patterns we could adopt:

1) GUID - require that publishers create and store a GUID against each of their records, and use this when communicating with other systems about those records. This relies on the low statistical chance of identifier clash to ensure uniqueness.

2) Publisher-defined prefixes - in which the publisher chooses a prefix to apply to their internal IDs to make them unique, much like the agency code @klambacher mentions above - although to avoid these clashing with agency codes from other parts of the world, when exchanging data via HSDS / OR APIs it might be appropriate to include CA on the front of these IDs as well.

3) Registered prefix - this is the approach we take in Open Contracting and 360Giving to provide a light-weight process to register and get a unique prefix that can be prepended onto any identifiers to make them globally unique. Linked Data approaches use URIs to defer to the DNS system the task of 'registering' a prefix. OCDS and 360Giving provide their own light-weight registration process, which might also be an option here for Open Referral to provide a simple place to 'claim' certain prefixes and avoid prefix clash.

The GUID pattern is the most distributed. The registered prefix approach the most centralised - but that centralisation can be good for keeping track of the community of users of a common spec and standard, and need not involve heavy processes.
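To make the three provider-side patterns concrete, here is a small sketch. The `-` separator, the function names, and the `REGISTERED_PREFIXES` table (standing in for whatever registry might exist) are all hypothetical:

```python
import uuid


def guid_id() -> str:
    """GUID pattern: a random identifier, unique by statistical chance."""
    return str(uuid.uuid4())


def publisher_prefixed_id(prefix: str, local_id: str, country: str = "") -> str:
    """Publisher-defined prefix, optionally with a country code in front."""
    parts = [country, prefix, local_id] if country else [prefix, local_id]
    return "-".join(parts)


# Stand-in for a central registry of claimed prefixes (hypothetical data).
REGISTERED_PREFIXES = {"xx-example": "Example Publisher"}


def registered_prefixed_id(prefix: str, local_id: str) -> str:
    """Registered prefix: only prefixes claimed in the registry are valid."""
    if prefix not in REGISTERED_PREFIXES:
        raise KeyError(f"prefix {prefix!r} is not registered")
    return f"{prefix}-{local_id}"
```

The only difference between the second and third functions is the registry lookup, which is where the extra centralisation comes in.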

If data consumers are responsible, then they need to assign each incoming source of data with a prefix, and use that in their internal records of the ID, whilst recognising that two sources might both have identical identifiers for different resources.
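On the consumer side, that might look something like the following sketch, where the aggregator hands each source its own prefix and stores records under the combined key. The `Aggregator` class, the `srcN` prefix scheme, and the example URLs are assumptions for illustration:

```python
class Aggregator:
    def __init__(self):
        self.source_prefixes = {}  # source URL -> assigned prefix
        self.records = {}          # internal ID -> record data

    def register_source(self, source_url):
        """Return the prefix for a source, assigning one on first use."""
        if source_url not in self.source_prefixes:
            prefix = f"src{len(self.source_prefixes) + 1}"
            self.source_prefixes[source_url] = prefix
        return self.source_prefixes[source_url]

    def ingest(self, source_url, source_id, data):
        """Store a record under '<source prefix>:<source's own ID>'."""
        internal_id = f"{self.register_source(source_url)}:{source_id}"
        self.records[internal_id] = data
        return internal_id


agg = Aggregator()
first = agg.ingest("https://a.example/api", "1", {"name": "Food bank"})
second = agg.ingest("https://b.example/api", "1", {"name": "Shelter"})
# Both sources used the ID "1", but the internal IDs differ.
```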

On the question of whether the data provider or the API should assign an identifier for an INSERT: unless we go with a strict GUID approach, I would suggest that both options need to be available, as in some cases a provider may be synchronising existing identified records, and in other cases creating brand new records.

openreferral / api-specification

Unique IDs #35