Namespaced identifiers - Githubissues

parkan commented 8 years ago

We need some way to store and match namespaced identifiers for purposes of deduplication and lookup. Three basic cases:

1. Well-formed, fully qualified, semantic identifier (globally unique): doi:10.1000/182
2. Well-formed, but not semantically qualified id (still almost certainly globally unique, barring errors): ISBN 978-7-5366-9293-0
3. Arbitrary internal id: id: 12345

I am guessing we can preserve the original namespace labels for the first two categories, and have to synthesize them for the third.

Question: are the namespaces stored as prefixes (ns:id) or as separate fields? Given that the number of namespaces can grow almost w/o bound, can OrientDB support these as separate fields on heterogeneous objects?
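To make the three cases and the prefix option concrete, here is a minimal Python sketch of normalizing raw identifiers into a (namespace, value) pair that can also be rendered as a single ns:id string. Everything in it is an assumption for illustration: the `NamespacedId` type, the `normalize` function, the two regex patterns, and the `local.<source>` convention for synthesized namespaces are hypothetical and not part of L-SPACE or OrientDB.

```python
import re
from typing import NamedTuple


class NamespacedId(NamedTuple):
    namespace: str  # lower-cased label, e.g. "doi" or "isbn"
    value: str      # identifier with the namespace label stripped

    def key(self) -> str:
        """Single-string ns:id form, usable as one indexed field."""
        return f"{self.namespace}:{self.value}"


# Hypothetical patterns; a real deployment would need a fuller registry.
_DOI = re.compile(r"^doi:\s*(10\.\d{4,9}/\S+)$", re.IGNORECASE)
_ISBN = re.compile(r"^isbn[:\s]*([0-9][0-9Xx\- ]{8,16})$", re.IGNORECASE)


def normalize(raw: str, source: str) -> NamespacedId:
    """Map a raw identifier to (namespace, value).

    `source` names the dataset the record came from and is used to
    synthesize a namespace for arbitrary internal ids (case 3).
    """
    text = raw.strip()
    m = _DOI.match(text)
    if m:
        # DOI names are case-insensitive, so lower-case them for matching.
        return NamespacedId("doi", m.group(1).lower())
    m = _ISBN.match(text)
    if m:
        # Drop hyphens and spaces so differently formatted ISBNs compare equal.
        digits = re.sub(r"[\s-]", "", m.group(1)).upper()
        return NamespacedId("isbn", digits)
    # No recognisable namespace: synthesize one from the source dataset.
    local = re.sub(r"^id:\s*", "", text, flags=re.IGNORECASE)
    return NamespacedId(f"local.{source.lower()}", local)
```

With the prefix form, deduplication only ever needs one indexed string field per identifier, regardless of how many namespaces exist; the separate-fields form would instead add one property per namespace.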

yusefnapora commented 8 years ago

Is it crazy to think of these ids as vertexes in their own right?

Like you'd have an ISBN vertex class, or a TateCollectionId, etc... with an IdentifiedBy edge between the blob and the identifier vertex...

That probably is crazy, since you need to traverse the edge to match the ids... and we'd end up with a huge number of ID vertices.

But it would let us have an unbounded set of namespaces, each with its own index on the actual id field. And it means you could attach multiple ids to a single blob. The NYPL collection, for example, has at least three different ids for each artifact :)
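A rough in-memory Python sketch of that shape, just to show how per-namespace indexes and multiple ids per blob would behave. The `IdGraph` class and its method names are made up for illustration; this is plain Python standing in for the graph, not OrientDB code.

```python
from collections import defaultdict


class IdGraph:
    """Toy model: blobs, identifier vertices, and IdentifiedBy edges."""

    def __init__(self):
        # One index per namespace: namespace -> id value -> set of blob keys.
        # Following an index entry stands in for traversing IdentifiedBy
        # from the identifier vertex back to its blobs.
        self._index = defaultdict(lambda: defaultdict(set))
        # Forward edges: blob key -> set of (namespace, value) vertices.
        self._ids_of = defaultdict(set)

    def identify(self, blob, namespace, value):
        """Add an IdentifiedBy edge from `blob` to the (namespace, value) vertex."""
        self._index[namespace][value].add(blob)
        self._ids_of[blob].add((namespace, value))

    def lookup(self, namespace, value):
        """Return all blobs carrying this identifier (empty set if none)."""
        return set(self._index.get(namespace, {}).get(value, ()))

    def duplicates_of(self, blob):
        """Other blobs sharing at least one identifier with `blob`."""
        found = set()
        for namespace, value in self._ids_of.get(blob, ()):
            found |= self._index[namespace][value]
        found.discard(blob)
        return found
```

For example, after `g.identify("a", "isbn", "9787536692930")` and `g.identify("b", "isbn", "9787536692930")`, calling `g.duplicates_of("a")` returns the set containing "b". Each new namespace only adds another entry to the outer dictionary, which is the "unbounded set of namespaces" property; the cost is the extra hop through the identifier on every match.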

parkan commented 8 years ago

Given that IdentifiedBy is the only relationship we expect to have with this object type, I don't feel that making them first class is appropriate. I was also tempted to store the fully qualified namespace URI/version/etc. on the node, then realized that this should definitely be normalized out.

Probably warrants a quick read of the OrientDB docs.

mediachain / L-SPACE

Namespaced identifiers #26