phenopackets / phenopacket-format

Document mechanism for referencing entities within and across documents #22

Open cmungall opened 8 years ago

cmungall commented 8 years ago

The current proposal generally favors referencing entities by key over nesting them. This has some advantages: the representation of an entity can be moved to a different document and referenced from the ppkt (for example, pedigree info in a PED file, variant info in a VCF file, admin info in hospital records). At the same time, the format currently allows these to be represented directly in the packet, for convenience. Recall also that the format is not just for cases, but also for entities such as variants, genes, genotypes, etc. that may be represented in standard biomedical and bioinformatics databases.

Regardless of the entity type, we have four different scenarios:

  1. referencing an entity within the same ppkt document
  2. referencing an external entity in a separate ppkt document
  3. referencing an external entity in a non-ppkt document, e.g. a VCF or PED file, or in some transient database
  4. referencing an external entity in a database that mints stable identifiers

Especially for 4, a global, unambiguous scheme is paramount. We will use CURIEs here, with a set of default prefixes and the ability to add more (THIS NEEDS TO BE DOCUMENTED).
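To make the CURIE idea concrete, here is a minimal sketch of prefix expansion with a default prefix map that a document can extend. The function name `expand_curie`, the contents of `DEFAULT_PREFIXES`, and the `extra_prefixes` parameter are all assumptions for illustration; the actual default prefix set has not been decided.

```python
# Illustrative default prefix map (an assumption; the real defaults are
# not yet documented). Each prefix maps to a URI base.
DEFAULT_PREFIXES = {
    "HP": "http://purl.obolibrary.org/obo/HP_",
    "OMIM": "https://omim.org/entry/",
}


def expand_curie(curie, extra_prefixes=None):
    """Expand a CURIE such as 'HP:0000118' into a full URI.

    extra_prefixes lets a document declare prefixes beyond the defaults;
    document-level prefixes override defaults with the same name.
    """
    prefixes = {**DEFAULT_PREFIXES, **(extra_prefixes or {})}
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in prefixes:
        raise KeyError(f"unregistered prefix: {prefix!r}")
    return prefixes[prefix] + local_id
```

For example, `expand_curie("HP:0000118")` resolves against the defaults, while a document that declares its own prefix would pass it in via `extra_prefixes`. An unregistered prefix fails loudly rather than being passed through, which is one possible policy, not a settled one.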

For the other cases, requiring either pre-registered prefixes or a URL scheme may be onerous. For case 1, a global scheme is not strictly required, as identifiers can remain local (so long as this is clearly indicated, and client code makes no assumption that they are global). One idea is to use something semantically equivalent to blank nodes (existential variables) in RDF. Currently the variant example uses a blank node, with the RDF convention of '_' as the prefix. This is potentially confusing (@pnrobinson had a question about this). We could use a different convention here. (In this particular example, where we are referencing a variant, we can obviate the requirement by having a convention for universal global URIs for variants.)

It may be better to simply enforce urn:uuids here (see https://en.wikipedia.org/wiki/Uniform_Resource_Name#Examples).
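As a rough sketch of what the urn:uuid option could look like, the snippet below detects document-local identifiers written with the RDF-style `_:` prefix and assigns each one a freshly minted urn:uuid. The helper names `is_local` and `globalize` are hypothetical, and treating `_:` as the local marker is an assumption taken from the current variant example.

```python
import uuid


def is_local(identifier):
    """True if the identifier uses the blank-node style '_:' prefix."""
    return identifier.startswith("_:")


def globalize(identifiers):
    """Map each distinct local identifier to a new urn:uuid.

    Identifiers that are already global (e.g. CURIEs) are left out of
    the mapping, so callers can leave them unchanged.
    """
    mapping = {}
    for identifier in identifiers:
        if is_local(identifier) and identifier not in mapping:
            # uuid4() is random, so the result differs on every run.
            mapping[identifier] = uuid.uuid4().urn
    return mapping
```

Calling `globalize(["_:variant1", "HP:0000118"])` would return a mapping with a single entry for `_:variant1` whose value starts with `urn:uuid:`. Whether the rewrite happens at write time or only on export is a separate design question.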

Cases 2 and 3 may be more difficult. We can simply ban 2. Case 3 is hard because we may not be in control of how external formats handle identifiers. We may need a two-part scheme: a way to reference a particular document, plus a format-specific local scheme for entities within that document, with us referencing an entity by concatenating the two parts.
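One way to picture the two-part scheme is a (document, local id) pair that serializes to a single string. This is only a sketch: the class name `ExternalRef`, the `#` separator, and the example file name and local id are assumptions, and a real scheme would need per-format rules for what counts as a local id in a VCF or PED file.

```python
from typing import NamedTuple


class ExternalRef(NamedTuple):
    """Reference to an entity inside an external, non-ppkt document."""

    document: str  # how to locate the document, e.g. a path or URL
    local_id: str  # format-specific identifier within that document

    def to_string(self, sep="#"):
        """Concatenate the two parts into a single reference string."""
        return f"{self.document}{sep}{self.local_id}"

    @classmethod
    def parse(cls, text, sep="#"):
        """Split on the last separator, so the local id must not contain it."""
        document, found, local_id = text.rpartition(sep)
        if not found or not document or not local_id:
            raise ValueError(f"not a two-part reference: {text!r}")
        return cls(document, local_id)


# Hypothetical example: an individual in a PED file.
ref = ExternalRef("family1.ped", "FAM001/IND002")
```

Keeping the pair structured and only concatenating at serialization time means the separator choice stays in one place, which may matter if some external format already uses `#` in its identifiers.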