The current proposal generally favors referencing entities by key over nesting them. This has some advantages: the representation of an entity can be moved to a different document and referenced from the ppkt (for example, pedigree info in a PED file, variant info in a VCF file, admin info in hospital records). At the same time, the format currently allows these entities to be represented directly in the packet, for convenience. Recall also that the format is not just for cases, but also for entities such as variants, genes, and genotypes that may be represented in standard biomedical and bioinformatics databases.
Regardless of the entity type, we have 4 different scenarios:
1. referencing an entity within the same ppkt document
2. referencing an external entity in a separate ppkt document
3. referencing an external entity in a non-ppkt document, e.g. a VCF or PED file, or in some transient database
4. referencing an external entity in a database that mints stable identifiers
Especially for 4, an unambiguous global scheme is paramount. We will use CURIEs here, with a set of default prefixes and the ability to add more (THIS NEEDS TO BE DOCUMENTED).
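To make the CURIE idea concrete, here is a minimal sketch of prefix expansion with a default prefix map that callers can extend. The names `DEFAULT_PREFIXES` and `expand_curie` and the example prefixes are assumptions for illustration only; the proposal has not yet fixed the actual defaults.

```python
# Hypothetical default prefix map; the real set of defaults is still undocumented.
DEFAULT_PREFIXES = {
    "HP": "http://purl.obolibrary.org/obo/HP_",
}


def expand_curie(curie, extra_prefixes=None):
    """Expand a CURIE such as 'HP:0001250' into a full URI.

    extra_prefixes lets a document register prefixes beyond the defaults.
    """
    prefixes = {**DEFAULT_PREFIXES, **(extra_prefixes or {})}
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in prefixes:
        raise KeyError(f"unregistered prefix: {prefix!r}")
    return prefixes[prefix] + local_id
```

For example, `expand_curie("LAB:42", {"LAB": "https://example.org/lab/"})` would resolve against the extra prefix, while an unregistered prefix fails loudly instead of being guessed.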
For the other cases, requiring either pre-registered prefixes or a URL scheme may be onerous. For case 1 it is not strictly required, as identifiers can remain local (so long as this is clearly indicated, and client code makes no assumption that they are global). One idea is to use something semantically equivalent to blank nodes (existential variables) in RDF. Currently the variant example uses a blank node, with the RDF convention of '_' as the prefix. This is potentially confusing (@pnrobinson had a question about this), and we could use a different convention here. (In this particular example, where we are referencing a variant, we can obviate the requirement by having a convention for universal global URIs for variants.)
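As a rough sketch of how client code could avoid treating local identifiers as global, the snippet below recognizes the blank-node style `_:` prefix and resolves such identifiers only against entities in the same document. `LOCAL_PREFIX`, `is_local_id`, `resolve_local`, and the dictionary layout are hypothetical, not part of the proposal.

```python
# RDF blank-node convention; the proposal may settle on a different marker.
LOCAL_PREFIX = "_"


def is_local_id(identifier):
    """Return True if the identifier is document-local (blank-node style)."""
    prefix, sep, local_part = identifier.partition(":")
    return sep == ":" and prefix == LOCAL_PREFIX and local_part != ""


def resolve_local(identifier, document_entities):
    """Look up a local identifier among the entities of the same document only."""
    if not is_local_id(identifier):
        raise ValueError(f"not a local identifier: {identifier!r}")
    return document_entities[identifier]
```

Keeping the lookup scoped to one document's entities is what makes the identifier "existential": it names something in this packet and nothing outside it.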
Cases 2 and 3 may be more difficult. We can simply ban 2. Case 3 is hard because we may not be in control of how external formats handle identifiers. We may need a bipartite scheme: a way to reference a particular document, plus a format-specific local scheme for entities within that document, with references formed by concatenating the two parts.
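The bipartite scheme could be sketched as a (document, local id) pair that is concatenated into a single reference string. The `#` separator, the `ExternalRef` type, and the example local id are all assumptions for illustration; the actual local scheme would depend on the external format.

```python
from typing import NamedTuple


class ExternalRef(NamedTuple):
    document: str  # how the external document is referenced, e.g. a path or URI
    local_id: str  # format-specific identifier for the entity inside that document

    def to_string(self, separator="#"):
        """Concatenate the two parts into one reference string."""
        return f"{self.document}{separator}{self.local_id}"


def parse_external_ref(text, separator="#"):
    """Split a reference string back into its two parts.

    Splits on the last separator, so the local id must not contain it.
    """
    document, sep, local_id = text.rpartition(separator)
    if not sep or not document or not local_id:
        raise ValueError(f"not a two-part reference: {text!r}")
    return ExternalRef(document, local_id)
```

For instance, `ExternalRef("family1.ped", "FAM001/IND003").to_string()` yields `family1.ped#FAM001/IND003`, and `parse_external_ref` recovers the pair.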
For case 1, it may be better to simply enforce urn:uuids for local identifiers (see https://en.wikipedia.org/wiki/Uniform_Resource_Name#Examples ).
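The urn:uuid suggestion for local identifiers (case 1) could look roughly like this, using Python's standard `uuid` module. The helper names `mint_local_id` and `is_uuid_urn` are hypothetical.

```python
import uuid


def mint_local_id():
    """Mint a fresh identifier of the form 'urn:uuid:<random UUID>'."""
    return uuid.uuid4().urn


def is_uuid_urn(identifier):
    """Return True if the identifier is a well-formed urn:uuid."""
    if not identifier.lower().startswith("urn:uuid:"):
        return False
    try:
        uuid.UUID(identifier)  # the constructor accepts the urn:uuid: form
    except ValueError:
        return False
    return True
```

Unlike a blank-node label, such an identifier is unique without any prefix registration, so it stays unambiguous even if the entity is later moved to another document.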