tskit-dev / tskit

Population-scale genomics
MIT License
153 stars 72 forks source link

Standard metadata schemas by default #1051

Open benjeffery opened 3 years ago

benjeffery commented 3 years ago

Recent developments in msprime here and issues about standard metadata keys (#495) prompted @jeromekelleher and I to discuss standardised metadata schemas for entities in a tskit tree sequence, and having these be a default. It seems the way ahead is now clear enough that a straw-person proposal for how this could work can be put together. We'd appreciate feedback on how people see this fitting into their workflows and other software, mainly thinking from an "end-user" perspective where it is both easy to use, and hard to write bad code.

tskit would provide a default, standard schema for each entities' metadata, including the top-level. These could live at tskit.metadata.NodeDefaultSchema for example. As a starting point these schemata would use the JSON codec and allow either None or a dict as the root type. No keys would be required and arbitrary keys could be present. In JSONSchema this looks like:

tskit.metadata.MetadataSchema({
    'codec': 'json',
    'type': ['object', 'null'],
    'required': [],
    'additionalProperties': True,
    })

These schemata could then be extended by tools and end-users with a function that might look like (#1045):

populations.metadata_schema = populations.metadata_schema.extend(name="region", type="string", description="Region associated with this population")

Note that this allows schemata to be modified without the user needing to write JSONSchema.

tskit.TableCollection() would, by default, populate a TableCollection with these schemata. This would be a breaking change, requiring a major version bump. As assigning bytes to metadata would now fail. To aid in upgrading we could have a no_metadata_schemas arg to TableColletion.__init__.

The default metadata for add_row would remain None. The JSONCodec would need to be changed to interpret b"" as None to avoid wasting 4 bytes "null" per row on empty metadata.

Similarly to extending a schema there would be a method to update existing metadata that also warns on overwrite (#1044).

As they are agreed standard keys could be added to these schemata for example name and description for populations as suggested in the msprime code:

tskit.metadata.MetadataSchema({
    'codec': 'json',
    'type': ['object', 'null'],
    'required': [],
    'additionalProperties': True,
    'properties': {'name': {'type': 'string',
                   'pattern': '^[A-Za-z_][A-Za-z0-9_]*'},
                   'description': {'type': ['string', 'null']}},
    })

If a standard name is used on a call to extend the schema then a warning is raised and the key overwritten. The user then knows that they are using a key that is standardised. This is the process that will happen when a commonly used key gets standardised. Tools/users will then remove their definitions of that key, if they want to use the standard.

I think this plan allows both ease of use for the end-user, while allowing us to standardise the important parts of metadata. The one aspect I am not sure about is allowing "additionalProperties":True as this lets metadata that has no entry in the schema to be stored. I tend towards allowing this as it seems too much friction to get a user to define a schema for each piece of metadata, clearly developers of tools in the ecosystem should always ensure that the metadata produced has an entry in the schema as it provides a way to document what the metadata is.

jeromekelleher commented 3 years ago

Thanks @benjeffery, this is a great summary. I'm also fairly relaxed about "additionalProperties":True - in general, we should be lenient and get out of people's way when it comes to adding metadata.

jeromekelleher commented 3 years ago

This comment has a useful clarification on what the required field can be used for in applications. The default schemas should define what the well-known standard fields are, but should not require them. Applications that write metadata MAY mark any metadata fields that are guaranteed to be present in their own output as required as a way of communicating useful information about the structure of the data to other software which may be processing it.

petrelharp commented 3 years ago

I guess the practical import of this is that - by default - now metadata will be like a dict instead of some bytes. That seems like it will make downstream stuff much easier to do, without worrying about the whole schema thing. So, I think this is a great idea (and agree that additionalProperties should be True).

benjeffery commented 3 years ago

I have some concerns about codec compatibility. As it stands SliM-produced tree sequences are using the struct codec which has additionalProperties: false as a requirement. If tools assume they can treat metadata as a dict then they will fail for such tree sequences.

petrelharp commented 3 years ago

This metadata scheme is totally not compatible with SLiM's, it's true. But no other metadata scheme is going to be, unless we rewrite the struct codec. I'm not sure that keeping metadata harder to work with (ie, making people think explicitly about schema) is a good solution to that?

We could provide a method in pyslim (or even tskit) that converts the metadata over to a schema based on the json codec, letting people then add things.

jeromekelleher commented 3 years ago

I think we can accomodate SLiM's way of doing things too. But I wonder if we're focusing a bit too much on the rogue-program-updates-the-metadata case rather than the "program that produces the metadata" case. I'm not sure there will be all that many packages that update metadata, and for things that do (like pyslim) we can add functionality to tskit to make this as easy and painless as possible.

For the metadata-producer side of things, I'm leaning towards having a set of basic schemas defined as classes in tskit. The most basic would be LenientJSON, which would just say the codec is JSON and you can put whatever you want in there. Then maybe we'd have BasePopulationSchema which just defined name and description - we can imagine doing similar things for the top level and other entities.

A metadata producer like msprime then would do something like

tables = tskit.TableCollection()
schema = tskit.BasePopulationSchema()
schema.set_required("name", "description") # These will always be defined in msprime output.
tables.populations.metadata_schema = schema

For tsinfer we'll need to wait until version 0.3 to start doing things properly in terms of metadata schemas, I think. For now, the user will have the option of adding their own schema which will be used for encoding/decoding the metadata.