Open benjeffery opened 3 years ago
Thanks @benjeffery, this is a great summary. I'm also fairly relaxed about `"additionalProperties": True` - in general, we should be lenient and get out of people's way when it comes to adding metadata.

This comment has a useful clarification on what the `required` field can be used for in applications. The default schemas should define what the well-known standard fields are, but should not require them. Applications that write metadata MAY mark any metadata fields that are guaranteed to be present in their own output as `required`, as a way of communicating useful information about the structure of the data to other software which may be processing it.
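As a concrete sketch of that convention (the field contents here are assumptions for illustration, not a fixed proposal), an application that always writes a `name` key could advertise that guarantee like so:

```python
# Hypothetical schema fragment for a tool that guarantees "name" in its output.
population_schema = {
    "codec": "json",
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "The population name."},
    },
    "required": ["name"],          # guaranteed present in this tool's output
    "additionalProperties": True,  # other software may still add keys
}
```

Downstream software reading this schema then knows it can rely on `name` being present, while remaining free to add its own keys.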
I guess the practical import of this is that, by default, metadata will now behave like a `dict` instead of raw `bytes`. That seems like it will make downstream stuff much easier to do, without worrying about the whole schema thing. So, I think this is a great idea (and agree that `additionalProperties` should be `True`).
I have some concerns about codec compatibility. As it stands, SLiM-produced tree sequences use the `struct` codec, which has `additionalProperties: false` as a requirement. If tools assume they can treat metadata as a `dict`, then they will fail for such tree sequences.
This metadata scheme is totally not compatible with SLiM's, it's true. But no other metadata scheme is going to be, unless we rewrite the `struct` codec. I'm not sure that keeping metadata harder to work with (i.e., making people think explicitly about schemas) is a good solution to that?
We could provide a method in pyslim (or even tskit) that converts the metadata over to a schema based on the `JSON` codec, letting people then add things.
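A rough sketch of that conversion idea (the decoder argument stands in for whatever the `struct` codec provides; none of these names are real pyslim/tskit API):

```python
import json

def convert_rows_to_json(encoded_rows, decode_struct_row):
    """Decode each row's struct-codec metadata and re-encode it as JSON bytes."""
    return [
        json.dumps(decode_struct_row(raw)).encode("utf-8")
        for raw in encoded_rows
    ]
```

After converting the rows, the table's schema would be replaced with a `JSON`-codec schema, at which point users can add new keys freely.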
I think we can accommodate SLiM's way of doing things too. But I wonder if we're focusing a bit too much on the rogue-program-updates-the-metadata case rather than the "program that produces the metadata" case. I'm not sure there will be all that many packages that update metadata, and for things that do (like pyslim) we can add functionality to tskit to make this as easy and painless as possible.
For the metadata-producer side of things, I'm leaning towards having a set of basic schemas defined as classes in tskit. The most basic would be `LenientJSON`, which would just say the codec is JSON and you can put whatever you want in there. Then maybe we'd have `BasePopulationSchema`, which would just define `name` and `description` - we can imagine doing similar things for the top level and other entities.
A metadata producer like msprime would then do something like:

```python
tables = tskit.TableCollection()
schema = tskit.BasePopulationSchema()
schema.set_required("name", "description")  # These will always be defined in msprime output.
tables.populations.metadata_schema = schema
```
For tsinfer we'll need to wait until version 0.3 to start doing things properly in terms of metadata schemas, I think. For now, the user will have the option of adding their own schema which will be used for encoding/decoding the metadata.
Recent developments in msprime here and issues about standard metadata keys (#495) prompted @jeromekelleher and I to discuss standardised metadata schemas for entities in a `tskit` tree sequence, and having these be a default. It seems the way ahead is now clear enough that a straw-person proposal for how this could work can be put together. We'd appreciate feedback on how people see this fitting into their workflows and other software, mainly thinking from an "end-user" perspective where it is both easy to use, and hard to write bad code.

`tskit` would provide a default, standard schema for each entity's metadata, including the top level. These could live at `tskit.metadata.NodeDefaultSchema`, for example. As a starting point, these schemata would use the `JSON` codec and allow either `None` or a `dict` as the root type. No keys would be `required` and arbitrary keys could be present. In `JSONSchema` this looks like:

These schemata could then be extended by tools and end-users with a function that might look like (#1045):
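As a sketch of both pieces (the exact JSONSchema body and the extension helper here are assumptions for illustration, not the actual proposal), the default schema and an extension call might look like:

```python
import copy
import json
import warnings

# Hypothetical default schema: JSON codec, root of dict or None,
# no required keys, arbitrary additional keys allowed.
default_schema = {
    "codec": "json",
    "type": ["object", "null"],
    "properties": {},
    "required": [],
    "additionalProperties": True,
}

# Hypothetical extension helper (#1045): return a copy of the base schema
# with extra properties merged in, warning when a key is already defined.
def extend_schema(base, new_properties):
    schema = copy.deepcopy(base)
    for key, definition in new_properties.items():
        if key in schema["properties"]:
            warnings.warn(f"Overwriting existing definition of {key!r}")
        schema["properties"][key] = definition
    return schema

extended = extend_schema(
    default_schema,
    {"tool_version": {"type": "string", "description": "Version that wrote this row."}},
)
print(json.dumps(extended, indent=2))
```

Note that the base schema is copied rather than mutated, so the tskit-provided defaults stay pristine for other tables.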
Note that this allows schemata to be modified without the user needing to write `JSONSchema`.

`tskit.TableCollection()` would, by default, populate a `TableCollection` with these schemata. This would be a breaking change, requiring a major version bump, as assigning `bytes` to metadata would now fail. To aid in upgrading, we could have a `no_metadata_schemas` arg to `TableCollection.__init__`.

The default metadata for `add_row` would remain `None`. The `JSONCodec` would need to be changed to interpret `b""` as `None`, to avoid wasting 4 bytes (`"null"`) per row on empty metadata.

Similarly to extending a schema, there would be a method to update existing metadata that also warns on overwrite (#1044).
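A minimal sketch of that empty-metadata behaviour (assumed semantics, not tskit's actual codec implementation):

```python
import json

def json_encode(obj):
    # Store None as zero bytes rather than the 4-byte b'"null"'.
    return b"" if obj is None else json.dumps(obj).encode("utf-8")

def json_decode(data):
    # Interpret empty bytes as None on the way back out.
    return None if data == b"" else json.loads(data.decode("utf-8"))
```

This round-trips `None` at zero storage cost per row, while ordinary values still encode as standard JSON.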
As they are agreed, standard keys could be added to these schemata - for example, `name` and `description` for populations, as suggested in the msprime code:

If a standard name is used on a call to `extend` the schema, then a warning is raised and the key overwritten. The user then knows that they are using a key that is standardised. This is the process that will happen when a commonly used key gets standardised: tools/users will then remove their definitions of that key, if they want to use the standard.

I think this plan allows ease of use for the end-user, while allowing us to standardise the important parts of metadata. The one aspect I am not sure about is allowing `"additionalProperties": True`, as this lets metadata that has no entry in the schema be stored. I tend towards allowing this, as it seems too much friction to get a user to define a schema for each piece of metadata; clearly, developers of tools in the ecosystem should always ensure that the metadata they produce has an entry in the schema, as it provides a way to document what the metadata is.
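For illustration, the standardised population keys mentioned above could be schema entries like the following (the descriptions are placeholders, not msprime's actual text):

```python
# Hypothetical standard property definitions for population metadata.
standard_population_properties = {
    "name": {
        "type": "string",
        "description": "A human-readable name for this population.",
    },
    "description": {
        "type": "string",
        "description": "A longer free-text description of this population.",
    },
}
```

Tools defining their own `name` key would then drop their definition and pick up this standard entry instead.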