tskit-dev / tsinfer

Infer a tree sequence from genetic variation data.
GNU General Public License v3.0
56 stars 13 forks source link

Add tsinfer specific schema for node metadata. #416

Closed jeromekelleher closed 3 months ago

jeromekelleher commented 3 years ago

We can add a Struct schema to the NodeTable which will let us efficiently store information about the inference process. Currently, this should contain "ancestor_data_id" (default = -1) and "sample_data_id" (default=-1). I made a start on doing this, but it turned out to be quite messy, mainly due to the lack of default value handling in tskit metadata.

Once these two issues are resolved we have another look at this:

https://github.com/tskit-dev/tskit/issues/1073 https://github.com/tskit-dev/tskit/issues/1084

It's not clear whether the struct schema of JSON would end up being most efficient here, but we should start with struct anyway just to see how it works out.

This is purely an interface for tsinfer to encode information about the nodes - tsinfer users will not have the option of setting or updating the schema.

Follow on from #392.

hyanwong commented 2 years ago

I'm revisiting this as (a) those two tskit issues have now been fixed and (b) I'm thinking about how tsdate adds metadata to nodes (specifically, the mean estimated time for a node and the variance associated with that time). At the moment this is simply dumped into the node metadata, which probably overwrites any existing values.

The need to add to the struct by tsdate makes me thing that perhaps the permissive JSON schema would be best for node metadata in tsinfer?

Alternatively, since tsdate is primarily going to be used by tsinfer, we could have some logic in tsdate that checks whether the schema matches tsinfer.node_schema and if so, swapps it for a schema which also includes space for the mean time and variance values. This would be more fiddly, but much more efficient in terms of space, I think.

hyanwong commented 2 years ago

P.s. just to note that sample data files don't contain metadata that can be set on nodes (only on individuals), so tsinfer is free to use whatever schema it wants for node metadata, without stomping on any user data.

hyanwong commented 2 years ago

I think the schema should look like this:

metadata.MetadataSchema(
    {
        "codec": "struct",
        "type": ["object", "null"],
        "properties": {
            "ancestor_data_id": {
                "description": "",
                "type": "integer",
                "binaryFormat": "i",
                "default": -1,
            },
            "sample_data_id": {
                "description": "Date of sample collection in ISO format",
                "type": "integer",
                "binaryFormat": "i",
                "default": -1,
            },
        },
        "required": ["ancestor_data_id", "sample_data_id"],
        "additionalProperties": False,
    }
)
jeromekelleher commented 2 years ago

I'm hesitant to use a struct schema to be honest, wouldn't JSON be a a lot simpler?

Not sure why we'd set defaults as well as marking them as required?

Looks like your descriptions are off?

hyanwong commented 2 years ago

I'm hesitant to use a struct schema to be honest, wouldn't JSON be a a lot simpler?

You said in the initial text of this issue "we should start with struct anyway just to see how it works out". Note that for big inferences, it might be quite tedious storing the JSON for each node (27 million ancestors in the unified genealogy, right?)

Not sure why we'd set defaults as well as marking them as required?

I thought they had to be required for a struct, but would take the default if not given? I probably misunderstood the meaning of the two or the interaction between them though.

Looks like your descriptions are off?

Oops, yes. Thanks for catching this.

jeromekelleher commented 2 years ago

Yes, that's true. 27M is quite a few, so the difference would be a few hundred megabytes. All right, I think that's a good justification for going with struct, let's try it out.

hyanwong commented 2 years ago

Given that this is not a metadata field controlled by the user, I agree that we should try struct first. https://github.com/tskit-dev/tskit/issues/2129 might be a useful thing, which could be used if the user really wanted to change the type to JSON for later modification.

hyanwong commented 2 years ago

If we are using a schema, then we can also save this stuff under a "tsinfer" key, without taking up lots of space, which I think is a lot cleaner. So I propose using a schema like this:

schema = {
    "codec": "struct",
    "type": ["object", "null"],
    "properties": {
        "tsinfer": {
            "description": "Information about node identity from the tsinfer inference process",
            "type": "object",
            "default": {"ancestor_data_id": -1, "sample_data_id": -1},
            "properties": {
                "ancestor_data_id": {
                    "description":
                        "The corresponding ancestor ID "
                        "in the ancestors file created by the inference process, "
                        "or -1 if not applicable",
                    "type": "number",
                    "binaryFormat": "i",
                    "default": -1,
                },
                "sample_data_id": {
                    "description": 
                        "The corresponding sample ID "
                        "in the sample data file used for inference, "
                        "or -1 if not applicable",
                    "type": "number",
                    "binaryFormat": "i",
                    "default": -1,
                },
            },
        },
    },
    "additionalProperties": False,
}