tskit-dev / pyslim

Tools for dealing with tree sequences coming to and from SLiM.
MIT License
26 stars 23 forks source link

Annotating tree sequence with pre-existing (node) metadata can lead to erroneous metadata #327

Open clwgg opened 10 months ago

clwgg commented 10 months ago

I think this might actually be a tskit problem, but I stumbled across it in a pyslim context so I thought I'd report here first, and we can escalate to tskit if needed.

The gist is that if a tree sequence comes with pre-existing metadata (of a certain format), it seems like this metadata can lead to corrupted metadata after annotation. I came across this in a context of a tree sequence output from tsinfer, in which some nodes have metadata of the format b'{"ancestor_data_id": 1}'.

For testing purposes we'll create such a tree sequence "artificially":

import numpy as np
import tskit
import msprime
import pyslim

ts = msprime.sim_ancestry(
    samples=10,
    population_size=10,
    sequence_length=1e6,
    recombination_rate=1e-6,
)

tables = ts.dump_tables()
nms = tables.nodes.metadata_schema
tables.nodes.packset_metadata([
    nms.validate_and_encode_row(b'{"ancestor_data_id": 1}')
    for _ in ts.nodes()
])
ts = tables.tree_sequence()

This results in the expected node metadata, which matches what some nodes look like after a tree sequence is inferred by tsinfer:

tables = ts.dump_tables()
tables.nodes[0]
NodeTableRow(flags=1, time=0.0, population=0, individual=0, metadata=b'{"ancestor_data_id": 1}')
tables.nodes[-1]
NodeTableRow(flags=0, time=80.59032096595365, population=0, individual=-1, metadata=b'{"ancestor_data_id": 1}')

Once we annotate these tables, the metadata for sample nodes is replaced in the way SLiM wants it to be:

pyslim.annotate_tables(tables, model_type="WF", tick=1)
tables.nodes[0]
NodeTableRow(flags=1, time=0.0, population=0, individual=0, metadata={'slim_id': 0, 'is_null': False, 'genome_type': 0})

Non-sample nodes, however, carry erroneous metadata that doesn't seem to make much sense:

tables.nodes[-1]
NodeTableRow(flags=0, time=80.59032096595365, population=0, individual=-1, metadata={'slim_id': 8391162008449393275, 'is_null': True, 'genome_type': 114})

I've tracked this through the pyslim.annotate code, and ended up finding that it appears to happen once a metadata schema is set for nodes with metadata formatted in this way:

tables = ts.dump_tables()
tables.nodes[-1]
NodeTableRow(flags=0, time=80.59032096595365, population=0, individual=-1, metadata=b'{"ancestor_data_id": 1}')
tables.nodes.metadata_schema = pyslim.slim_metadata_schemas['node']
tables.nodes[-1]
NodeTableRow(flags=0, time=80.59032096595365, population=0, individual=-1, metadata={'slim_id': 8391162008449393275, 'is_null': True, 'genome_type': 114})

This is what makes me think this might actually be a tskit issue, since just setting the metadata schema leads to the corruption, which doesn't seem like something that pyslim has a ton to do with. However, since metadata for non-sample nodes are just passed through pyslim.annotate, these malformed metadata end up in the annotated tree sequence if the tree sequence happened to contain non-sample nodes with metadata of this format.

petrelharp commented 10 months ago

Gee, thanks for the report here, @clwgg - #317 is looking more urgent all the time.

I can't look at this until next week, but will have a look then. But, it sounds like changing the schema isn't doing consistency checking, which it probably should, to avoid this sort of thing (unless we decide that "changing a schema" is a user-beware sort of operation, in which case pyslim should be checking for existing metadata).