Open clwgg opened 10 months ago
Gee, thanks for the report here, @clwgg - #317 is looking more urgent all the time.
I can't look at this until next week, but will have a look then. It sounds like changing the schema doesn't do any consistency checking against existing metadata, which it probably should, to avoid this sort of thing (unless we decide that "changing a schema" is a user-beware sort of operation, in which case pyslim should be checking for existing metadata).
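To make the consistency-checking idea a bit more concrete, here is a minimal sketch in plain Python. It is not tskit or pyslim code: the helper name `find_incompatible_metadata` and the fixed `record_size` of 10 bytes are assumptions made up for illustration, standing in for whatever a real fixed-width binary schema would require.

```python
def find_incompatible_metadata(entries, record_size):
    """Return indices of entries that could not be a record of `record_size` bytes.

    Hypothetical helper: `entries` is a list of raw metadata byte strings,
    and an empty entry is treated as "no metadata" and therefore allowed.
    """
    return [
        i for i, raw in enumerate(entries)
        if len(raw) not in (0, record_size)
    ]


# One empty entry, one JSON-style entry like the one in this report,
# and one entry that happens to have the assumed record size.
entries = [b"", b'{"ancestor_data_id": 1}', bytes(10)]
bad = find_incompatible_metadata(entries, record_size=10)
if bad:
    print(f"existing metadata does not fit the new schema at rows: {bad}")
```

A length check like this only catches the obvious mismatches; bytes of the right length could still decode to nonsense, so a real check would probably want to validate the decoded values as well.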
I think this might actually be a `tskit` problem, but I stumbled across it in a `pyslim` context so I thought I'd report here first, and we can escalate to `tskit` if needed.

The gist is that if a tree sequence comes with pre-existing metadata (of a certain format), it seems like this metadata can lead to corrupted metadata after annotation. I came across this in a context of a tree sequence output from `tsinfer`, in which some nodes have metadata of the format `b'{"ancestor_data_id": 1}'`.

For testing purposes we'll create such a tree sequence "artificially":
This results in the expected node metadata, which matches what some nodes look like after a tree sequence is inferred by `tsinfer`:

Once we annotate these tables, the metadata for sample nodes is replaced in the way SLiM wants it to be:
Non-sample nodes, however, carry erroneous metadata that doesn't seem to make much sense:
I've tracked this through the `pyslim.annotate` code, and ended up finding that it appears to happen once a metadata schema is set for nodes with metadata formatted in this way:

This is what makes me think this might actually be a `tskit` issue, since just setting the metadata schema leads to the corruption, which doesn't seem like something that `pyslim` has a ton to do with. However, since metadata for non-sample nodes are just passed through `pyslim.annotate`, these malformed metadata end up in the annotated tree sequence if the tree sequence happened to contain non-sample nodes with metadata of this format.
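One plausible way to picture what is going on, assuming the newly set schema decodes metadata as a fixed binary record, is that the stored bytes stay the same and only their interpretation changes. The sketch below uses only the Python standard library; the `"<q?B"` layout is a hypothetical stand-in chosen for illustration and is not claimed to be the actual SLiM node schema or the way `tskit` decodes it.

```python
import json
import struct

# The raw node metadata as it comes out of tsinfer (from the report above).
raw = b'{"ancestor_data_id": 1}'

# Read as JSON, the bytes decode to the dictionary that was intended.
as_json = json.loads(raw.decode("utf-8"))
print(as_json)

# Hypothetical binary layout (an assumption, not the real schema):
# little-endian int64, then a bool, then an unsigned byte = 10 bytes.
layout = struct.Struct("<q?B")

# Read through that layout, the very same bytes produce values that
# have nothing to do with the original content.
fields = layout.unpack(raw[:layout.size])
print(fields)
```

Nothing is rewritten in this sketch: the bytes are untouched and only the reader changes, which would be consistent with the corruption showing up as soon as the schema is set.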