tskit-dev / tsinfer

Infer a tree sequence from genetic variation data.
GNU General Public License v3.0

Methods to compare Individuals, Populations, and Samples with tskit equivalents #325

Open hyanwong opened 4 years ago

hyanwong commented 4 years ago

I would like to compare a Population as returned from a sample_data file with a Population in a tree sequence. In particular, I would like to test for equality, excluding the id field (perhaps we might call this "equivalence", as in population_equivalent(sd_pop, ts_pop)). A similar thing goes for an Individual (see also #324).

Should I wait for sgkit to do this, or is it worth implementing a quick hack now? And what's the best way to do it - can I e.g. simply use attr.asdict?

hyanwong commented 4 years ago

I reckon I can do something like this:

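import json
import attr

def exclude_id(attribute, value):
    # Filter for attr.asdict: keep every attribute except "id"
    return attribute.name != "id"

def population_equivalent(sd_pop, ts_pop):
    # Assumes `sd` (sample data) and `ts` (tree sequence) are in scope
    d1 = {k: v for k, v in ts.population(ts_pop).__dict__.items() if k != "id"}
    sd_dict = attr.asdict(sd.population(sd_pop), filter=exclude_id)
    # Encode sample-data metadata as JSON bytes to match the tree sequence
    d2 = {k: json.dumps(v).encode() if k == "metadata" else v
          for k, v in sd_dict.items()}
    return d1 == d2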

The only issue is where there are attributes in the sample data file that are not in the tree sequence, or vice versa. In particular, I'm thinking about the individuals_time value for sample data files, which has no equivalent in an individual in a tree sequence until #322 (and after that will require special treatment).
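
One possible way around the mismatched-attribute problem is to compare only the fields that both objects share. The sketch below works on plain dicts rather than real tsinfer or tskit objects. The helper name dicts_equivalent, its shared_only flag, and the example field names ("time", "metadata") are hypothetical and chosen for illustration, not taken from either library. It also assumes the sample data metadata is a JSON-serialisable object and the tree sequence metadata is JSON-encoded bytes.

```python
import json


def dicts_equivalent(sd_fields, ts_fields, exclude=("id",), shared_only=True):
    """Compare two field dicts, ignoring `exclude` and, optionally, unshared keys.

    Hypothetical helper: operates on plain dicts, not tsinfer/tskit objects.
    """
    a = {k: v for k, v in sd_fields.items() if k not in exclude}
    b = {k: v for k, v in ts_fields.items() if k not in exclude}
    if shared_only:
        # Drop attributes that exist on only one side (e.g. a time value)
        common = a.keys() & b.keys()
        a = {k: a[k] for k in common}
        b = {k: b[k] for k in common}
    # Assumption: sample-data metadata is a Python object, ts metadata is bytes
    if "metadata" in a and not isinstance(a["metadata"], bytes):
        a["metadata"] = json.dumps(a["metadata"]).encode()
    return a == b


# Illustrative stand-ins for an individual from each source
sd_individual = {"id": 0, "metadata": {"name": "A"}, "time": 0.0}
ts_individual = {"id": 5, "metadata": b'{"name": "A"}'}

print(dicts_equivalent(sd_individual, ts_individual))
print(dicts_equivalent(sd_individual, ts_individual, shared_only=False))
```

Restricting the comparison to shared keys sidesteps the missing time value, at the cost of silently ignoring any attribute present on only one side, so a strict mode (shared_only=False here) may still be useful in tests.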