tskit-dev / tskit

Population-scale genomics
MIT License
153 stars 72 forks source link

table collection equivalence #896

Closed mufernando closed 3 years ago

mufernando commented 4 years ago

Right now, equality of table collections is defined in a very strict manner (see here). All tables are compared, in addition to sequence length, metadata and metadata schema.

In discussions with @gtsambos, @petrelharp, @benjeffery and @hyanwong we came to the conclusion that it would be helpful to have a less strict equality (maybe equivalence) comparison, in which the metadata, metadata schema and provenance tables are ignored. I think this equivalence operation is crucial, because in many cases we are interested in comparing the "biological" content of tree sequences, without worrying about how they were made to be.

Importantly, since the metadata attributes were added tskit.TableCollection.union stopped working, because there we subset tree sequences (produced by different runs of simulations) on their shared nodes and test for equality, which will fail because of metadata.

I propose we add two new flags to tsk_table_collection_equals to ignore the metadata/schema and provenance tables when comparing: TSK_IGNORE_METADATA and TSK_IGNORE_PROVENANCE.

Does this sound reasonable?

benjeffery commented 4 years ago

Sounds good, thanks for writing up the issue. Do you see use cases where this is needed in the python API? As we can't pass options to __eq__ we would need to think of a method name .

When you say union now fails, can you give an example? It should be added as a test case. I'm a bit confused as to why it should fail as metadata columns existed before the recent work.

petrelharp commented 4 years ago

Sounds good to me!

I'm a bit confused as to why it should fail as metadata columns existed before the recent work.

This is top-level metadata making it fail.

And, I think the proposal is to ignore only the top-level metadata, not the metadata columns, right @mufernando?

Do you see use cases where this is needed in the python API?

Yes, in testing at least. Maybe we add a tskit.table_collections_equivalent(ts1, ts2, **kwopts) method?

hyanwong commented 4 years ago

Yes, in testing at least. Maybe we add a tskit.table_collections_equivalent(ts1, ts2, **kwopts) method?

There are a few places where we clear provenances before comparing equality. We could use this function there, or (probably better) call things like simplify() with record_provenance=False in the tests.

mufernando commented 4 years ago

Sounds good, thanks for writing up the issue. Do you see use cases where this is needed in the python API? As we can't pass options to __eq__ we would need to think of a method name .

Yes, in testing at least. Maybe we add a tskit.table_collections_equivalent(ts1, ts2, **kwopts) method?

I can add that method to the python side. I think it could be helpful in testing, as @hyanwong pointed above.

When you say union now fails, can you give an example? It should be added as a test case. I'm a bit confused as to why it should fail as metadata columns existed before the recent work.

It is top-level metadata, not metadata columns. I cannot add a test to this because it only happens with SLiM simulations. But I think what I meant by that is that union is "broken", because our test for equality in there is actually a test of equivalence. I don't care about the top-level metadata and other stuff when I test for shared histories here.

mufernando commented 4 years ago

Solved in #897 and #904

hyanwong commented 3 years ago

A trivial fix after chatting with @benjeffery - this function should probably use keyword-only args (i.e. def equals(self, tc, *, ...), as tc.equals(tc2, False, True, True, False) is not terribly obvious. It would be nice to do this before release, so I'm temporarily reopening this issue.