tskit-dev / tskit

Population-scale genomics
MIT License
155 stars 73 forks source link

sort so mutation parents are above children #1232

Open petrelharp opened 3 years ago

petrelharp commented 3 years ago

Now #1199 does this for individuals (yay!) but we still don't have a method, besides canonicalise, that will do this. Seems like it should be a standard part of sort. We could:

  1. decide that people who need this should use canonicalise, or
  2. copy the relevant code over from canonicalise, or
  3. be fancier and make it optional for performance reasons

I think the main impact on performance would be memory use? And, it's probably not a big deal?

I think this is the same thing as #1227, but perhaps what's happening there is different (and thus a bug).

jeromekelleher commented 3 years ago

Would we do this using the same topological sort algorithm as individuals @petrelharp? If so, I'd vote for taking the same strategy as individuals: doing the O(n) topological sort as for TableCollection.sort() and the more expensive version for canonicalise(). I'm sure we could abstract the topological sorting code out if we wanted to, but it may not be worth it.

The next question is, when do we need this? I think we can probably put this off until the next C release (i.e., not the impending one), since we have non-parent sorting code for mutations out in the wild already?

benjeffery commented 3 years ago

I'd also like to defer this until the next release as this one is already long in the tooth.

petrelharp commented 3 years ago

Yep, no hurry on this.

petrelharp commented 3 years ago

About the algorithm: if the number of mutations per site is k, the canonicalise implementation is O(n log k), so I don't think the topological sort has any practical advantages (it might even be slower in typical cases?).

jeromekelleher commented 3 years ago

About the algorithm: if the number of mutations per site is k, the canonicalise implementation is O(n log k), so I don't think the topological sort has any practical advantages (it might even be slower in typical cases?).

Yes, that's true, but consistency is nice? I find the topological sort algorithm very pretty, and I really doubt it could be slower, as calling qsort() lots of times is much more expensive than a bunch of linear passes through the arrays.

benjeffery commented 3 years ago

Note: The test at https://github.com/tskit-dev/tskit/blob/main/python/tests/test_tables.py#L4195 should be reinstated when this is fixed.

jeromekelleher commented 3 years ago

Adding this to the 1.0 release for now. I guess it's not a critical thing API wise, though?

jeromekelleher commented 3 years ago

@petrelharp we're triaging stuff for the C 1.0 API. Since this doesn't affect the API as such, I think we can probably drop it from this milestone (and move to C Upcoming). Unless you'd like to pick it up?

petrelharp commented 3 years ago

Sounds good.