tskit-dev / tskit

Population-scale genomics
MIT License
155 stars 73 forks source link

Remove ordering requirement for the individuals table #1774

Closed bhaller closed 3 years ago

bhaller commented 3 years ago

In #1138 it was decided to require an ordering for the individuals table: children must come after their individual parents. This has turned out to be a problem for SLiM, because it has its own ordering requirements for the individuals table which conflict with this policy. This was not noticed until now because SLiM wasn't using the parents column; but now that it is putting pedigree-based info in the parents column, now we have a conflict between the two ordering policies. Further discussion of that can be found at https://github.com/MesserLab/SLiM/pull/231.

Discussion on Slack indicates that nobody is actually using this ordering requirement for anything, and that the best solution might be to simply remove tskit's ordering requirement. Quoting @jeromekelleher throughout:

Short answer, "no" we're not doing anything with it, and actually, I was wondering if we had got it wrong too. It actually feels quite unnatural when you start working with it, especially if you were to simulate a pedigree in the correct time direction (i.e., backwards :wink:). You'd end up having to reverse the individual table, or else buffer all your individuals and write them in afterwards.

Are you talking about removing the sortedness requirement entirely, or reversing to children-before-parents?

I'm open to either.

...

To be honest I haven't found any advantage to having the sort order so far, it's basically just gotten in the way. Any time you try to rely on it, you end up just sorting by time in any case because weird corner cases crop up with ordering of edges otherwise.

Given that the number of individuals will typically be relatively small, that sort of thing isn't a big deal, so I'd not be particularly upset if we ditched the ordering requirement.

So. Any objections to simply removing this sorting requirement? @petrelharp @benjeffery not sure who else to tag...

gregorgorjanc commented 3 years ago

When building a pedigree-based relatedness matrix, outwith tskit, we want parents before progeny so that we can process pedigree records forward in time and accumulate relatedness coefficients appropriately. But one can do topological sort of the pedigree graph (DAG) to achieve this if needed.

petrelharp commented 3 years ago

When building a pedigree-based relatedness matrix, outwith tskit, we want parents before progeny so that we can process pedigree records forward in time and accumulate relatedness coefficients appropriately. But one can do topological sort of the pedigree graph (DAG) to achieve this if needed.

Well, it'd be a shame to put @benjeffery's nice code that does the sorting to waste, so we could still provide the functionality (e.g. tables.sort(individual_parent_sort=False)).

petrelharp commented 3 years ago

But I think if it's getting in the way we should definitely remove it.

bhaller commented 3 years ago

Hey campers. FYI, it would be good to know the decision on this one way or the other; some work on the SLiM side is waiting to know which way this will go. Implementation is no rush, but a thumbs-up/thumbs-down on the proposal would be helpful. :-> Thanks!

jeromekelleher commented 3 years ago

On balance I'm in favour of removing it. Here's my proposal:

So we basically let go of the idea that individual sortedness of any kind is part of the tree sequence requirements, and pull the current functionality out into methods of the individual table.

Any objections @benjeffery @molpopgen?

bhaller commented 3 years ago

Note that @molpopgen wrote on Slack: "We don't use this ordering requirement. When parent tracking is happening, it is buffered internally and the export to tskit is dealt with at the very end." So he seems fine with it.

benjeffery commented 3 years ago

Thanks for the discussion all, seems that removing the requirement is the right thing to do. Putting the sorting in a separate function as @jeromekelleher and @petrelharp suggests would be good, there's no point throwing away that code.

benjeffery commented 3 years ago

Hey campers. FYI, it would be good to know the decision on this one way or the other

This was posted at 8pm Friday my time! Give us a chance! 😂

bhaller commented 3 years ago

Hey campers. FYI, it would be good to know the decision on this one way or the other

This was posted at 8pm Friday my time! Give us a chance! 😂

Ah, yes, this "weekend" thing. I've heard of that. :->

No worries, just trying to keep the ball rolling. :-> Thanks!

molpopgen commented 3 years ago

The sorting requirement should be removed.

I don't check github notifications on weekends...

petrelharp commented 3 years ago

Sounds like we have a decision!

benjeffery commented 3 years ago

Cool, I can make these changes seeing it was my code in the first place.

benjeffery commented 3 years ago

I'm assuming we still want to use this ordering for tsk_table_collection_canonicalise?

jeromekelleher commented 3 years ago

Yes, I think so. Although it's probably too weak an ordering to actually be canonical, it's a good start and there's no good motivation to change the definition of tsk_table_collection_canonicalise here.

jeromekelleher commented 3 years ago

Can we close this now @benjeffery?

benjeffery commented 3 years ago

Sorry, so used to the github auto-close!