tskit-dev / tskit

Population-scale genomics
MIT License
152 stars 71 forks source link

Unstable individual table topological sort #1653

Open nikbaya opened 3 years ago

nikbaya commented 3 years ago

When tsk_individual_table_topological_sort is called, it may rearrange an individual table that is already in a topologically sorted order. For example, if we have two parents and their child in a table, the parents may switch IDs (see below for issue replication). The order of the parents does not matter in terms of the validity of the table, but it would be desirable to not sort individuals if there is no need.

Replication:

def add_individual(tables, time, parents=(-1, -1)):
    # Use flags to preserve original individual ID
    ind_id = tables.individuals.add_row(parents=parents, flags=len(tables.individuals))
    for _ in parents:
        tables.nodes.add_row(
            flags=tskit.NODE_IS_SAMPLE if time == 0 else 0,
            time=time,
            population=0,
            individual=ind_id,
    )
    return ind_id

tables = tskit.TableCollection(sequence_length=100)
tables.populations.add_row()

parents = [add_individual(tables, time=1) for _ in range(2)]
add_individual(tables, time=0, parents=parents)

print(tables.individuals)
# ╔══╤═════╤════════╤═══════╤════════╗
# ║id│flags│location│parents│metadata║
# ╠══╪═════╪════════╪═══════╪════════╣
# ║0 │    0│        │ -1, -1│     b''║
# ║1 │    1│        │ -1, -1│     b''║
# ║2 │    2│        │   0, 1│     b''║
# ╚══╧═════╧════════╧═══════╧════════╝
# Note: We've added flags that correspond to the original individual IDs

tables.sort()

print(tables.individuals)
# ╔══╤═════╤════════╤═══════╤════════╗
# ║id│flags│location│parents│metadata║
# ╠══╪═════╪════════╪═══════╪════════╣
# ║0 │    1│        │ -1, -1│     b''║
# ║1 │    0│        │ -1, -1│     b''║
# ║2 │    2│        │   1, 0│     b''║
# ╚══╧═════╧════════╧═══════╧════════╝
# Note: The flags correspond to the original individual IDs
jeromekelleher commented 3 years ago

@nikbaya and I spent a good bit scratching our heads over this one today - it's a bit surprising that we reverse the order of individuals in a "time slice", isn't it?

@nikbaya - I don't think we need to invoke msprime here, we can simply call tables.sort(), deleting the call to build_index and sim_ancsestry.

benjeffery commented 3 years ago

I wrote that sorting code back in March - from a quick look I can't see why this is happening. Will need to dig in. I'm not sure skipping sort if unneeded is desirable, it would require a table scan before sorting.

molpopgen commented 3 years ago

Any chance this is Linux vs macos again, where qsort is unstable on the latter?

petrelharp commented 3 years ago

We tried to make that sort stable...

benjeffery commented 3 years ago

If you do the example above and then sort again it doesn't swap them back, so i don't think it is unstable.

jeromekelleher commented 3 years ago

Stable sort means that things that are already sorted don't get reordered.

Do you have a link to the discussion about where we talked about the stability @petrelharp? It's fine if this is a "wontfix", but we should probably document it.

petrelharp commented 3 years ago

Here it is: https://github.com/tskit-dev/tskit/pull/1199 - ignore all the stuff about canonical sorting and subset. From that thread,

the sort here is decreasing by "minimum number of hops to a childless descendant, and then smallest position in the table among the closest childless descendant".

Hm, except now I think that should be "maximum" and "furthest", not "minimum" and "closest"? I'm not sure because of the "descreasing" bit.

However, that still doesn't explain why this example gets re-ordered. It must also have position in the list of parents of the closest/furthest childless descendant with the smallest ID" as a final sort key.

So, this is stable, but it's stable with respect to the stricter ordering defined above. This is fine (IMO), and indeed required if we want to have a canonical sort. We could make the argument that reversing the order of the final key (order in the list of parents) would be more natural, as that would make this example less surprising, though?