popgenmethods / SINGER

Sampling and inference of genealogies with recombination
MIT License
24 stars 4 forks source link

Set `tables.mutations.parent` column when converting to tskit #11

Open nspope opened 7 months ago

nspope commented 7 months ago

Currently, the tables.mutations.parent column is not set during conversion to tree sequence format. But when this is the case and there are multiple mutations at a site that "stack" (e.g. are descended from one another) then some tskit routines will bug out (see https://github.com/tskit-dev/tskit/issues/2923).

There's a helper function in tskit to compute the mutation parents:

tables = tskit.TableCollection()
# build tables here
tables.compute_mutation_parents()
tables.sort()
ts = tables.tree_sequence()

but, this routine requires that mutation IDs are ordered such that a given mutation's parent has a lower ID (e.g. was inserted into the table earlier). This isn't the case currently, but if the mutations are added in reverse time order at a given position, then the above should work.

YunDeng98 commented 7 months ago

Thanks a lot @nspope ! I will follow your suggestion and make a new converter.

nspope commented 7 months ago

Just a note to self that this might cause issues for tskit's stats with flipped mutations -- if I'm understanding SINGER output correctly, if a mutation is flipped, then SINGER introduces a new mutation above the root that changes the state from 0 -> 1 (so that the original mutation then changes the state back from 1 -> 0). I'm not sure if the missing mutations.parent attribute will screw up tskit stats in this case, or not (given the "parent" is above the root).

nspope commented 7 months ago

The code snippet above isn't quite right -- what seems to be the easiest is rebuilding the mutation table at the end, then sorting/indexing/computing parents. Here's what works:

# <build tables here>
# rebuild mutations table in time order at each position
mut_time = tables.nodes.time[tables.mutations.node]
mut_coord = tables.sites.position[tables.mutations.site]
mut_order = np.lexsort((-mut_time, mut_coord))
mut_state = tskit.unpack_strings(
    tables.mutations.derived_state, 
    tables.mutations.derived_state_offset,
)
mut_state, mut_state_offset = tskit.pack_strings(np.array(mut_state)[mut_order])
tables.mutations.set_columns(
    site=tables.mutations.site[mut_order],
    node=tables.mutations.node[mut_order],
    time=np.repeat(tskit.UNKNOWN_TIME, tables.mutations.num_rows),
    derived_state=mut_state,
    derived_state_offset=mut_state_offset,
)
tables.sort()
tables.build_index()                                                                                                                                                                                                                                   
tables.compute_mutation_parents()
ts = tables.tree_sequence()
YunDeng98 commented 7 months ago

Hi @nspope , thanks for your input on the mutation handling. I feel that my current implementation is not quite fulfilling all requirements. I happened to need to do some mutational analysis myself, so I will look into this systematically over the next few days. It seems that your previous answer is a good solution and I will test and incorporate it into a potential new converter.