Open nspope opened 7 months ago
Thanks a lot @nspope ! I will follow your suggestion and make a new converter.
Just a note to self that this might cause issues for tskit's stats with flipped mutations -- if I'm understanding SINGER output correctly, if a mutation is flipped, then SINGER introduces a new mutation above the root that changes the state from 0 -> 1 (so that the original mutation then changes the state back from 1 -> 0). I'm not sure if the missing mutations.parent attribute will screw up tskit stats in this case, or not (given the "parent" is above the root).
The code snippet above isn't quite right -- what seems to be the easiest is rebuilding the mutation table at the end, then sorting/indexing/computing parents. Here's what works:
# <build tables here>
# rebuild mutations table in time order at each position
mut_time = tables.nodes.time[tables.mutations.node]
mut_coord = tables.sites.position[tables.mutations.site]
mut_order = np.lexsort((-mut_time, mut_coord))
mut_state = tskit.unpack_strings(
tables.mutations.derived_state,
tables.mutations.derived_state_offset,
)
mut_state, mut_state_offset = tskit.pack_strings(np.array(mut_state)[mut_order])
tables.mutations.set_columns(
site=tables.mutations.site[mut_order],
node=tables.mutations.node[mut_order],
time=np.repeat(tskit.UNKNOWN_TIME, tables.mutations.num_rows),
derived_state=mut_state,
derived_state_offset=mut_state_offset,
)
tables.sort()
tables.build_index()
tables.compute_mutation_parents()
ts = tables.tree_sequence()
Hi @nspope , thanks for your input on the mutation handling. I feel that my current implementation is not quite fulfilling all requirements. I happened to need to do some mutational analysis myself, so I will look into this systematically over the next few days. It seems that your previous answer is a good solution and I will test and incorporate it into a potential new converter.
Currently, the
tables.mutations.parent
column is not set during conversion to tree sequence format. But when this is the case and there are multiple mutations at a site that "stack" (e.g. are descended from one another) then some tskit routines will bug out (see https://github.com/tskit-dev/tskit/issues/2923).There's a helper function in tskit to compute the mutation parents:
but, this routine requires that mutation IDs are ordered such that a given mutation's parent has a lower ID (e.g. was inserted into the table earlier). This isn't the case currently, but if the mutations are added in reverse time order at a given position, then the above should work.