kitchensjn / tskit_arg_visualizer

Interactive visualization method for ancestral recombination graphs
MIT License
11 stars 3 forks source link

Visualising ARGs inferred from the Kreitman dataset #41

Closed hyanwong closed 2 weeks ago

hyanwong commented 12 months ago

In our ARG paper, we show a number of ARGs inferred from the same classic D. melanogaster dataset (by Kreitman):

Screenshot 2023-09-20 at 17 43 47

It would be great to see how these can be displayed by the ts_arg_visualiser. I've zipped up the respective tree sequences below:

Kreitman_SNP_inferred_ARGs.zip

kitchensjn commented 12 months ago

KwARG output visualized with tskit_arg_visualizer - time

KwARG output visualized with tskit_arg_visualizer - rank

Starting with the simplest one, these are for the ARG from KwARG with y_axis_scale="time" and y_axis_scale="rank". For the first, I didn't move the nodes at all. I would say that the ordering of the tips is better in your version. The second one, I've tried to untangle as much as possible. I'm not sure why there are unary nodes (13, 20, 17, 16, etc) in this ARG, but haven't worked with the method before.

hyanwong commented 12 months ago

Hmm, perhaps we are creating a 2-RE-node ARG when converting from kwARG, but not setting the flags that you expect. I.e. it may be that 35 & 36 represent a single "recombination node".

Edit - yes, that is the problem: https://github.com/tskit-dev/what-is-an-arg-paper/blob/4ef243b086a715ad7c26aeeaec1783f189766f07/argutils/convert.py#L145 - see https://github.com/tskit-dev/what-is-an-arg-paper/issues/325

kitchensjn commented 12 months ago

Wouldn't we then expect nodes 35 and 36 to have the same time? I have node 35 at time=387592 and node 36 at time=144455.

hyanwong commented 12 months ago

Good point. Yes we would. I'm not sure why this happens then. I will investigate.

hyanwong commented 12 months ago

Ana points out that kwARG doesn't infer times, so these are somewhat arbitrary. The reason that we have 2 re nodes at different times is that to get semi-reasonable times out of the ARG, we ran the "tsdate" program on the ARG output from kwARG, and tsdate doesn't know to keep the 2 recombination nodes at the same time.

The simplest way around this would be to convert the kwARG output to a 1-RE-node format and keep the mapping of old -> new node IDs. We could then date the 1-re-node version, and port the times back.

This would be useful for redating "full" ARGs from whatever source. The 2-RE-node version is needed to calculate likelihoods under the CwR.

hyanwong commented 11 months ago

Apologies - it looks like I only ran the tsdate algorithm on my local copy. The main repo has this kwARG inference without explicitly dated nodes (just in rank order). I have updated the repo placing the correct node flags at https://github.com/tskit-dev/what-is-an-arg-paper/pull/332. A copy of the Kreitman-kwARG inference is below, and it gives a much nicer picture: Kreitman_SNP_kwarg.trees.zip

Screenshot 2023-09-25 at 21 32 29
hyanwong commented 11 months ago

This is about the best I can come up with (3 crossings):

Screenshot 2023-10-11 at 09 28 06
kitchensjn commented 10 months ago

Here's some starting points for visualizations of the different methods: kreitman_jsons.zip

The first and last trees from the tsinfer trees appear to be empty in the .trees file that I was using; I don't know if that was intentional or not. Also for most of the them, we are stuck using edge_type="line" until I work out the pathing for polytomies. I've added that to the milestone TODO list for v0.0.2. You can almost get away with "ortho" for argweaver, but as the D3ARG object is expecting 2-RE-format when using D3ARG.from_ts(), it doesn't quite work.

Let me know if you have trouble running any of the JSONs!

hyanwong commented 10 months ago

Thanks. We should have a 2 <-> 1 conversion routine, I guess (difficult for diamonds, though).

The tsinfer post-processing step deliberately cuts off the topology before the first site and after the last site, which is why there are empty regions at the start and end of most tsinferred tree sequences.

hyanwong commented 2 weeks ago

I think we are done here