tskit-dev / what-is-an-arg-paper

Manuscript and code for the "What is an ARG?" paper
1 stars 8 forks source link

SARGE conversion #33

Closed awohns closed 2 years ago

awohns commented 2 years ago

I imagine it would be useful to have a means of converting genealogies inferred by sarge to the tree sequence format for the figure comparing the output of inference methods? I don't think this has been done by anyone (doesn't look like it's in the sarge repo) so I'm happy to have a go at it.

jeromekelleher commented 2 years ago

Sounds good - can you paste in some information about the SARGE format (and maybe an example) here for discussion please?

jeromekelleher commented 2 years ago

@awohns - we discussed this the other day and I think the decision was that converting the SARGE output to an ARG would be too difficult, as they don't provide anything that's particularly ARG-like?

awohns commented 2 years ago

Sorry for the slow movement here @jeromekelleher. I just had a look through the supplementary methods and it looks like SARGE does, in fact, infer recombination nodes: these figures from the SM are the relevant ones. So I'll have another go at running the method (on a linux machine) and report back shortly. Screen Shot 2022-01-31 at 8 54 11 PM Screen Shot 2022-01-31 at 8 56 16 PM Screen Shot 2022-01-31 at 8 58 25 PM

awohns commented 2 years ago

Ok, so making progress: the following lines run Sarge on a single tree from mspms

mspms 100 1 --mutation-rate 1e-4 --population-size 1 10000 -T --random-seed 1 2 3 > testoutput.ms python2 utilities/ms2sarge_input.py -m testoutput.ms -o test_sarge_input --num_bases 100 bin/sarge --sites test_sarge_input.sites --input test_sarge_input.geno.gz -o test_output_sarge bin/index output.sarge seq:1-100 | bin/trees2newick -v output_test.haps > test_newick_output

I think the best plan here is to take two output files: newick tree and the recomb file:

which is a tab separated file of chromosome, start position, end position, upstream clade, downstream clade, and moving clade.

and then make an ARG out of it with tskit

Currently having an issue when using mspms with recombination, see the issue I've made in the Sarge repo, but once this is solved I make a conversion script.

jeromekelleher commented 2 years ago

Sounds great @awohns! Are the internal nodes in the SARGE newick labelled? Sounds like we're going to be constructing a tree sequence then from (essentially) a set of edge diffs?

awohns commented 2 years ago

Here's output for a single tree, doesn't look like it internal nodes are labeled

((Human-1-1:0.22222,(Human-1-2:0.11111,Human-2-1:0):0.11111):0.77778,Human-2-2:1.00000):0.00000;

And yes, I guess it'll look like edge diffs...

jeromekelleher commented 2 years ago

Hmmm - this really could be quite tricky. I bet there's internal assumptions about the ordering of nodes which are used to compute the source/dest IDs of the clades. I wonder if time is better spent elsewhere, since we can only show so many methods anyway?

awohns commented 2 years ago

You're probably right. Just to close the loop, this is what the output looks like with multiple trees.

ms and Sarge commands:

ms 4 1 -t 10.04 -r 100.0 10 > outputms.txt python2 utilities/ms2sarge_input.py -m outputms.txt -o test_sarge_input --num_bases 100 bin/sarge --sites test_sarge_input.sites --input test_sarge_input.geno.gz -o test_output_sarge cat test_output_sarge | bin/trees2newick -v output_test.haps > test_newick_output

The newick tree output looks like this:

seq 1 ((Human-1-2:0.50000,(Human-2-1:0,Human-2-2:0):0.00000):0.66667,Human-1-1:0.00000):0.00000; seq 13 ((Human-1-2:0.50000,(Human-2-1:0,Human-2-2:0):0.00000):0.66667,Human-1-1:0.00000):0.00000; seq 20 (Human-1-2:1.00000,((Human-2-1:0,Human-2-2:0):0.16667,Human-1-1:0.00000):0.90909):0.00000; seq 21 (Human-1-2:1.00000,((Human-2-1:0,Human-2-2:0):0.16667,Human-1-1:0.00000):0.90909):0.00000; seq 22 (Human-1-2:1.00000,((Human-2-1:0,Human-2-2:0):0.16667,Human-1-1:0.00000):0.90909):0.00000; seq 30 (Human-1-2:1.00000,((Human-2-1:0,Human-2-2:0):0.16667,Human-1-1:0.00000):0.90909):0.00000; seq 51 (Human-1-2:1.00000,((Human-2-1:0,Human-2-2:0):0.16667,Human-1-1:0.00000):0.90909):0.00000; seq 59 (Human-1-2:1.00000,((Human-2-1:0,Human-2-2:0):0.16667,Human-1-1:0.00000):0.90909):0.00000; seq 62 (Human-1-2:1.00000,((Human-2-1:0,Human-2-2:0):0.16667,Human-1-1:0.00000):0.90909):0.00000; seq 63 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20000,Human-1-1:0.00000):0.88889):0.00000; seq 69 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20000,Human-1-1:0.00000):0.88889):0.00000; seq 70 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20000,Human-1-1:0.00000):0.88889):0.00000; seq 74 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20000,Human-1-1:0.00000):0.88889):0.00000; seq 75 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20000,Human-1-1:0.00000):0.88889):0.00000; seq 79 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20000,Human-1-1:0.00000):0.88889):0.00000; seq 92 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20000,Human-1-1:0.00000):0.88889):0.00000; seq 95 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20000,Human-1-1:0.00000):0.88889):0.00000; seq 98 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20005,Human-1-1:0.00000):0.88886):0.00000;

I think there's a tree for every site by the way (hence all the duplicated trees).

The "recombination" file looks like this:

seq 13 20 1,2,3, 0,2,3, 2,3, seq 62 63 0,2,3, 0,1, 0,

The columns are chromosome, start position, end position, upstream clade, downstream clade, and moving clade.

jeromekelleher commented 2 years ago

It was definitely helpful to get the information here @awohns, so thanks for that. Totally up to you whether you want to pursue getting the conversion to work.

awohns commented 2 years ago

I agree that it likely won't be very productive it will be to pursue this further, so I'm happy to leave it!