Closed awohns closed 2 years ago
Sounds good - can you paste in some information about the SARGE format (and maybe an example) here for discussion please?
@awohns - we discussed this the other day and I think the decision was that converting the SARGE output to an ARG would be too difficult, as they don't provide anything that's particularly ARG-like?
Sorry for the slow movement here @jeromekelleher. I just had a look through the supplementary methods and it looks like SARGE does, in fact, infer recombination nodes: these figures from the SM are the relevant ones. So I'll have another go at running the method (on a linux machine) and report back shortly.
Ok, so making progress: the following lines run Sarge on a single tree from mspms
mspms 100 1 --mutation-rate 1e-4 --population-size 1 10000 -T --random-seed 1 2 3 > testoutput.ms python2 utilities/ms2sarge_input.py -m testoutput.ms -o test_sarge_input --num_bases 100 bin/sarge --sites test_sarge_input.sites --input test_sarge_input.geno.gz -o test_output_sarge bin/index output.sarge seq:1-100 | bin/trees2newick -v output_test.haps > test_newick_output
I think the best plan here is to take two output files: newick tree and the recomb
file:
which is a tab separated file of chromosome, start position, end position, upstream clade, downstream clade, and moving clade.
and then make an ARG out of it with tskit
Currently having an issue when using mspms
with recombination, see the issue I've made in the Sarge repo, but once this is solved I make a conversion script.
Sounds great @awohns! Are the internal nodes in the SARGE newick labelled? Sounds like we're going to be constructing a tree sequence then from (essentially) a set of edge diffs?
Here's output for a single tree, doesn't look like it internal nodes are labeled
((Human-1-1:0.22222,(Human-1-2:0.11111,Human-2-1:0):0.11111):0.77778,Human-2-2:1.00000):0.00000;
And yes, I guess it'll look like edge diffs...
Hmmm - this really could be quite tricky. I bet there's internal assumptions about the ordering of nodes which are used to compute the source/dest IDs of the clades. I wonder if time is better spent elsewhere, since we can only show so many methods anyway?
You're probably right. Just to close the loop, this is what the output looks like with multiple trees.
ms and Sarge commands:
ms 4 1 -t 10.04 -r 100.0 10 > outputms.txt python2 utilities/ms2sarge_input.py -m outputms.txt -o test_sarge_input --num_bases 100 bin/sarge --sites test_sarge_input.sites --input test_sarge_input.geno.gz -o test_output_sarge cat test_output_sarge | bin/trees2newick -v output_test.haps > test_newick_output
The newick tree output looks like this:
seq 1 ((Human-1-2:0.50000,(Human-2-1:0,Human-2-2:0):0.00000):0.66667,Human-1-1:0.00000):0.00000; seq 13 ((Human-1-2:0.50000,(Human-2-1:0,Human-2-2:0):0.00000):0.66667,Human-1-1:0.00000):0.00000; seq 20 (Human-1-2:1.00000,((Human-2-1:0,Human-2-2:0):0.16667,Human-1-1:0.00000):0.90909):0.00000; seq 21 (Human-1-2:1.00000,((Human-2-1:0,Human-2-2:0):0.16667,Human-1-1:0.00000):0.90909):0.00000; seq 22 (Human-1-2:1.00000,((Human-2-1:0,Human-2-2:0):0.16667,Human-1-1:0.00000):0.90909):0.00000; seq 30 (Human-1-2:1.00000,((Human-2-1:0,Human-2-2:0):0.16667,Human-1-1:0.00000):0.90909):0.00000; seq 51 (Human-1-2:1.00000,((Human-2-1:0,Human-2-2:0):0.16667,Human-1-1:0.00000):0.90909):0.00000; seq 59 (Human-1-2:1.00000,((Human-2-1:0,Human-2-2:0):0.16667,Human-1-1:0.00000):0.90909):0.00000; seq 62 (Human-1-2:1.00000,((Human-2-1:0,Human-2-2:0):0.16667,Human-1-1:0.00000):0.90909):0.00000; seq 63 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20000,Human-1-1:0.00000):0.88889):0.00000; seq 69 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20000,Human-1-1:0.00000):0.88889):0.00000; seq 70 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20000,Human-1-1:0.00000):0.88889):0.00000; seq 74 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20000,Human-1-1:0.00000):0.88889):0.00000; seq 75 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20000,Human-1-1:0.00000):0.88889):0.00000; seq 79 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20000,Human-1-1:0.00000):0.88889):0.00000; seq 92 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20000,Human-1-1:0.00000):0.88889):0.00000; seq 95 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20000,Human-1-1:0.00000):0.88889):0.00000; seq 98 ((Human-2-1:0,Human-2-2:0):1.00000,(Human-1-2:0.20005,Human-1-1:0.00000):0.88886):0.00000;
I think there's a tree for every site by the way (hence all the duplicated trees).
The "recombination" file looks like this:
seq 13 20 1,2,3, 0,2,3, 2,3, seq 62 63 0,2,3, 0,1, 0,
The columns are chromosome, start position, end position, upstream clade, downstream clade, and moving clade.
It was definitely helpful to get the information here @awohns, so thanks for that. Totally up to you whether you want to pursue getting the conversion to work.
I agree that it likely won't be very productive it will be to pursue this further, so I'm happy to leave it!
I imagine it would be useful to have a means of converting genealogies inferred by sarge to the tree sequence format for the figure comparing the output of inference methods? I don't think this has been done by anyone (doesn't look like it's in the sarge repo) so I'm happy to have a go at it.