rambaut / Seq-Gen

Sequence simulator

Error reading tree number 1: Closing bracket missing. #18

Open lskatz opened 3 years ago

lskatz commented 3 years ago

Similar to #9, but I can't solve it with a regex. I downloaded the Nextstrain global tree for nCoV (Jan 4, 2021) and wanted to run TreeToReads.py with it (Newick attached below). However, the seq-gen step gives the closing bracket error. I have tried a variety of things, including renaming the taxa and resolving multifurcations:

perl -MBio::TreeIO -e '$tree=Bio::TreeIO->new(-file=>"nextstrain_ncov_global_tree.nwk")->next_tree; for($tree->get_nodes){$i++; if($_->is_Leaf){$_->id("TAXON$i");} else {$_->id("");} } print $tree->as_text("newick")."\n";' | gotree resolve > anonymized.nwk

And breaking apart long lines:

cat anonymized.nwk | perl -plane 's/(.{50,}?,)/\1\n/g' > tmp.nwk

This is my seq-gen command (I change the stdin file accordingly):

seq-gen -l768000 -n1 -mGTR -a5.0 -r0.25,0.82,0.15,0.27,2.99,1.00 -f0.299236590102,0.183687135874,0.196176253934,0.32090002009 -or < tmp.nwk

But nothing seems to help so far. Any ideas?

nextstrain_ncov_global_tree.zip
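For anyone without BioPerl installed, the renaming step above can be approximated in plain Python. This is only a rough sketch, not what I actually ran: `anonymize_newick` is a made-up helper name, it numbers leaves only (the Perl one-liner counts every node), it does not resolve multifurcations, and it assumes plain Newick with no quoted labels and no bracketed comments.

```python
import re


def anonymize_newick(newick):
    """Rename leaves to TAXON1, TAXON2, ... and drop internal node labels.

    Assumption: plain Newick, with no quoted labels and no [bracketed] comments.
    """
    # Remove all whitespace so labels never span line breaks.
    compact = "".join(newick.split())

    # An internal node label is any label text directly after ")".
    no_internal = re.sub(r"\)[^(),:;]+", ")", compact)

    counter = 0

    def rename(match):
        nonlocal counter
        counter += 1
        # Keep the "(" or "," that precedes the leaf label.
        return f"{match.group(1)}TAXON{counter}"

    # A leaf label is label text directly after "(" or ",".
    return re.sub(r"([(,])[^(),:;]+", rename, no_internal)


if __name__ == "__main__":
    print(anonymize_newick("((A:0.1,B:0.2)n1:0.3,C:0.4)root;"))
```

Branch lengths are left untouched, so the output should still be usable as input to `gotree resolve`.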

lskatz commented 3 years ago

I also tried opening it in FigTree, which worked. I then saved it as Nexus and converted it back to Newick. That Newick file also did not work in seq-gen.
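Since the error message talks about a missing closing bracket, one quick sanity check is to count the parentheses directly. The sketch below is only a diagnostic under simple assumptions: `bracket_balance` is a hypothetical helper, and it treats every `(` and `)` as tree structure, so quoted labels or comments containing parentheses would throw it off.

```python
def bracket_balance(newick):
    """Return (opens, closes, first_bad_index) for the parentheses in a Newick string.

    first_bad_index is the position of the first ")" that has no matching "(",
    or None if there is no such position.
    """
    depth = 0
    opens = closes = 0
    first_bad = None
    for i, ch in enumerate(newick):
        if ch == "(":
            opens += 1
            depth += 1
        elif ch == ")":
            closes += 1
            depth -= 1
            # Record only the first place where a ")" has nothing to close.
            if depth < 0 and first_bad is None:
                first_bad = i
    return opens, closes, first_bad


if __name__ == "__main__":
    # Small stand-in string; read the real file with open("tmp.nwk").read().
    print(bracket_balance("(A:0.1,B:0.1,C:0.1);"))
```

If the two counts match, the message is probably triggered by something else in the file rather than by a bracket that is literally missing.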

lskatz commented 3 years ago

I tried a different tree file just to be sure the issue is with the tree, and I think it is:

[gzu2@monolith3 nextstrain-2020-01-04]$ echo '(A:0.1,B:0.1,C:0.1);' > tmp.nwk
[gzu2@monolith3 nextstrain-2020-01-04]$ seq-gen -l768000 -n1 -mGTR -a5.0 -r0.25,0.82,0.15,0.27,2.99,1.00 -f0.299236590102,0.183687135874,0.196176253934,0.32090002009 -or < tmp.nwk  | goalign stats --auto-detect
Sequence Generator - seq-gen
Version 1.3.4
(c) Copyright, 1996-2017 Andrew Rambaut and Nick Grassly
Institute of Evolutionary Biology, University of Edinburgh

Originally developed at:
Department of Zoology, University of Oxford

Random number generator seed: -1600840904040947534

Simulations of 3 taxa, 768000 nucleotides
  for 1 tree(s) with 1 dataset(s) per tree

Branch lengths assumed to be number of substitutions per site

Continuous gamma rate heterogeneity:
    shape = 5.000000
Model = GTR: General time reversible (nucleotides)
  Rate of transitions and transversions equal:
  rate matrix = gamma1: 0.2500 alpha1: 0.8200  beta1: 0.1500
                                beta2: 0.2700 alpha2: 2.9900
                                              gamma2:  1.0000
  with nucleotide frequencies specified as:
  A=0.299237 C=0.183687 G=0.196176 T=0.3209

Time taken: 0.35 seconds
length  768000
nseqs   3
avgalleles      1.2478
variable sites  181432
char    nb      freq
A       688699  0.298914
C       423453  0.183790
G       452789  0.196523
T       739059  0.320772
alphabet        nucleotide

lskatz commented 3 years ago

seq-gen does not like some variations in Newick! This fixed my tree. I think seq-gen needed some combination of the following to be fixed: