Open lutteropp opened 3 years ago
I installed Aliview and made a screenshot of what the simulated MSA looks like... turns out 100 sites provide not enough signal for 10 taxa.
Maybe we should make the number of sites to simulate dependent on the number of taxa?
Things people wrote about this issue in the slack already (I had another dataset there, but the issue is the same):
benoit: if you want to increase the signal in the sequences without adding more sites, you can also check that there are enough (or not too many) substitutions in the simulated MSAs, and tune the substitution rates or simulation time accordingly
benoit: it's hard to tell from the raw values, I would really manually look at the simulated sequences. Also, depending on how your simulator work, you might not have direct control over the simulation time, but you can add a step (before running seq-gen) to normalize the (true) branch lenghts. Then you can tune this normalized value to increase/decrease the average number of substitutions.
benoit: another reason for not finding the true tree with raxml is to have one or a few very small branch lengths in the true tree (you can easily check that), making those regions of the tree very hard to resolve
benoit: I am sure @alexey knows better, but I would say that reaching the min branch length with 500 sites and 9 taxa indicates that raxml-ng has not enough information and arbitrarily/randomly resolved this part of the tree.
alexey: yes, how many sites out of 500 are non-invariant? do you get a star tree?
celine: Sarah, i did not tune at all the sequence simulation parameters, so it is possible that the sequences are not informative. You should post the alignment
Turns out the 100 MSA sites were a typo, the simulation setting we agreed upon had 500 or 1000 sites (https://github.com/lutteropp/NetRAX/issues/3#issuecomment-730342835)... But still, the same scaling issue is expected to happen - as soon as the number of taxa gets too large.
This issue is very low priority. I am printing the number of near-zero branches in raxml-ng best ML tree now in the results CSV file, turns out this happens very rarely. The issue might still raise again, if we increase the number of taxa by too much.
Hi,
I do not think it depends on the number of taxa but simply the length of the branches in the network (and thus of the displayed trees). If the problem appears again, we may want to boost a bit the substitutions using the following seq-gen option. I also attach the seq-gen manual because it is good you understand what you are simulating ;)
Seq-Gen: Simulation of molecular sequences.pdf
Scale branch lengths
This option allows the user to set a value with which to scale the branch lengths in order to make them equal the expected number of substitutions per site for each branch. Basically Seq-Gen multiplies each branch length by this value.
-s <SCALE>
Where
I agree with Celine
On 30.11.20 11:49, celinescornavacca wrote:
Hi,
I do not think it depends on the number of taxa but simply the length of the branches in the network. If the problem appears again, we may want to boost a bit the
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/lutteropp/NetRAX/issues/11#issuecomment-735677153, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXB6UZC3W5LLZFWEE3SLLSSNTB3ANCNFSM4UGFFUOQ.
-- Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
RAxML-NG gives:
WARNING: Best ML tree contains 2 near-zero branches!
The sequences were simulated with Seq-Gen using
cmd = 'seq-gen -mHKY -t3.0 -f0.3,0.2,0.2,0.3 -l' + str(dataset.msa_size)+'-p' + str(dataset.n_trees)+' < ' + dataset.extracted_trees_path + ' > ' + dataset.msa_path
The "true", simulated tree (I had to change its file ending): 0_10_taxa_0_reticulations_SamplingType.PERFECT_SAMPLING_sampling_LikelihoodType.AVERAGE_likelihood_100_msasize_true_network.nw.txt
The MSA file: 0_10_taxa_0_reticulations_SamplingType.PERFECT_SAMPLING_sampling_LikelihoodType.AVERAGE_likelihood_100_msasize_msa.txt
The RAxML-NG log file: 0_10_taxa_0_reticulations_SamplingType.PERFECT_SAMPLING_sampling_LikelihoodType.AVERAGE_likelihood_100_msasize.raxml.log
The tree inferred by RAxML-NG (I had to change its file ending): 0_10_taxa_0_reticulations_SamplingType.PERFECT_SAMPLING_sampling_LikelihoodType.AVERAGE_likelihood_100_msasize.raxml.bestTree.txt
The tree inferred by RAxML-NG, with near-zero branch lengths being collapsed (I had to change its file ending): 0_10_taxa_0_reticulations_SamplingType.PERFECT_SAMPLING_sampling_LikelihoodType.AVERAGE_likelihood_100_msasize.raxml.bestTreeCollapsed.txt