lutteropp / NetRAX

Phylogenetic Network Inference without ILS
GNU General Public License v3.0
17 stars 1 forks source link

100 simulated MSA sites on 10-taxon-tree is not enough for RAxML #11

Open lutteropp opened 3 years ago

lutteropp commented 3 years ago

RAxML-NG gives: WARNING: Best ML tree contains 2 near-zero branches!

The sequences were simulated with Seq-Gen using cmd = 'seq-gen -mHKY -t3.0 -f0.3,0.2,0.2,0.3 -l' + str(dataset.msa_size)+'-p' + str(dataset.n_trees)+' < ' + dataset.extracted_trees_path + ' > ' + dataset.msa_path

The "true", simulated tree (I had to change its file ending): 0_10_taxa_0_reticulations_SamplingType.PERFECT_SAMPLING_sampling_LikelihoodType.AVERAGE_likelihood_100_msasize_true_network.nw.txt

The MSA file: 0_10_taxa_0_reticulations_SamplingType.PERFECT_SAMPLING_sampling_LikelihoodType.AVERAGE_likelihood_100_msasize_msa.txt

The RAxML-NG log file: 0_10_taxa_0_reticulations_SamplingType.PERFECT_SAMPLING_sampling_LikelihoodType.AVERAGE_likelihood_100_msasize.raxml.log

The tree inferred by RAxML-NG (I had to change its file ending): 0_10_taxa_0_reticulations_SamplingType.PERFECT_SAMPLING_sampling_LikelihoodType.AVERAGE_likelihood_100_msasize.raxml.bestTree.txt

The tree inferred by RAxML-NG, with near-zero branch lengths being collapsed (I had to change its file ending): 0_10_taxa_0_reticulations_SamplingType.PERFECT_SAMPLING_sampling_LikelihoodType.AVERAGE_likelihood_100_msasize.raxml.bestTreeCollapsed.txt

lutteropp commented 3 years ago

Screenshot from 2020-11-29 02-29-36 I installed Aliview and made a screenshot of what the simulated MSA looks like... turns out 100 sites provide not enough signal for 10 taxa.

Maybe we should make the number of sites to simulate dependent on the number of taxa?

lutteropp commented 3 years ago

Things people wrote about this issue in the slack already (I had another dataset there, but the issue is the same):

benoit: if you want to increase the signal in the sequences without adding more sites, you can also check that there are enough (or not too many) substitutions in the simulated MSAs, and tune the substitution rates or simulation time accordingly

benoit: it's hard to tell from the raw values, I would really manually look at the simulated sequences. Also, depending on how your simulator work, you might not have direct control over the simulation time, but you can add a step (before running seq-gen) to normalize the (true) branch lenghts. Then you can tune this normalized value to increase/decrease the average number of substitutions.

benoit: another reason for not finding the true tree with raxml is to have one or a few very small branch lengths in the true tree (you can easily check that), making those regions of the tree very hard to resolve

benoit: I am sure @alexey knows better, but I would say that reaching the min branch length with 500 sites and 9 taxa indicates that raxml-ng has not enough information and arbitrarily/randomly resolved this part of the tree.

alexey: yes, how many sites out of 500 are non-invariant? do you get a star tree?

celine: Sarah, i did not tune at all the sequence simulation parameters, so it is possible that the sequences are not informative. You should post the alignment

lutteropp commented 3 years ago

Turns out the 100 MSA sites were a typo, the simulation setting we agreed upon had 500 or 1000 sites (https://github.com/lutteropp/NetRAX/issues/3#issuecomment-730342835)... But still, the same scaling issue is expected to happen - as soon as the number of taxa gets too large.

lutteropp commented 3 years ago

This issue is very low priority. I am printing the number of near-zero branches in raxml-ng best ML tree now in the results CSV file, turns out this happens very rarely. The issue might still raise again, if we increase the number of taxa by too much.

celinescornavacca commented 3 years ago

Hi,

I do not think it depends on the number of taxa but simply the length of the branches in the network (and thus of the displayed trees). If the problem appears again, we may want to boost a bit the substitutions using the following seq-gen option. I also attach the seq-gen manual because it is good you understand what you are simulating ;)

Seq-Gen: Simulation of molecular sequences.pdf

Scale branch lengths

This option allows the user to set a value with which to scale the branch lengths in order to make them equal the expected number of substitutions per site for each branch. Basically Seq-Gen multiplies each branch length by this value.

-s <SCALE>

Where is a decimal number greater than zero. For example if you give an value of 0.5 then each branch length would be halved before using it to simulate the sequences.

stamatak commented 3 years ago

I agree with Celine

On 30.11.20 11:49, celinescornavacca wrote:

Hi,

I do not think it depends on the number of taxa but simply the length of the branches in the network. If the problem appears again, we may want to boost a bit the

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/lutteropp/NetRAX/issues/11#issuecomment-735677153, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXB6UZC3W5LLZFWEE3SLLSSNTB3ANCNFSM4UGFFUOQ.

-- Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org