Discussion of Experimental Results for "small_network" (4-10 taxa, 1-2 reticulations)

lutteropp commented 3 years ago

The entire data (including the plots) is in here: https://drive.google.com/drive/folders/1l6lOPbiyjDKi9bgTqGwD6zXzjvny_pCZ?usp=sharing

The experiments were done on my own laptop (LG Gram 17, 2019 model) using this commit version: https://github.com/lutteropp/NetRAX/commit/e2cf9eb072714633cea0b495ef8eec09862041ea.

I decided against re-uploading each plot in this issue, as it made a messy message and did not provide any additional insight than the plots folder in the GoogleDrive link. Also, I am still adding new plots and improving the plots.

lutteropp commented 3 years ago

Observations from this experiment:

NetRAX does similarly well as raxml-ng.
On all datasets, raxml-ng inferred no near-zero-branches in its ML tree.
The BIC score does not see enough signal for supporting any reticulations in the simulated data. Thus, NetRAX inferred zero reticulations for all datasets, while scoring a better BIC than the BIC of the ground-truth simulated network.

Looks like we need to tweak the MSA/sequence simulation somehow.

lutteropp commented 3 years ago

I am also not 100% excluding the possibility of a bug in the partitioned MSA creation (when giving it multiple trees) with SeqGen.

I will do an alternative experiment to verify the current MSA generation setup. This alternative experiment consists of:

Create MSAs for each displayed tree in separate SeqGen calls.
Merge these MSAs into one big partitioned MSA by hand.
Re-run NetRAX on such a dataset.
Verify that again the BIC sees no supporting signal for reticulations.

celinescornavacca commented 3 years ago

Can you post here your definition of BIC you use here? BIC = -2 lnL + k ln n

what are k and n in here? (We discussed it in slack but I cannot find it!) k is the branch in the network and n=number of sites * number of taxa?

So for a reticulation (k=1), 1000 sites and 5 taxa
diff BIC (tree, network) = -2 lnL_tree + 2 lnL_net - k ln n = -2 (lnL_tree - lnL_net ) - ln(5000)

where ln(5000) = 8.517193191416238

I cannot check if anything is wrong, because I only have the relative diff of lnL. If the diff between the lnL of tree and network is too small, then we cannot favour the network.

lutteropp commented 3 years ago

@celinescornavacca

double bic(double logl, double k, double n) {
    return -2 * logl + k * log(n);
}

logl = the loglikelihood of the network.
k = get_param_count(ann_network)
n = get_sample_size(ann_network)

The number of parameters k is: (number of parameters in the likelihood model) + (number of branches) * multiplier, where multiplier is 1 if the partitions share the same branch lengths and number_of_partitions otherwise. See this code for its computation:

size_t get_param_count(AnnotatedNetwork& ann_network) {
    Network &network = ann_network.network;
    bool unlinked_mode = (ann_network.fake_treeinfo->brlen_linkage == PLLMOD_COMMON_BRLEN_UNLINKED);
    size_t multiplier = (unlinked_mode) ? 1 : ann_network.fake_treeinfo->partition_count;
    size_t param_count = multiplier * network.num_branches()
            + ann_network.total_num_model_parameters;
    return param_count;
}

The number of parameters in the likelihood model is some number I get from libpll. It is ann_network.total_num_model_parameters = instance.parted_msa->total_free_model_params(); which reduces to the sum of free model parameters among all partitions.

The sample size is the number of sites in the MSA times the number of taxa, see:

size_t get_sample_size(AnnotatedNetwork& ann_network) {
return ann_network.total_num_sites * ann_network.network.num_tips();
}

The raw BIC and logl values for all runs can be found in this table: https://drive.google.com/file/d/1His_wi_i0Dz3E9ye6HFz6b1dhdMhnlZ8/view?usp=sharing

lutteropp commented 3 years ago

I did some digging and searching in the slack...

Some maybe relevant old Slack messages from @stamatak and @celinescornavacca about computing the number of parameters: @stamatak said: This is more complicated as some standard phylogenetic model tests take br-lens into account as free parameters and some don't, I believe in our case we should apply both strategies, that is, take br-lens into account and then also just ignore them @celinescornavacca said: If the parameters of the GTR model stay the same, the cancel out in the delta BIC

Back then we also decided to not count the reticulation probabilities (the "gammas", as Celine called them in the slack message) as free parameters, since we estimate them similar to how the NEPAL paper does it.

lutteropp commented 3 years ago

@celinescornavacca Here are some example values from the CSV (https://drive.google.com/file/d/1His_wi_i0Dz3E9ye6HFz6b1dhdMhnlZ8/view?usp=sharing), such that you don't have to open the table yourself:

for dataset datasets_small_network/0_5_taxa_1_reticulations_SamplingType.PERFECT_UNIFORM_SAMPLING_sampling_LikelihoodType.AVERAGE_likelihood_1000_msasize and running NetRAX starting from best raxml-ng tree (row 7 in the CSV file):
n_reticulations: 1
bic_true: 8146.9
logl_true: -3988.28
n_reticulations_inferred: 0
bic_inferred: 8109.29
logl_inferred: -3982.25

Mhmmm... interesting. Already the loglikelihood of the inferred 0-reticulation network is better than the loglikelihood of the simulated 1-reticulation network. This can happen due to different brlen optimization, different model optimization, or the simulated MSA supporting a different topology than the "true" one.

(EDIT: Moved problem with LikelihoodType.BEST to a new GitHub issue)

lutteropp commented 3 years ago

I moved some observations about LikelihoodType.BEST to a new GitHub issue, to be please discussed there and not in this issue here: https://github.com/lutteropp/NetRAX/issues/20

lutteropp commented 3 years ago

At least it is pretty obvious now that the problem does not have to do with the BIC.

Instead, it gets more and more clear that something is fishy with the simulated MSA...

celinescornavacca commented 3 years ago

I think that there is something strange here. As I wrote once in the slack, the diff of BIC between a tree and a network is the following:

diff BIC (tree, network) = -2 lnL_tree + 2 lnL_net - k ln n = -2 (lnL_tree - lnL_net ) - k ln(MSA size)

where k is the number of parameters in the network model which are not in the tree model. We do not consider inheritance probs, so k should 3 for an arc insertion (3 new branches to optimize) and 5 for a reticulation insertion (5 new branches to optimize). But I am far from getting this (see columns S-V I added in the file you sent me)

Unfortunately, we did not put the number of free parameters in the summary file, nor we put the inheritance probabilities (if the inheritance probabilities are 0.07 and 0.93, it is a bit normal that we do not find the network). I am afraid we need this information, but I would wait for @stamatak to give his opinion too.

lutteropp commented 3 years ago

@celinescornavacca I cannot see any newly added columns in the CSV file.

What is a reticulation insertion? I thought an arc insertion introduces a new reticulation (which means, 1 new node and 3 new edges). I am not aware of any "reticulation insertion" move then...

Unfortunately, we did not put the number of free parameters in the summary file, nor we put the inheritance probabilities (if the inheritance probabilities are 0.07 and 0.93, it is a bit normal that we do not find the network).

Well it's easy to get that information, I'll just initiate a new single NetRAX run on one of the datasets, with adding some more debug output printing of the values you want. After all, I have kept all the raw data here, including simulated topologies and simulated MSAs!

... But still, this is not a BIC issue. Just look at the loglikelihood values! The inferred 0-reticulation network has a better loglikelihood than the "true" 1-reticulation network!

lutteropp commented 3 years ago

These are the only number-of-reticulations-changing moves I implemented in NetRAX Screenshot from 2020-12-01 14-48-24

lutteropp commented 3 years ago

Also, I still want to double-check if the MSA is generated correctly (using the methodology I proposed here: https://github.com/lutteropp/NetRAX/issues/19#issuecomment-736420010). Because, even if I choose SamplingType.PERFECT_UNIFORM_SAMPLING (which essentially overwrites all reticulation probabilities to be 0.5 before simulating the sequences), we end up with an inferred tree with better loglikelihood than the simulated network...

lutteropp commented 3 years ago

It could very well be that the MSA currently generated with calling SeqGen is saying "ditch this simulated network with its multiple displayed trees, I am deciding to support only a single tree topology with my data" - either through a bug or through not enough information in the sequences!

Another verification experiment I will try:

Use raxml-ng on each partition of the MSA generated by seq-gen
For each raxml-ng-on-partition-belonging-to-a-displayed-tree run, check if raxml-ng indeed inferred that displayed tree.

celinescornavacca commented 3 years ago

Just look at the loglikelihood values! The inferred 0-reticulation network has a better loglikelihood than the "true" 1-reticulation network!

This can be due to the inheritance probabilities, as I said before, so we need them too (if the network is almost a tree because it has one inheritance prob near to zero, this can happen).

... But still, this is not a BIC issue.

Still, the BIC diff is not what I expect, since you confirmed that k=3 and we do not have this in the CVS file

lutteropp commented 3 years ago

This can be due to the inheritance probabilities

not in the SamplingType.PERFECT_UNIFORM_SAMPLING case, where the reticulation probabilities are ignored and each reticulation is made equally likely before simulating the MSA...

Anyway... Let's split this into 2 new GitHub issues, to tidy up the discussion here. :-)

One GitHub issue for verifying that everything is fine with the MSA.
One GitHub issue for verifying that everything is fine with the BIC.

I am creating those GitHub issues now.

lutteropp commented 3 years ago

I have created a GitHub issue for the MSA discussion: https://github.com/lutteropp/NetRAX/issues/21 And here is a GitHub issue for the BIC dicussion: https://github.com/lutteropp/NetRAX/issues/22

But one note to the BIC discussion from here...: The logl/BIC value of the 0-reticulation inferred network and of the 1-reticulation simulated network I posted earlier are not easily comparable. Their topologies may differ by more than just 1 arc insertion!

celinescornavacca commented 3 years ago

But not their number of parameters

On December 1, 2020 3:27:42 PM GMT+01:00, Sarah Lutteropp notifications@github.com wrote:

I have created a GitHub issue for the MSA discussion: https://github.com/lutteropp/NetRAX/issues/21 And here is a GitHub issue for the BIC dicussion: https://github.com/lutteropp/NetRAX/issues/22

But one note to the BIC discussion from here...: The logl/BIC value of the 0-reticulation inferred network and of the 1-reticulation simuated network are not easily comparable. Their topologies may differ by more than just 1 more reticulation!

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/lutteropp/NetRAX/issues/19#issuecomment-736585820

-- Sent from my Android device with K-9 Mail. Please excuse my brevity.

stamatak commented 3 years ago

we should make the MSAs longer to see if this reduced the difference between logl_true and logl_inferred in our next face to face we should also discuss the number of free parameters again, as I stated back then, there are different ways of counting them

On 01.12.20 14:16, Sarah Lutteropp wrote:

@celinescornavacca https://github.com/celinescornavacca Here are some example values from the CSV, such that you don't have to open the table yourself:

for dataset datasets_small_network/0_5_taxa_1_reticulations_SamplingType.PERFECT_UNIFORM_SAMPLING_sampling_LikelihoodType.AVERAGE_likelihood_1000_msasize and running NetRAX starting from best raxml-ng tree (row 7 in the CSV file):

n_reticulations: 1

bic_true: 8146.9

logl_true: -3988.28

n_reticulations_inferred: 0

bic_inferred: 8109.29

logl_inferred: -3982.25

Mhmmm... interesting. Already the loglikelihood of the inferred 0-reticulation network is better than the loglikelihood of the simulated 1-reticulation network. This can happen due to different brlen optimization, different model optimization, or the simulated MSA supporting a different topology than the "true" one.

By the way, this will nearly always be the case if we use LikelihoodModel.BEST, as there we only use the loglikelihood of the best displayed tree in the network, made smaller by the probability of that displayed tree!!! This kinda tells us that LikelihoodModel.BEST NEVER makes sense in phylogenetic network inference if we use BIC.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/lutteropp/NetRAX/issues/19#issuecomment-736515669, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXB6VJNRIWSCW6JJFQGTLSSTNDPANCNFSM4UI2SMHA.

-- Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

lutteropp / NetRAX

Discussion of Experimental Results for "small_network" (4-10 taxa, 1-2 reticulations) #19