lutteropp / NetRAX

Phylogenetic Network Inference without ILS
GNU General Public License v3.0
17 stars 1 forks source link

Verification of the BIC Computation #22

Closed lutteropp closed 3 years ago

lutteropp commented 3 years ago

Because of problems that emerged with the "small_network" experiments (BIC-difference questions), here is the new GitHub issue for verifying the BIC-Scores...

I will post BIC-score information (plus additional data needed for the verification) from a NetRAX run here, where we can be sure that the only difference between the 0-reticulation-network and the 1-reticulation-network lies in a single arc-insertion move that has been performed.

celinescornavacca commented 3 years ago

Great, with the number of free parameters in hand, I will be able to understand a bit more. It would be great to have the detail for both the tree and the network of all these parameters:

multiplier network.num_branches ann_network.total_num_model_parameters ann_network.total_num_sites ann_network.network.num_tips

lutteropp commented 3 years ago

Here is the entire command line output of an experiment (on commit version https://github.com/lutteropp/NetRAX/commit/9e66661a83ca2a8d71953d905ade647a735d2b10) I just ran now. I simulated a network with 4 taxa and 1 reticulation, and started the network search from the best raxml-ng tree. cmd_output_4_taxa_1_reticulation_debug_experiment.txt

With SamplingType.PERFECT_SAMPLING (which takes into account reticulation probabilities), SeqGen simulated two partitions of unequal size:

Simulations of 4 taxa, 2001 nucleotides
  for 2 tree(s) with 1 dataset(s) per tree
  and 2 partitions (and trees) per dataset
    Partition  No. Sites  Relative Rate
       1          1824    1.000000
       2           177    1.000000

With SamplingType.PERFECT_UNIFORM_SAMPLING (which ignores retiulation probabilities), SeqGen simulated two partitions of equal size:

Simulations of 4 taxa, 2000 nucleotides
  for 2 tree(s) with 1 dataset(s) per tree
  and 2 partitions (and trees) per dataset
    Partition  No. Sites  Relative Rate
       1          1000    1.000000
       2          1000    1.000000

I extracted some lines of this large file, which should be of interest here: (line 2779 to 2802, here I used SamplingType.PERFECT_SAMPLING and LikelihoodType.AVERAGE)

Extensive BIC info before applying current ArcInsertionMove move 13/ 26:
 multiplier: 1
 network.num_reticulations: 0
 network.num_branches: 6
 ann_network.total_num_model_parameters: 9
 ann_network.total_num_sites: 2001
 ann_network.network.num_tips: 4
 number of partitions in the MSA: 1
 sample_size n: 8004
 param_count k: 15
 logl: -5670.26
 bic: 11475.3
Extensive BIC info after applying current ArcInsertionMove move 13/ 26:
 multiplier: 1
 network.num_reticulations: 1
 network.num_branches: 9
 ann_network.total_num_model_parameters: 9
 ann_network.total_num_sites: 2001
 ann_network.network.num_tips: 4
 number of partitions in the MSA: 1
 sample_size n: 8004
 param_count k: 18
 logl: -5670.95
 bic: 11503.7

(line 5284 to 5307, here used SamplingType.PERFECT_SAMPLING and LikelihoodType.BEST)

Extensive BIC info before applying current ArcInsertionMove move 26/ 26:
 multiplier: 1
 network.num_reticulations: 0
 network.num_branches: 6
 ann_network.total_num_model_parameters: 9
 ann_network.total_num_sites: 2001
 ann_network.network.num_tips: 4
 number of partitions in the MSA: 1
 sample_size n: 8004
 param_count k: 15
 logl: -5476.83
 bic: 11088.5
Extensive BIC info after applying current ArcInsertionMove move 26/ 26:
 multiplier: 1
 network.num_reticulations: 1
 network.num_branches: 9
 ann_network.total_num_model_parameters: 9
 ann_network.total_num_sites: 2001
 ann_network.network.num_tips: 4
 number of partitions in the MSA: 1
 sample_size n: 8004
 param_count k: 18
 logl: -5477.52
 bic: 11116.8

(line 7237 to 7260, here I used SamplingType.PERFECT_UNIFORM_SAMPLING and LikelihoodType.AVERAGE)

Extensive BIC info before applying current ArcInsertionMove move 26/ 26:
 multiplier: 1
 network.num_reticulations: 0
 network.num_branches: 6
 ann_network.total_num_model_parameters: 9
 ann_network.total_num_sites: 2000
 ann_network.network.num_tips: 4
 number of partitions in the MSA: 1
 sample_size n: 8000
 param_count k: 15
 logl: -5451.47
 bic: 11037.7
Extensive BIC info after applying current ArcInsertionMove move 26/ 26:
 multiplier: 1
 network.num_reticulations: 1
 network.num_branches: 9
 ann_network.total_num_model_parameters: 9
 ann_network.total_num_sites: 2000
 ann_network.network.num_tips: 4
 number of partitions in the MSA: 1
 sample_size n: 8000
 param_count k: 18
 logl: -5452.16
 bic: 11066.1

(line 9190 to 9213, here I used SamplingType.PERFECT_UNIFORM_SAMPLING and LikelihoodType.BEST)

Extensive BIC info before applying current ArcInsertionMove move 26/ 26:
 multiplier: 1
 network.num_reticulations: 0
 network.num_branches: 6
 ann_network.total_num_model_parameters: 9
 ann_network.total_num_sites: 2000
 ann_network.network.num_tips: 4
 number of partitions in the MSA: 1
 sample_size n: 8000
 param_count k: 15
 logl: -5366.1
 bic: 10867
Extensive BIC info after applying current ArcInsertionMove move 26/ 26:
 multiplier: 1
 network.num_reticulations: 1
 network.num_branches: 9
 ann_network.total_num_model_parameters: 9
 ann_network.total_num_sites: 2000
 ann_network.network.num_tips: 4
 number of partitions in the MSA: 1
 sample_size n: 8000
 param_count k: 18
 logl: -5366.79
 bic: 10895.4
lutteropp commented 3 years ago

This run let to the discovery of two bugs:

Anyway, the reported data I posted above shows that without having partitions in the MSA, the BIC does not support any reticulations... (an observation which is obvious for LikelihoodType.BEST, but less obvious for LikelihoodType.AVERAGE)

EDIT: At least for LikelihoodType.AVERAGE, the issue could also be some bug in the loglikelihood computation here. Need to double-check this.

lutteropp commented 3 years ago

Cross-reference to how BIC is currently computed: https://github.com/lutteropp/NetRAX/issues/19#issuecomment-736502937

lutteropp commented 3 years ago

I am still surprised about why the loglikelihood score got worse after trying an arc insertion. Shouldn't more reticulations always lead to a better-or-equal loglikelihood score?

celinescornavacca commented 3 years ago

Let' s wait for the partition bug to be out of our way before trying to understand this arc insertion problem, OK?

stamatak commented 3 years ago

I'd agree with Celine, my guess is that the key factor here is that br-lens are not optimized individually on a per-partition basis

alexis

On 04.12.20 11:39, celinescornavacca wrote:

Let' s wait for the partition bug to be out of our way before trying to understand this arc insertion problem, OK?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/lutteropp/NetRAX/issues/22#issuecomment-738680723, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXB6XEUTJ77RRVECMTAQ3STCU4HANCNFSM4UJDHAAA.

-- Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

lutteropp commented 3 years ago

I found a mistake in the BIC computation: I had confused unlinked branch length mode and linked/scaled branch length mode. The multiplier should be 1 if we have linked/scaled branch length mode. If we have unlinked branch length mode, then the multiplier should be equal to the number of partitions.

I fixed this in https://github.com/lutteropp/NetRAX/commit/753d08ad5b8a813715a25e94a571153de5605ac6. It shouldn't have affected much though...

lutteropp commented 3 years ago

Alexey explained to me how Branch-length scalers work. It seems to me that if we have scaled branch lengths, the correct number of free parameters for the branch lengths should be number_of_branches + number_of_partitions. Because in scaled branch lengths mode, each partition as its own scaling factor that it uses to scale the otherwise shared branch lengths.

lutteropp commented 3 years ago

I fixed the computation of param_count for the BIC computation in https://github.com/lutteropp/NetRAX/commit/402fd046c5cae7e6aa927fbadb70fa24bee356fc. The corrected version is:

size_t get_param_count(AnnotatedNetwork& ann_network) {
    Network &network = ann_network.network;

    size_t param_count = ann_network.total_num_model_parameters;
    if (ann_network.fake_treeinfo->brlen_linkage == PLLMOD_COMMON_BRLEN_UNLINKED) {
        param_count += ann_network.fake_treeinfo->partition_count * network.num_branches();
        if (!ann_network.options.use_nepal_prob_estimation) { // reticulation probs as free parameters
            param_count += ann_network.fake_treeinfo->partition_count * ann_network.network.num_reticulations();
        }
    } else { // branch lengths are shared among partitions
        param_count += network.num_branches();
        if (!ann_network.options.use_nepal_prob_estimation) { // reticulation probs as free parameters
            param_count += ann_network.network.num_reticulations();
        }
        if (ann_network.fake_treeinfo->brlen_linkage == PLLMOD_COMMON_BRLEN_SCALED) {
            // each partition can scale the branch lengths by its own scaling factor
            param_count += ann_network.fake_treeinfo->partition_count - 1;
        }
    }
    return param_count;
}
lutteropp commented 3 years ago

I believe the BIC issue is fixed now. With the corrected param_count k, correct handling of partitioned MSAs, and the changed optimization routine for reticulation probabilities, the BIC for a simulated 4-taxon-network with 1 reticulation is now better than the BIC I get for the zero-reticulation best tree inferred by raxml-ng. It also appears that the MSA is fine.

There is still another problem in the NetRAX search (which leads to sometimes failed assertions, sometimes just inferred networks that score badly). I already have a clear idea of what is likely going wrong and am investigating it: https://github.com/lutteropp/NetRAX/issues/26

stamatak commented 3 years ago

Alexey's explanation is correct from my point of view

On 06.12.20 02:07, Sarah Lutteropp wrote:

Alexey explained to me how Branch-length scalers work. It seems to me that if we have scaled branch lengths, the correct number of free parameters for the branch lengths should be number_of_branches + number_of_partitions. Because in scaled branch lengths mode, each partition as its own scaling factor that it uses to scale the otherwise shared branch lengths.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/lutteropp/NetRAX/issues/22#issuecomment-739433739, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXB6R67IEHEXIACDOI4H3STLDLFANCNFSM4UJDHAAA.

-- Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

celinescornavacca commented 3 years ago

I agree with the computation for the linked and scaled modes, but I think that the gammas should be shared by all partitions in the unlinked mode.

stamatak commented 3 years ago

the gamma for rate heterogeneity? not necessarily, typically you asssume as separate gamma distribution for each partition as this only induces one extra parameter per partition,

alexis

On 07.12.20 11:01, celinescornavacca wrote:

I agree with the computation for the linked and scaled modes, but I think that the gammas should be shared by all partitions in the unlinked mode.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/lutteropp/NetRAX/issues/22#issuecomment-739778130, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXB6SIV3JTSVYUNO6TDULSTSKXJANCNFSM4UJDHAAA.

-- Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

lutteropp commented 3 years ago

When @celinescornavacca speaks about "the gammas", she means the reticulation probabilities

celinescornavacca commented 3 years ago

Yes, I do not agree with

ann_network.fake_treeinfo->partition_count * ann_network.network.num_reticulations()

stamatak commented 3 years ago

ah, okay ... now it makes sense, and yes they should be linked I believe, i.e., one set of reticul. probs. for the entire dataset

On 07.12.20 12:07, Sarah Lutteropp wrote:

I think when @celinescornavacca https://github.com/celinescornavacca speaks about "the gammas", she means the reticulation probabilities

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/lutteropp/NetRAX/issues/22#issuecomment-739815081, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXB6QE6XSA63OBSWKDCWDSTSSPLANCNFSM4UJDHAAA.

-- Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

lutteropp commented 3 years ago

Alright! I changed it in https://github.com/lutteropp/NetRAX/commit/ae12ca68eb9be8d3aff1ee31e50d37a147c4c94a. Reticulation probabilities are partition-independent now, no matter which brlen linkage mode is used.

lutteropp commented 3 years ago

Looks like we can close this issue now. BIC is behaving the way it should now, supporting the reticulations in the simulated datasets and everything.