Closed lutteropp closed 3 years ago
Great, with the number of free parameters in hand, I will be able to understand a bit more. It would be great to have the detail for both the tree and the network of all these parameters:
multiplier network.num_branches ann_network.total_num_model_parameters ann_network.total_num_sites ann_network.network.num_tips
Here is the entire command line output of an experiment (on commit version https://github.com/lutteropp/NetRAX/commit/9e66661a83ca2a8d71953d905ade647a735d2b10) I just ran now. I simulated a network with 4 taxa and 1 reticulation, and started the network search from the best raxml-ng tree. cmd_output_4_taxa_1_reticulation_debug_experiment.txt
With SamplingType.PERFECT_SAMPLING (which takes into account reticulation probabilities), SeqGen simulated two partitions of unequal size:
Simulations of 4 taxa, 2001 nucleotides
for 2 tree(s) with 1 dataset(s) per tree
and 2 partitions (and trees) per dataset
Partition No. Sites Relative Rate
1 1824 1.000000
2 177 1.000000
With SamplingType.PERFECT_UNIFORM_SAMPLING (which ignores retiulation probabilities), SeqGen simulated two partitions of equal size:
Simulations of 4 taxa, 2000 nucleotides
for 2 tree(s) with 1 dataset(s) per tree
and 2 partitions (and trees) per dataset
Partition No. Sites Relative Rate
1 1000 1.000000
2 1000 1.000000
I extracted some lines of this large file, which should be of interest here: (line 2779 to 2802, here I used SamplingType.PERFECT_SAMPLING and LikelihoodType.AVERAGE)
Extensive BIC info before applying current ArcInsertionMove move 13/ 26:
multiplier: 1
network.num_reticulations: 0
network.num_branches: 6
ann_network.total_num_model_parameters: 9
ann_network.total_num_sites: 2001
ann_network.network.num_tips: 4
number of partitions in the MSA: 1
sample_size n: 8004
param_count k: 15
logl: -5670.26
bic: 11475.3
Extensive BIC info after applying current ArcInsertionMove move 13/ 26:
multiplier: 1
network.num_reticulations: 1
network.num_branches: 9
ann_network.total_num_model_parameters: 9
ann_network.total_num_sites: 2001
ann_network.network.num_tips: 4
number of partitions in the MSA: 1
sample_size n: 8004
param_count k: 18
logl: -5670.95
bic: 11503.7
(line 5284 to 5307, here used SamplingType.PERFECT_SAMPLING and LikelihoodType.BEST)
Extensive BIC info before applying current ArcInsertionMove move 26/ 26:
multiplier: 1
network.num_reticulations: 0
network.num_branches: 6
ann_network.total_num_model_parameters: 9
ann_network.total_num_sites: 2001
ann_network.network.num_tips: 4
number of partitions in the MSA: 1
sample_size n: 8004
param_count k: 15
logl: -5476.83
bic: 11088.5
Extensive BIC info after applying current ArcInsertionMove move 26/ 26:
multiplier: 1
network.num_reticulations: 1
network.num_branches: 9
ann_network.total_num_model_parameters: 9
ann_network.total_num_sites: 2001
ann_network.network.num_tips: 4
number of partitions in the MSA: 1
sample_size n: 8004
param_count k: 18
logl: -5477.52
bic: 11116.8
(line 7237 to 7260, here I used SamplingType.PERFECT_UNIFORM_SAMPLING and LikelihoodType.AVERAGE)
Extensive BIC info before applying current ArcInsertionMove move 26/ 26:
multiplier: 1
network.num_reticulations: 0
network.num_branches: 6
ann_network.total_num_model_parameters: 9
ann_network.total_num_sites: 2000
ann_network.network.num_tips: 4
number of partitions in the MSA: 1
sample_size n: 8000
param_count k: 15
logl: -5451.47
bic: 11037.7
Extensive BIC info after applying current ArcInsertionMove move 26/ 26:
multiplier: 1
network.num_reticulations: 1
network.num_branches: 9
ann_network.total_num_model_parameters: 9
ann_network.total_num_sites: 2000
ann_network.network.num_tips: 4
number of partitions in the MSA: 1
sample_size n: 8000
param_count k: 18
logl: -5452.16
bic: 11066.1
(line 9190 to 9213, here I used SamplingType.PERFECT_UNIFORM_SAMPLING and LikelihoodType.BEST)
Extensive BIC info before applying current ArcInsertionMove move 26/ 26:
multiplier: 1
network.num_reticulations: 0
network.num_branches: 6
ann_network.total_num_model_parameters: 9
ann_network.total_num_sites: 2000
ann_network.network.num_tips: 4
number of partitions in the MSA: 1
sample_size n: 8000
param_count k: 15
logl: -5366.1
bic: 10867
Extensive BIC info after applying current ArcInsertionMove move 26/ 26:
multiplier: 1
network.num_reticulations: 1
network.num_branches: 9
ann_network.total_num_model_parameters: 9
ann_network.total_num_sites: 2000
ann_network.network.num_tips: 4
number of partitions in the MSA: 1
sample_size n: 8000
param_count k: 18
logl: -5366.79
bic: 10895.4
This run let to the discovery of two bugs:
Anyway, the reported data I posted above shows that without having partitions in the MSA, the BIC does not support any reticulations... (an observation which is obvious for LikelihoodType.BEST, but less obvious for LikelihoodType.AVERAGE)
EDIT: At least for LikelihoodType.AVERAGE, the issue could also be some bug in the loglikelihood computation here. Need to double-check this.
Cross-reference to how BIC is currently computed: https://github.com/lutteropp/NetRAX/issues/19#issuecomment-736502937
I am still surprised about why the loglikelihood score got worse after trying an arc insertion. Shouldn't more reticulations always lead to a better-or-equal loglikelihood score?
Let' s wait for the partition bug to be out of our way before trying to understand this arc insertion problem, OK?
I'd agree with Celine, my guess is that the key factor here is that br-lens are not optimized individually on a per-partition basis
alexis
On 04.12.20 11:39, celinescornavacca wrote:
Let' s wait for the partition bug to be out of our way before trying to understand this arc insertion problem, OK?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/lutteropp/NetRAX/issues/22#issuecomment-738680723, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXB6XEUTJ77RRVECMTAQ3STCU4HANCNFSM4UJDHAAA.
-- Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
I found a mistake in the BIC computation: I had confused unlinked branch length mode and linked/scaled branch length mode. The multiplier should be 1 if we have linked/scaled branch length mode. If we have unlinked branch length mode, then the multiplier should be equal to the number of partitions.
I fixed this in https://github.com/lutteropp/NetRAX/commit/753d08ad5b8a813715a25e94a571153de5605ac6. It shouldn't have affected much though...
Alexey explained to me how Branch-length scalers work. It seems to me that if we have scaled branch lengths, the correct number of free parameters for the branch lengths should be number_of_branches + number_of_partitions. Because in scaled branch lengths mode, each partition as its own scaling factor that it uses to scale the otherwise shared branch lengths.
I fixed the computation of param_count for the BIC computation in https://github.com/lutteropp/NetRAX/commit/402fd046c5cae7e6aa927fbadb70fa24bee356fc. The corrected version is:
size_t get_param_count(AnnotatedNetwork& ann_network) {
Network &network = ann_network.network;
size_t param_count = ann_network.total_num_model_parameters;
if (ann_network.fake_treeinfo->brlen_linkage == PLLMOD_COMMON_BRLEN_UNLINKED) {
param_count += ann_network.fake_treeinfo->partition_count * network.num_branches();
if (!ann_network.options.use_nepal_prob_estimation) { // reticulation probs as free parameters
param_count += ann_network.fake_treeinfo->partition_count * ann_network.network.num_reticulations();
}
} else { // branch lengths are shared among partitions
param_count += network.num_branches();
if (!ann_network.options.use_nepal_prob_estimation) { // reticulation probs as free parameters
param_count += ann_network.network.num_reticulations();
}
if (ann_network.fake_treeinfo->brlen_linkage == PLLMOD_COMMON_BRLEN_SCALED) {
// each partition can scale the branch lengths by its own scaling factor
param_count += ann_network.fake_treeinfo->partition_count - 1;
}
}
return param_count;
}
I believe the BIC issue is fixed now. With the corrected param_count k, correct handling of partitioned MSAs, and the changed optimization routine for reticulation probabilities, the BIC for a simulated 4-taxon-network with 1 reticulation is now better than the BIC I get for the zero-reticulation best tree inferred by raxml-ng. It also appears that the MSA is fine.
There is still another problem in the NetRAX search (which leads to sometimes failed assertions, sometimes just inferred networks that score badly). I already have a clear idea of what is likely going wrong and am investigating it: https://github.com/lutteropp/NetRAX/issues/26
Alexey's explanation is correct from my point of view
On 06.12.20 02:07, Sarah Lutteropp wrote:
Alexey explained to me how Branch-length scalers work. It seems to me that if we have scaled branch lengths, the correct number of free parameters for the branch lengths should be number_of_branches + number_of_partitions. Because in scaled branch lengths mode, each partition as its own scaling factor that it uses to scale the otherwise shared branch lengths.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/lutteropp/NetRAX/issues/22#issuecomment-739433739, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXB6R67IEHEXIACDOI4H3STLDLFANCNFSM4UJDHAAA.
-- Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
I agree with the computation for the linked and scaled modes, but I think that the gammas should be shared by all partitions in the unlinked mode.
the gamma for rate heterogeneity? not necessarily, typically you asssume as separate gamma distribution for each partition as this only induces one extra parameter per partition,
alexis
On 07.12.20 11:01, celinescornavacca wrote:
I agree with the computation for the linked and scaled modes, but I think that the gammas should be shared by all partitions in the unlinked mode.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/lutteropp/NetRAX/issues/22#issuecomment-739778130, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXB6SIV3JTSVYUNO6TDULSTSKXJANCNFSM4UJDHAAA.
-- Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
When @celinescornavacca speaks about "the gammas", she means the reticulation probabilities
Yes, I do not agree with
ann_network.fake_treeinfo->partition_count * ann_network.network.num_reticulations()
ah, okay ... now it makes sense, and yes they should be linked I believe, i.e., one set of reticul. probs. for the entire dataset
On 07.12.20 12:07, Sarah Lutteropp wrote:
I think when @celinescornavacca https://github.com/celinescornavacca speaks about "the gammas", she means the reticulation probabilities
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/lutteropp/NetRAX/issues/22#issuecomment-739815081, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGXB6QE6XSA63OBSWKDCWDSTSSPLANCNFSM4UJDHAAA.
-- Alexandros (Alexis) Stamatakis
Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
www.exelixis-lab.org
Alright! I changed it in https://github.com/lutteropp/NetRAX/commit/ae12ca68eb9be8d3aff1ee31e50d37a147c4c94a. Reticulation probabilities are partition-independent now, no matter which brlen linkage mode is used.
Looks like we can close this issue now. BIC is behaving the way it should now, supporting the reticulations in the simulated datasets and everything.
Because of problems that emerged with the "small_network" experiments (BIC-difference questions), here is the new GitHub issue for verifying the BIC-Scores...
I will post BIC-score information (plus additional data needed for the verification) from a NetRAX run here, where we can be sure that the only difference between the 0-reticulation-network and the 1-reticulation-network lies in a single arc-insertion move that has been performed.