Only testing clades of a certain size

CharlieCarpenter commented 2 years ago

Hello, I working on combining this method with kernel machine tests. We only want to test on clades with 20 or more leaf nodes. I feel that this should be a simple if else statement, but I am having a lot of trouble updating the code to make it work. It would be amazing if you had any isight into how to make this happen. Thank you!

merlebehr commented 2 years ago

First of all, thanks a lot for your interest in treeSeg!

The testing procedure which treeSeg performs is constructed in such a way that it estimates the smallest number of segments (active nodes) such that the multiscale statistic accepts, i.e., locally on all sub-segments individual LR tests accept. Say your data is such that treeSeg estimates K active nodes.

If you only want to consider clades of a minimum size, e.g., 20, then there are three possibilities:

1) there exists a segmentation which satisfies the multiscale constraint with minimum clade size of 20 which consists of just K active nodes 2) there exists no such segmentation, but there exists one with more than K active nodes 3) there exists no segmentation (with arbitrary number of active nodes) which has minimum clade size of 20 and satisfies the multiscale constraint

In the first case, you can derive a valid segmentation from the confidence set which is provided by treeSeg. The current implementation only returns the union of all active nodes in the confidence set (the confSetAN element of the returned list). If the final solution only has a single active node, then this equals the set of one element sets which constitute all valid segmentations. If there is more than one active node, then you can recover this from the “combCS” list, which you find in line 220 of the treeSeg.R file. combCS is a list with one element per inner node of the tree, which correspond to the treeSeg solutions on the respective sub-trees. The $comb element of this list contains the confidence set for the active nodes.

In the second case, the situation is more complicated and would require changes of the algorithm at several places, as a solution which satisfies the multiscale constraint but has a clade size of fewer than 20 nodes would not be valid for the sub-trees either.

In the third case, there would be no solution to the problem.

As you see, unfortunately there is no super easy way to incorporate a minimum clade size in treeSeg. If you want to follow the first approach and you have further questions, let me know. Hope this was somehow helpful.

Best, Merle

CharlieCarpenter commented 2 years ago

Hello. Thank you for getting back to me so quickly and for making this package! Very cool method and a great paper.

I'm not exactly sure what you mean by "segmentation." My understanding of the method, is that you are testing within clade to find "change points" along the tree. These change points are the "active nodes." For us, we are uninterested in testing within any clade with less that 20 leaf nodes.

I was hoping that there would be a simple solution such as

IntegerVector tips = getOffspringTip(Anc[i],tree);

if(tips.length() > 20) { * do tests } else{ *skip\ }

This might require me updating some of your code in segTree before the large while statement, but I just wanted to see if it was even possible?

Also, we understand that we are stretching the multiscale theory, but we are okay with that for our purposes.

Thank you!

merlebehr commented 2 years ago

Hallo,

Yes, you are right, treeSeg searches for change-points in the distribution of the tip nodes of the tree. These change-points corresponds to the active nodes (inner nodes in the tree) via their respective offspring tip nodes. With "segmentation" I am referring to the segmentation of the tip nodes which is induced by the active nodes.

If you just don't want to perform any tests on clades with less than 20 tip nodes, you can achieve this by modifying the “lengths” argument in the treeSeg() R function. Currently, you can chose between the arguments “all” and “dyadic”, which corresponds to performing tests on “all” intervals and only on those with “dyadic” lengths. If you modify line 145-153 of “treeSeg.R” in such a way that you select “lengths <- 20:n”, then this implies that you perform tests on all intervals with at least 20 tip nodes.

Hope this helps?

Best, Merle

ansariazim commented 2 years ago

Not sure if this other software helps in any way or not:

https://github.com/ansariazim/treeBreaker

Best. Azim

CharlieCarpenter commented 2 years ago

Oh is it really that simple! Thank you Merle that is a huge help. And thank you Azim I will look into your paper as well. It looks very interesting from the abstract.

CharlieCarpenter commented 2 years ago

Hello again. I have tried updating the "lengths" variable in the the treeSeg function. I believe that this will also require updating the "startLi" variable in segTree, correct? As of now it only detects lengths.length() = n or dyadic. I've tried to make a change to have it do the same thing with "startLi" for lengths = 20:length(y) as it does for lengths = 1:length(y). I'm not sure if this is going to mess things up further down though.

Thank you!

merlebehr commented 2 years ago

Hi Charlie,

You are right, one also has to update the startLi variable in the segTree.cpp function. Sorry, I totally forgot about that.

The startLi variable contains the indices of the confidence bounds where the respective intervals with new left values start. The bounds are generated with the stepR::bounds function and they are ordered by the start (left) index of the intervals and then by the end (right) index.

So when you update line 13-30 of segTree.cpp according to your interval system, this should work.

CharlieCarpenter commented 2 years ago

Hi Merle,

Please don't be sorry! I really appreciate you being so responsive.

I have been trying to figure this step out today, but I've had no luck so far. I'm guessing that I need to update the "getiRi" function? Or will this still work when I use allInt = 1?

merlebehr commented 2 years ago

Hi Charlie, the "getiRi()" function should work with allInt = 1 also for general interval lengths.

MWSchmid commented 1 month ago

Hi Merle

Thanks for the library :)

I was wondering what dyadic lengths would be, found it:

upTo <- floor(log2(length(vectorWithPhenotypes))) lengths <- 2^(0:upTo) lengths

Is there a particular reason for this stepping? Is it especially optimal in a way that it should uncover most of the cases with active nodes? Does it rely on a particular assumption regarding the tree?

I just have 348 tips, but scanning ~ 10 k phenotypes takes a bit long with all lengths so I was wondering whether it makes sense to just speed up with length = dyadic and then to eventually check some in greater detail. Or - can you a priori tell which cases take a very long time? E.g. the ones that have only few or many "1" in the phenotype vector?

Best,

Marc

merlebehr commented 1 month ago

Hi Marc,

In general, if you restrict the lengths of candidate intervals, then confidence statements will still be valid, but you lose detection power. On the other hand, working with all lengths might be computationally too expensive. The motivation for dyadic lengths is that they provide a trade-off, containing intervals on all different scales.

If you had prior knowledge that active nodes do not appear on subtrees smaller than a certain size, then it would make sense to exclude those lengths as well.

In general, computation time increases with the number of detected active nodes, which corresponds to the clustering of 1's and 0's in your phenotype but not to the total number of 1's or 0's.

My suggestion would be to apply the 'treeTest' function first, which is much faster than 'treeSeg' and which gives a lower bound on the number of active nodes. Note that if treeTest outputs zero, then treeSeg will also not detect any active nodes, so you don't need to run treeSeg in that case.

Hope this helps!

Best, Merle

On Tue, 13 Aug 2024 at 10:36, Marc @.***> wrote:

Hi Merle

Thanks for the library :)

I was wondering what dyadic lengths would be, found it:

upTo <- floor(log2(length(vectorWithPhenotypes))) lengths <- 2^(0:upTo) lengths

Is there a particular reason for this stepping? Is it especially optimal in a way that it should uncover most of the cases with active nodes? Does it rely on a particular assumption regarding the tree?

I just have 348 tips, but scanning ~ 10 k phenotypes takes a bit long with all lengths so I was wondering whether it makes sense to just speed up with length = dyadic and then to eventually check some in greater detail. Or - can you a priori tell which cases take a very long time? E.g. the ones that have only few or many "1" in the phenotype vector?

Best,

Marc

— Reply to this email directly, view it on GitHub https://github.com/merlebehr/treeSeg/issues/3#issuecomment-2285680801, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJO5NENQIQDQZQHA5IYFDZLZRHAPXAVCNFSM6AAAAABMNYRDG6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOBVGY4DAOBQGE . You are receiving this because you commented.Message ID: @.***>

MWSchmid commented 3 weeks ago

Hi Merle

Thanks for the help. I did add treeTest and there's some improvement. There is another problem though: a handful of cases take very long (at least days). I combined now treeTest(), treeSeg(), and treeSeq(lengths = "dyadic") with the withTimeout() function from R.utils. I thought that together with mclapply(), one can scan a few thousand phenotypes pretty well:

library("treeSeg")
library("parallel")
library("R.utils")

f.do.test <- function(x, myTree, myAlpha = 0.01) {
  initRes <- withTimeout({treeTest(as.numeric(x), myTree, alpha = myAlpha)},timeout=300, onTimeout="warning")
  if (sum(grepl("reached", initRes)) > 0) {
    ansSim <- list(numbAN = NA)
  } else {
    if (initRes > 0) {
      ansSim <- withTimeout({treeSeg(as.numeric(x), myTree, alpha = myAlpha)},timeout=300, onTimeout="warning")
      if (sum(grepl("reached", ansSim)) > 0) {
        ansSim <- withTimeout({treeSeg(as.numeric(x), myTree, alpha = myAlpha, lengths = "dyadic")},timeout=300, onTimeout="warning")
        if (sum(grepl("reached", ansSim)) > 0) {
          ansSim <- list(numbAN = NA)
        }
      }
    } else {
      ansSim <- list(numbAN = 0)
    }
  }
  return(ansSim)
}

# phenoTest is a list/df with one phenotype per column
testRes <- mclapply(phenoTest, function(x) f.do.test(x, phyloTree), mc.cores = 30)

However, I think that the timeout does not work as I hoped for because time limits are only checked whenever a user interrupt could occur. In the C/Fortran code, it only happens if the code checks for user interrupts. And for atomic operations that take very long, it would also not work. In case it's the first, it would be a nice feature for the future :)

Best,

Marc

merlebehr / treeSeg

Only testing clades of a certain size #3