Interactions: ideas/question

paternogbc commented 7 years ago

hey guys,

are you using the previous single interaction functions to build the interaction functions between uncertainties?

For example, in tree interaction with influ, we could just run influ_xxx function between all trees and save the output into a list. So we would have as res a list with n trees full outputs from influ. This could be done with the others if we focus on the main interaction uncertainty (tree and intra).

For example:


for (n in 1:trees){
run influ_phylm()
}

Not sure if I have made my self clear enough. But just so we think about this.

caterinap commented 7 years ago

It's clear and it's a good idea (don't know why I did not think about it before :). I'll try to do this in the functions I am working with!

paternogbc commented 7 years ago

Hey Cat,

nice! I also wondered why I did not implement that approach before. But it seems like a good coding idea to avoid other problems and let the original functions untouched. However this would involve some type of standardization for the output and storing the results. I have some ideas for that. Perhaps we should discuss by skype some solutions for it before implementing more generally?

gijsbertwerner commented 7 years ago

You are right, this sounds like a really obvious thing to do. Like Cat, this is not really what I actually did, instead nesting the two loops within a new interaction_xx function. Conceptually, I think probably your solution would have been much easier. I suppose that the main problem would be output-related, but probably not really something we couldn’t really solve? I am open to a Skype about this (our general Skype we were going to have anyway on Friday?)!

Dr. Gijsbert Werner Newton Fellow, Department of Zoology Junior Research Fellow, Balliol College University of Oxford

Balliol College Broad Street Oxford OX1 3BJ United Kingdom

On 12 Jun 2017, at 16:48, Gustavo Paterno notifications@github.com<mailto:notifications@github.com> wrote:

Hey Cat,

nice! I also wondered why I did not implement that approach before. But it seems like a good coding idea to avoid other problems and let the original functions untouched. However this would involve some type of standardization for the output and storing the results. I have some ideas for that. Perhaps we should discuss by skype some solutions for it before implementing more generally?

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHubhttps://github.com/paternogbc/sensiPhy/issues/147#issuecomment-307831062, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AGauyNqQEQ365uSZ4A4QIBH6hShdHA4dks5sDV3SgaJpZM4Nk1u9.

caterinap commented 7 years ago

By nesting the two loops we also save some steps which otherwise we might need to two twice. So maybe the nesting is a little bit more efficient (not sure though), but the advantages of having less functions are also pretty evident. Should I start implementing it or just stop coding until we take a decision?

paternogbc commented 7 years ago

Nice guys! It is true, the nesting save some steps and probably speeds up. However, I think for code standards it is much better that we improve the simple functions later and everything just keep working instead of building fully new functions. So think it is worth it the approach. For the output I have this idea: The output would just be a list a lists with each full output from the looping function. I think I would for the implementation Cat! Perhaps a different branch so if we find a huge problem with that we can go back.

caterinap commented 7 years ago

Issues to solve, common to all functions (please list them below by editing the message):

1- all functions will run match_dataphy at each iteration and print each time "Used dataset has xx species that match..." and warning messages --> I added a "verbose" argument to match_dataphy, specifying in the help that verbose = F is just for advanced use (or it can be hidden). If ok, we need to add ... to match_dataphy every time it is used inside a function (mostly done in branch interactions_from_single

2- I (Cat) created all functions with the output stored in res <- list(), we can see later how summaryand sensi_plot can deal with it

3- setTxtProgressBaris driving me crazy! Needs to solve this later for all functions.

4- in most of the functions phylolm( ) or phyloglm( ) do not have the ... meaning that people cannot pass arguments there. This should probably be changed everywhere. But, it also creates problems when passing arguments to other functions (e.g. the verbose = FALSE of match_dataphy). Not sure how to deal with it.

paternogbc commented 7 years ago

Simple solution to collapse into a data.frame the same slot in a list of lists as we get from the nested function approach.

recombine <- function(list, slot){
  x <- lapply(list, function(x) x[[slot]])
  do.call("rbind", x)
}

Basically this functions first gets only the specific slot within each list (ex. only the full estimates or only the model estimates) then join them into a big data.frame. The only thing is that the slot that you choose must be a data.frame for this to work ok.

The slot argument indicates each slot of the output should be collapsed:

For example, in the function samp.phylm(), each output has the following slots:

#Generates output:
res <- list(call = match.call(),
            model = model,
            formula = formula,
            full.model.estimates = param0,
            samp.model.estimates = samp.model.estimates,
            sign.analysis = perc.sign.tab,
            data = full.data)

Therefore, recombine(list = output, slot = 4) would collapse the samp.model.estimates data.frame of each samp_phylm analysis into a big data.frame. The number of data.frames joined by rbind is equal to the number of trees (for interaction_tree_xxx functons) or the number of simulations (for the interaction_intra_xxx functions).

After that we also need to include one column that indicates the n.tree (the number of the tree used or the n.sim the number of the simulation in intra functions). Basically this would be the only difference on the big joined data.frame with the estimates of all runs: the extra column indicating the tree/intra simulation used whithin each independent sensiPhy analysis.

So I will try to include this approach on the interaction_tree_clade_phylm function and we can then replicate that to the other functions. =)

Let me know what you think later.

paternogbc commented 7 years ago

Ok Guys,

I made an more general function that can deal with all sorts of outputs from our functions.

### Function to merge data.frame from lists for the interaction functions
recombine <- function(list, slot1, slot2 = NULL){
  ### One level indexing list[[slot1]]:
  if (is.null(slot2)){
    nam <- as.numeric(names(list))
    x <- lapply(list, function(x) x[[slot1]])

    if(is(x[[1]],"list"))
      stop("Please also provide a value for slot2")
    if(is.na(match(class(x[[1]]), c("data.frame", "matrix"))))
      stop("Check if slot indexing returns a valid data.frame or numeric value withn your list")

    n.rows<- nrow(x[[1]])
    N <- rep(nam, each = n.rows)
    res <- data.frame(iteration = N, do.call("rbind", x))
    return(res)
  }

  ### Two levels indixing list[[slot1]][[slot2]]
  if(is.null(slot2) == FALSE){ 
    nam <- as.numeric(names(list))
    x <- lapply(list, function(x) x[[slot1]][[slot2]])

    if(class(x[[1]]) == "data.frame" | class(x[[1]]) == "matrix" | class(x[[1]]) == "numeric" ){
      n.rows<- nrow(x[[1]])

      ### If there is only a single value (e.g. AIC, optpar)
      if(is.null(n.rows)) {
        n.rows <- length(x[[1]])
        est.name <- names(list[[1]][[slot1]])[[slot2]]
        N <- rep(nam, each = n.rows)
        res <- data.frame(iteration = N, do.call("rbind", x))
        colnames(res)[2] <- est.name
        return(res)
      }

      else
        N <- rep(nam, each = n.rows)
      res <- data.frame(iteration = N, do.call("rbind", x))
      return(res)
    }
    else stop("Check if slot indexing returns a valid data.frame or numeric value withn your list") 
  }
}

You can test this function to see how it looks on your development.

The idea here is that some of our outputs from the simple functions (for instance, clade_phylm) are lists them selfs. For example, the full.model.estimates in the example above is a list with 3 objects:

$coef
                Estimate     StdErr   t.value      p.value
(Intercept)    4.9028073 0.31461264 15.583631 1.132208e-27
log(adultMass) 0.2471781 0.03707587  6.666818 1.835804e-09

$aic
[1] 48.15533

$optpar
[1] 0.7892178

So if you would like to make a big data.frame for the full.estimates ($coef) you should use:

recombine(clade.list, slot1 = 4, slot2 = 1)

slot1 defines the first levels within our outputs: call = 1 model = 2 formula = 3 full.model.estimates = 4 clade.model.estimates = 5 null.dist = 6

So by choosing slot1 = 4 we are selecting the full.model.estimates layer. However, because this output is a list we must choose a second slot to be summarized (slot2).

       $coef = **1**
       $aic = 2
       $optpar = 3

So using recombine(clade.list, slot1 = 4, slot2 = 1) will create a joined data.frame with the full.estimates coefficients for each tree.

That how it looks like:

              iteration  Estimate     StdErr   t.value      p.value
(Intercept)            22 4.9249134 0.31639388 15.565767 1.223342e-27
log(adultMass)         22 0.2449646 0.03704520  6.612588 2.356017e-09
(Intercept)            56 4.8537240 0.30666741 15.827323 3.953928e-28
log(adultMass)         56 0.2544114 0.03628947  7.010612 3.722179e-10
(Intercept)            98 4.9207738 0.31272060 15.735368 5.875581e-28
log(adultMass)         98 0.2453387 0.03685528  6.656814 1.922369e-09
(Intercept)            36 4.8846462 0.30894226 15.810871 4.243952e-28
log(adultMass)         36 0.2499889 0.03655089  6.839475 8.261405e-10
(Intercept)            26 4.8401293 0.31224353 15.501136 1.619329e-27
log(adultMass)         26 0.2571136 0.03680982  6.984919 4.196954e-10

Using recombine(clade.list, slot1 = 5) will create a joined data.frame with all clade.model.estimates together:

      iteration           clade N.species intercept DFintercept intercept.perc pval.intercept     slope      DFslope slope.perc   pval.slope      AIC    optpar
22.1         22  Callitrichidae         9  5.142752  0.21783874            4.4   1.445844e-27 0.2163014 -0.028663258       11.7 1.837734e-07 33.28191 0.7976327
22.2         22         Cebidae        19  5.035437  0.11052391            2.2   2.746676e-25 0.2217781 -0.023186562        9.5 4.416898e-07 40.39848 0.7338534
22.3         22 Cercopithecidae        32  4.557833 -0.36708029            7.5   1.482572e-20 0.2993025  0.054337895       22.2 2.549551e-10 42.78957 0.7420038
22.4         22       Lemuridae         8  4.865002 -0.05991120            1.2   9.484201e-26 0.2535154  0.008550784        3.5 2.750156e-09 44.17282 0.8064985
56.1         56  Callitrichidae         9  5.050543  0.19681858            4.1   6.320092e-28 0.2285631 -0.025848356       10.2 2.723531e-08 35.49951 0.7478180
56.2         56         Cebidae        19  5.013987  0.16026325            3.3   2.198563e-25 0.2247071 -0.029704340       11.7 3.014764e-07 40.62690 0.7179453
56.3         56 Cercopithecidae        32  4.486826 -0.36689779            7.6   4.352144e-21 0.3082878  0.053876399       21.2 3.832994e-11 43.74677 0.6799792
56.4         56       Lemuridae         8  4.777488 -0.07623612            1.6   1.776534e-26 0.2656281  0.011216725        4.4 2.260694e-10 45.77215 0.7530338
98.1         98  Callitrichidae         9  5.085459  0.16468482            3.3   8.637622e-28 0.2236782 -0.021660530        8.8 6.198802e-08 34.85068 0.7679321
98.2         98         Cebidae        19  5.038283  0.11750950            2.4   2.829098e-25 0.2211371 -0.024201566        9.9 5.255320e-07 40.38519 0.7326881
...

I am not sure if I was clear, but if you try a little bit and let me know if you can use/understand it would be great.

iteration represents the tree number for interaction_tree_xxx functions. I have to see if this stills works for intra_xxx functions. Next moves. =)

paternogbc commented 7 years ago

One idea to standardize and differentiate the times argument name within nested loops functions. We all had some problems with that earlier, how to differentiate the arguments setting the number of trees and intra simulations in nested functions? So I came up with this solution.

do not use times anymore
use n.sim (number of simulations) instead of times for the simple functions: clade and samp
use n.tree (number of trees) for the simple functions with thre tree method
use n.intra (number of simulated intra data) for the intra method

For the interaction functions: use n.tree or n.intra to indicate the number of trees / data to loop and n.sim to indicate the number of simulations to use in clade or samp methods.

In this way we can solve the times issue and be more specific in the argument naming. It would cause a small change on the simple functions also but it is very easy to do it with the Rstudio functionality replace in scope. What do you think?

E.g:

#for tree_samp functions:
function(n.tree = 50, n.sim = 500, ...)

#for tree_intra functions:
function(n.tree = 50, n.intra = 100, ...)

gijsbertwerner commented 7 years ago

Cool, Gustavo this is really great!

I have been working on this in the interactions_from_single branch, where Caterina had started to construct the interaction functions using the original 'single' functions. I have taken the liberty of copying your new utils.R function to this branch, so that we also have it there too, because that would be useful for testing before merging.

I was working on interaction_intra_infly_phylm, however, and to get it to work there I had to slightly adjust your code. The problem was when generating the 'nam' object, which ultimately will trace the iteration/resimulation numbers in the final output, that when I did this on an influ-list it did not work because in that list items don't have names resulting in an error.

To solve this I have now replaced that line of code with nam<-c(1:length(list)). This achieves the same thing what you were trying to achieve, i.e. a numeric vector of the length of the original list (but correct me if I am wrong), so I hope/think this should not break it when you use it as you originally did (but haven't tested yet).

Really clever function!

gijsbertwerner commented 7 years ago

With regard to the times proposal to change this to n.sims, n.intra and n.tree. Yes, this is great and much more consistent! I will here take the liberty to start changing this in 'single' functions where we would need to do this.

gijsbertwerner commented 7 years ago

Ok, so I have now used all of this to create a summary output for the intra_influ function I had created earlier today (Thanks Caterina, your example was so helpful.

Something quite nice, that I hadn't really thought about is that, if we cleverly design the output, people can use summary(output)to evaluate/summarise the whole output list as a sensiIntra_Influ object, and then evaluate a given item within that list as a 'regular' influ_phylm output, i.e.: summary(output[[1]]). I think the is really neat and useful, so have mentioned in the new help!

You can try it if you pull from the branch and run:

set.seed(01235) testrun<-interaction_intra_influ_phylm(formula =gestaLen ~ adultMass, y.transf = log, x.transf = log, phy = alien$phy[[1]], data = alien$data, Vy = "SD_gesta",n.intra = 10) summary(testrun) summary(testrun[[1]])

I think this is nice because it also illustrates a nice biological point, i.e. that infraspecific variation does influence the most influential species for the slope differently from the most influential effect on the intercept. This is a quite useful distinction, biologically and conceptually, I would say.

I will try and implement the same for the other functions shortly. Should be easy to do now.

paternogbc commented 7 years ago

Hey @gijsbertwerner ,

Very nice work! Regarding the 'recombine' function. The problem is that we should name the list after finishing the loop before handling the the object to 'recombine'. This is important to keep the number of the tree used in each iteration in the case of tree_xxx as main uncertainty and the repetition number on the intra_xxx functions.

Thus, you can look this commit where I added this small piece of code in Caterina function before handling the list to recombine:

names(clade.tree) <- trees

This gives tree number as the list name for each iteration. Then recombine will include this in the iteration column as you can see above. Is that clear or too messy?

paternogbc commented 7 years ago

The summary(testrun[[1]]) finding is really nice! Very cool indeed! =) We should definitely include in the help and the online tutorial! Great catch!

gijsbertwerner commented 7 years ago

Cool, yes, Gustavo, I now better understand how to use recombine having looked at 913b26fcb91f99632bf720762ac2b8ae8443dd80, and have added a similar short list of code where I simply assign the list of 'single' output a range of names.

A second I noticed in that commit is that in Interaction_tree_clade_phylm you / @caterinap do most of the processing using recombine already within the function and create a processed output res. In contrast, what I had done in interaction_intra_influ is simply output a list of 'single' output, and then I use recombine in the summary function. I suppose that the advantage of your approach is that the output is easier to read and interpret for people, but the advantage of my approach is we can do the summary(testrun[[1]]) thing. What do you think? Is it ok if we have different ways of doing it? Or should we standardise?

paternogbc commented 7 years ago

Hey @gijsbertwerner,

I would say that we should definitely standardize the approach. So between the two option I think processing the list into a single data.frame within the function is the best way because we can keep the structure of res alike in all function. The summary(res[[1]]) it is super cool I think. But the users could also achieve similar results by running the simple function instead, right? So I would say we should stick with recombine within functions instead of inside summary.

gijsbertwerner commented 7 years ago

Well, I suppose that for the interaction_tree_xx functions users could indeed replicate the same thing by using single functions on a particular tree, but not for intra and/or samp based interaction functions, because there the dataset is recreated with each iteration. After all, in the res object we would only store the 'original' data call, not the resimulated versions, right?

Alternatively we could store all the resimulated dataset within res for samp/intra interactions, but I suppose then the structure of res would be different among functions.

Anyway, I think you are probably right that calling recombine within the function is cleaner, but outside of it is more flexible for the user in terms of what they can do with the result (I guess they can always manually call recombine or similar on it).

caterinap commented 7 years ago

Ok, here is what I have done:

tree_samp: lm, glm done + modified summaryto show the first column ("iteration") if the object is an interaction (with tree or Intra) - but feel free to reverse the change on the summaryif you don't like it
tree_clade: lm and glm ok - I need to work on the summaryto show the "iteration" column (if we decide to go this way)
tree_influ: lm and glm are written but they need more work to find a nice way to report influential species using recombine
tree_intra: lm and glm ok - this still uses two nested loops: decide if we want to use intrain a treeloop, or leave it as it is, since we want uncertainty from both I would leave it as it is (summary and plot are working)

I am away until the 9 of July (sorry, I did not manage to finish!!), I'll be happy to work on it when I am back.

paternogbc / sensiPhy

Interactions: ideas/question #147