lem-usp / EvolQG

Tools for Evolutionary Quantitative Genetics
http://cran.r-project.org/web/packages/evolqg/
Other
10 stars 8 forks source link

Compare matrices via Mantel Correlation - MantelCor #119

Closed RobertBakaric closed 6 years ago

RobertBakaric commented 6 years ago

Quick questions regarding MantelCor function.

First: When I create replicates using:

reps <- unlist(lapply(cov.list, MonteCarloRep, 10, MatrixCor, correlation = TRUE, parallel=TRUE))

everything works and i get:

[1] 0.8742364 0.9324694 0.9910601

thsi seams ok, but final results:

  MantelCor(cor.list, repeat.vector = reps)
  nperm' >= set of all permutations: complete enumeration.   
  Set of permutations < 'minperm'. Generating entire set.    
  'nperm' >= set of all permutations: complete enumeration.   
  Set of permutations < 'minperm'. Generating entire set.    
  'nperm' >= set of all permutations: complete enumeration.   
  Set of permutations < 'minperm'. Generating entire set.   
 $correlations
                1         2         3
 1 0.8742364 0.8790703 0.7090542
 2 0.7936981 0.9324694 0.9541994
 3 0.6599999 0.9172895 0.9910601

 $probabilities
       1          2 3
 1 0.00000000 0.00000000 0
 2 0.04166667 0.00000000 0
 3 0.04166667 0.04166667 0

Now according to doc, probabilities are not computed (0's upper triangle), right?

Second: If I want to keep probabilities for the significance calculation in reps i should avoid MatrixCor. If I do that and run the calculation in parallel this takes ages. In fact i cannot compute it even after 2 weeks on 120cpu's. What is the problem ? What am I doing wrong?

cheers

diogro commented 6 years ago

Now according to doc, probabilities are not computed (0's upper triangle), right?

The probabilities are computed by MantelCor, and are in the lower triangle of the $probability output. The function that doesn't output probabilities is MatrixCor.

If I want to keep probabilities for the significance calculation in reps i should avoid MatrixCor. If I do that and run the calculation in parallel this takes ages. In fact i cannot compute it even after 2 weeks on 120cpu's. What is the problem ? What am I doing wrong?

How many matrices are you comparing? How big are them?

Using too many cores can hurt more than it helps. Parallelization speedup is not linear and setting up the threads has some overhead, and for the overhead of setting up the threads to be worth it you need to be comparing a lot of matrices. Using too many cores won't help because creating all those threads is more overhead than it's worth.

For 100 15x15 matrices I get these times using variable number of cores:

library(evolqg)
library(microbenchmark)
library(doMC)

cor.list = RandomMatrix(15, 100)
microbenchmark(MantelCor(cor.list),
               {registerDoMC(2); MantelCor(cor.list, parallel = TRUE)},
               {registerDoMC(3); MantelCor(cor.list, parallel = TRUE)},
               {registerDoMC(4); MantelCor(cor.list, parallel = TRUE)})
#Unit: seconds
#expr      min       lq     mean   median       uq      max neval
#MantelCor(cor.list) 320.1615 320.1615 320.1615 320.1615 320.1615 320.1615     1
#{     registerDoMC(2)     MantelCor(cor.list, parallel = TRUE) } 197.6104 197.6104 197.6104 197.6104 197.6104 197.6104     1
#{     registerDoMC(3)     MantelCor(cor.list, parallel = TRUE) } 179.7237 179.7237 179.7237 179.7237 179.7237 179.7237     1
#{     registerDoMC(4)     MantelCor(cor.list, parallel = TRUE) } 196.0484 196.0484 196.0484 196.0484 196.0484 196.0484     1    

We see some advantage of parallelization, but not linear (using 2 cores doesn't make it twice as fast). 3 cores is best, 4 is already getting worse (but this is related to me doing other things in my 4 core machine as I ran this). Usually anything above 4-6 cores is only going to make things worse.

RobertBakaric commented 6 years ago

Thank you. But it seams I have shifted the focus from my problem by thinking the parallelization is the issue. But at this point I think I am using it incorrectly. Let me restate my poblem:

I have 3 4x4 matrices

If I execute:

reps<-unlist(lapply(cov.list, MonteCarloRep, 30, MatrixCor, correlation = TRUE))

I get a result.

If I execute:

reps<- unlist(lapply(cov.list, MonteCarloRep, 30, correlation = TRUE))

it runs forever.

Question 1: Why? Question 2: does 30 refer to the number of individuals i.e. if 30 individuals was used to make each covariance matrix. Or this number should stand for the number of random replicates (e.g 1000), irrespectively to the sample size of the real sample per population?

Now when I run MantelCor (I used cov2cor function)

MantelCor(cor.list, repeat.vector = reps)

i get two matrices (as posted above) and in probability matrix i get always the same result (regardless whether I used or did not use repeat.vector -> $probabilities in the original post) So this confuses me and brings me to additional questions regarding montecarlo simulations: Question3: How many replicates was used for MonteCarloRep in the function above? I assumed 1000 by the default function for MonteCarloRep which should be enough to at least affect my probabilities a little (or am I misunderstanding something again).

Question4: How many permutations were used in MantelCor in the function above? (I assume 1000- is this correct)

diogro commented 6 years ago

Thank you. But it seams I have shifted the focus from my problem by thinking the parallelization is the issue. But at this point I think I am using it incorrectly. Let me restate my poblem:

I have 3 4x4 matrices

If I execute:

reps<-unlist(lapply(cov.list, MonteCarloRep, 30, MatrixCor, correlation = TRUE))

I get a result.

If I execute:

reps<- unlist(lapply(cov.list, MonteCarloRep, 30, correlation = TRUE))

it runs forever.

Question 1: Why?

MonteCarloRep needs a comparison function with which to calculate the repetability. I'm not sure what is happening with the second call, but presumably R in hanging on the missing comparison call. I'll have to investigate further and add an error, but calling MonteCarloRep without a comparison function is not a valid call. Your first call with MatrixCor is the correct one for this case.

Question 2: does 30 refer to the number of individuals i.e. if 30 individuals was used to make each covariance matrix. Or this number should stand for the number of random replicates (e.g 1000), irrespectively to the sample size of the real sample per population?

MonteCarloRep works by using the supplied covariance matrix as the parameter in a multivariate gaussian distribution. Samples are taken from this from this distribution and a statistic (either the correlation or the covariance matrix) is calculated on these samples. This statistic is then compared to the supplied covariance matrix using the comparison function. The sample.size argument gives the size of these samples of the multivariate normal distribution. Usually we use either the original sample used to calculate the supplied covariance matrix, or the minimal sample in all the covariance matrices.

Question3: How many replicates was used for MonteCarloRep in the function above? I assumed 1000 by the default function for MonteCarloRep which should be enough to at least affect my probabilities a little (or am I misunderstanding something again).

The number of times this procedure is done defaults to 1000, and can be set using the iterations argument.

So, to use 30 individuals for calculating the correlation matrix 500 times you would do:

reps<-unlist(lapply(cov.list, MonteCarloRep, sample.size = 30, MatrixCor, correlation = TRUE, iterations = 500))

i get two matrices (as posted above) and in probability matrix i get always the same result (regardless whether I used or did not use repeat.vector -> $probabilities in the original post) So this confuses me and brings me to additional questions regarding montecarlo simulations:

The probabilities do not depend on the repeatabilities, so using the repeat.vector will have no effect on them. Repeatabilities only correct the correlations, not their significance.

Question4: How many permutations were used in MantelCor in the function above? (I assume 1000- is this correct)

Yes, that is correct, but with 4 traits 1000 is more than the number of possible permutations, so the function will throw a warning and just do all possible permutations.

RobertBakaric commented 6 years ago

Thank you very much. The above comments cleared all my confusions.

Thank you