stamats / MKinfer

R package MKinfer
6 stars 0 forks source link

perm and boot.t.test #1

Closed Dokmen closed 3 years ago

Dokmen commented 3 years ago

Hi,

I used this function for my microarray data set. In the gene expression matrix, rows contain genes and columns contain samples. There are 5[1:5] controls and 26[6:31] treatment samples. I want to compare each genes between two groups and use your functions for determine expressed genes. When I use your test with apply function for R=999, the results are taking days. Additionaly df is so large, for this data it should be 29 for each row, but here 12.68958. The code for each row with your function:

apply(data , 1, function(x) { MKinfer::boot.t.test(data[6:31], data[1:5], alternative = "two.sided", var.equal = TRUE, R=999) })

how can we run this code fastest or can you make a suggestion?

stamats commented 3 years ago

Dear Akyanus,

with my function I don't see a direct way of speeding it up such that it would work much quicker in your situation. You could achieve some speed up by doing the computations for the genes by parallel computing.

On the other hand by adapting the code included in my function in might be possible. You would have to do the bootstrapping and the computations in one-step on your whole dataset (Student t-test: lines 76-84, Welch t-test: lines 90-101); i.e., transform your data matrix with the gene expression data into two long vectors and compute the required statistics for all genes simultaneously.

I might add such a functionality to my package MKomics or MKpower (sample size and power simulations for boot./perm.t.test), but currently I don't have time for this.

I can not confirm the issue with the df. I tried it with some data and got the correct df. Are you sure, that you had var.equal = TRUE in your computation? Could please verify this.

Best Regards, Matthias

Dokmen commented 3 years ago

Thank you very much for your response. I just speeded it up a little with the snow package with these lines:

`library(snow) cl <- makeCluster(48) clusterEvalQ(cl, library(MKinfer)) system.time(cl.boot <- parApply(cl, data, 1, function(x) { boot.t.testt(x[6:31], x[1:5], mu=0, alternative = "two.sided", var.equal = TRUE, R=999)}))

user system elapsed 0.408 0.037 2.704 `

this code seems to work. But it was too fast. Is it normal?

It would be very nice if you add such a feature to MKomics and MKpower package. I have been trying to do this bootstrap analyzes for days. These perm and bootstrap are very important for my project and I'm pretty confused. For example, I executed the analysis using the boot package, but I guess I'm following the wrong path. According to the p values I obtained, my adj.p values seem unusual and these values are close to 1. So I have no expression. (I used EdgeR glm and limma, too, I obtained good results). Davison and Hinkley (1997) discussed the value of adj.p in chapter 4, page 175. I booted the p value using the t.test function in the boot package. What I don't understand is how to get the adj.p value based on the calculated confidence intervals and the boot.p value in your package. I calculated adj.p using the boot.p value directly. I guess that's wrong. What is your opinion?

In addtition, When I run only this line: MKinfer::boot.t.test(data[6:31], data[1:5], alternative = "two.sided", var.equal = TRUE, R=10)

the res is:

Bootstrapped Two Sample t-test

data: data[6:31] and data[1:5] bootstrapped p-value < 2.2e-16 95 percent bootstrap percentile confidence interval: -0.03981947 -0.01644654

Results without bootstrap: t = -3.7637, df = 725057, p-value = 0.0001674 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -0.04300381 -0.01355230 sample estimates: mean of x mean of y 0.9973691 1.0256471

Results without bootstrap: var.equal=FALSE t = -3.8021, df = 166621, p-value = 0.0001436 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -0.04285553 -0.01370059 sample estimates: mean of x mean of y 0.9973691 1.0256471

When I run with snow package the result for df is 29.

stamats commented 3 years ago

The computation time with snow is unbelievable low. Hence, I have doubts that the computation really worked. Did you look at the result; i.e. variable cl.boot?

I would also tend to recommend limma. My package MKomics has function mod.t.test, which includes a two-sample t-test based on limma.

Regarding the adjustment of p values in case of the bootstrap t-test: I would proceed as usual. That is, first perform all tests and save the p values. Next, I would compute adjusted p values with function p.adjust or with Bioconductor package multtest.

Running the function with the whole dataset is also not what you want. The idea I proposed would really mean that one would have to implement a new function that operates on data matrices.

Dokmen commented 3 years ago

In the results, I could'nt see bootstrapped t value, so I do not have an idea and I could not convert the results list to dataframe. When I convert it, only obs. t test values, I can see with brrom::tidy. I tried almost a lot of code, but I could not convert all data to dataframe. I run the codes on the server, open ondemand. May be, acording the clustered nod, only 48 t value was produced. I'm nor sure. I will try again with perm.t.test. Have you ever run this snow code?

stamats commented 3 years ago

I haven't tried your code. But you could modify your code to

system.time(cl.boot <- parApply(cl, data, 1, function(x) { boot.t.testt(x[6:31], x[1:5], mu=0, alternative = "two.sided", var.equal = TRUE, R=999)$boot.p.value}))

Such that only the bootstrapped p values are returned and check whether the number of p values corresponds to the respective dimension (number of rows) of your data.