Open atakanekiz opened 6 years ago
(best bug report ever, reprex
FTW.)
Towards your first question, the test statistic depends only on the order of the data, and is invariant with respect to any monotone transform. (c.f. the help page for bws_stat
https://github.com/shabbychef/BWStest/blob/master/man/bws_stat.Rd#L33-L36 ) If your test data only shows a limited number of arrangements, you will only see a few different values of the statistic.
Your larger question will take a little more time to digest.
That makes sense. In my case, it seems like I set up the bws
test incorrectly. If I'm understanding right, ties are happening when two genes have the same expression profile as in the following example:
set.seed(10)
gene_num <- 100
test <- data.frame(sample1=c(rnorm(gene_num/2, mean=50, sd=10), rnorm(gene_num/2, mean=20, sd=10)),
sample2=c(rnorm(gene_num/2, mean=50, sd=10), rnorm(gene_num/2, mean=20, sd=10)),
sample3=c(rnorm(gene_num/2, mean=50, sd=10), rnorm(gene_num/2, mean=20, sd=10)),
sample_a4=c(rnorm(gene_num/2, mean=50, sd=10), rnorm(gene_num/2, mean=20, sd=10)),
reference_b1=c(rnorm(gene_num/2, mean=20, sd=10), rnorm(gene_num/2, mean=40, sd=10)),
reference_b2=c(rnorm(gene_num/2, mean=20, sd=10), rnorm(gene_num/2, mean=40, sd=10)),
reference_b3=c(rnorm(gene_num/2, mean=20, sd=10), rnorm(gene_num/2, mean=40, sd=10)))
genes_row <- character()
for(i in 1:gene_num){
add_gene <- paste(sample(LETTERS,4), collapse = "")
genes_row <- c(genes_row, add_gene)
}
rownames(test) <- genes_row
##############################################################################
############### Use bws package for ranking ##################################
##############################################################################
library(BWStest)
bws_ranked <- numeric()
for(i in rownames(test)){
samp <- as.numeric(test[i,1:4])
ref <- as.numeric(test[i,5:7])
bws_ranked[[i]] <- bws_stat(samp, ref)
}
##############################################################################
bws_ranked[1:5] # Show first few results to see which ones are tied.
#> KFYP SCRH BIYU SCBN KDIA
#> 3.233424 3.233424 1.371321 3.233424 3.233424
test["KFYP",] # example ties
#> sample1 sample2 sample3 sample_a4 reference_b1 reference_b2
#> KFYP 50.18746 42.38196 62.15514 65.02545 19.20543 28.69475
#> reference_b3
#> KFYP 10.9803
test["SCRH",] # example ties
#> sample1 sample2 sample3 sample_a4 reference_b1 reference_b2
#> SCRH 48.15747 54.19375 53.30876 55.90409 31.81752 13.1999
#> reference_b3
#> SCRH 19.4525
Created on 2018-11-19 by the reprex package (v0.2.1)
In this example, genes KFYP
and SCRH
have similar expression profiles (high-high-high-high-low-low-low), resulting in a tie right?
My goal is a bit different though. I'd like to compare two groups and rank the genes ("high" expression in sample vs reference = high ranking). Would it be possible to use bws
in that context? The article I posted in my initial post utilizes BWS somehow, but I'm not sure how.
Best, Atakan
Found some relevant information about the paper I referenced in their github page. Unfortunately for me, it is in MATLAB
language, which I have no experience with. Now, I just need to find a way of writing a similar piece code in R to make it work in my analytical pipeline. I'm looking forward to hear your thoughts. Thanks so much!
....
case 'BWS'
ranks = zeros(1,m);
ind0 = 1:n0;
ind1 = 1:n1;
for a=1:m
rank_tmp = tiedrank(data(a,:));
R1 = sort(rank_tmp(gr0));
R2 = sort(rank_tmp(gr1));
B0 = sum((R1 - ((n1+n0)/n0 * ind0)).^2./((ind0./(n0+1)) .* (1-(ind0/(n0+1))) * ((n1*(n1+n0))/n0)));
B1 = sum((R2 - ((n1+n0)/n1 * ind1)).^2./((ind1./(n1+1)) .* (1-(ind1/(n1+1))) * ((n0*(n1+n0))/n1)));
ranks(a) = (B0/n0 + B1/n1)/2;
end
tail = 'right';
....
The following code computes the "7 choose 4" possible outcomes the BWS test statistic can take for the case of sample sizes 4 and 3:
test <- setNames(data.frame(samples=t(rbind(combn(1:7,4),8 - combn(1:7,3)))),
c(paste0('sample',1:3), 'sample_4a', paste0('reference_b',1:3)))
require(BWStest)
bws_ranked <- sapply(seq_len(nrow(test)),function(i) {
samp <- as.numeric(test[i,1:4])
ref <- as.numeric(test[i,5:7])
bws_stat(samp, ref)
})
I get these values:
c(2.9526703042328, 1.79592427248677, 1.00227347883598, 0.313781415343915,
0.921916335978836, 0.88322585978836, 0.512194113756614, 0.313781415343915,
0.0588210978835979, 0.387194113756614, 0.282035383597884, 0.207630621693122,
0.157035383597884, 0.231440145502646, 0.249297288359788, 0.532035383597884,
0.13322585978836, 0.228463955026455, 0.227471891534392, 0.46854332010582,
0.207630621693122, 0.217551256613757, 0.177868716931217, 0.24136078042328,
0.726479828042328, 0.619336970899471, 0.227471891534392, 0.38620205026455,
0.679852843915344, 1.0369957010582, 1.08560681216931, 1.26020998677249,
1.26020998677249, 2.04592427248677, 3.23342427248677)
Some of these are duplicated as well. The unique values among them are
c(2.9526703042328, 1.79592427248677, 1.00227347883598, 0.31378141534392,
0.92191633597884, 0.88322585978836, 0.51219411375661, 0.0588210978836,
0.38719411375661, 0.28203538359788, 0.20763062169312, 0.15703538359788,
0.23144014550265, 0.24929728835979, 0.53203538359788, 0.13322585978836,
0.22846395502646, 0.22747189153439, 0.46854332010582, 0.21755125661376,
0.17786871693122, 0.24136078042328, 0.72647982804233, 0.61933697089947,
0.38620205026455, 0.67985284391534, 1.0369957010582, 1.08560681216931,
1.26020998677249, 2.04592427248677, 3.23342427248677)
The code you have found in Matlab is computed the BWS tests statistic. The computation of B0
and B1
and ranks(a)
is essentially done here in C++ code: https://github.com/shabbychef/BWStest/blob/master/src/bws_stat.cpp#L117-L122
The Matlab code is computing this statistic row-by-row. The only odd Matlab syntax I see here is the use of 'dot operators' for Hadamard operations. That is a .* b
is Hadamard times, and a .^ b
would be Hadmard (elementwise) a
to the b
.
To be sure, I took the Matlab code above, and adapted it to the first row of my example output above, where the samples take values 1 through 4 and the reference takes 5 through 7. The code is as follows:
R1 = 1:4;
R2 = 5:7;
n0 = length(R1);
n1 = length(R2);
ind0 = 1:n0;
ind1 = 1:n1;
B0 = sum((R1 - ((n1+n0)/n0 * ind0)).^2./((ind0./(n0+1)) .* (1-(ind0/(n0+1))) * ((n1*(n1+n0))/n0)));
B1 = sum((R2 - ((n1+n0)/n1 * ind1)).^2./((ind1./(n1+1)) .* (1-(ind1/(n1+1))) * ((n0*(n1+n0))/n1)));
ranksa = (B0/n0 + B1/n1)/2;
ranksa
I executed this with the online octave interpreter where it claimed that the value of the statistic was 2.9527, which matches the value computed above.
Thanks for your comments. I distilled my question a bit more by using the following example. I simulated a data frame having 7 observations (4 sample
, 3 reference
) and 5 genes
. Between the ref and the sample, the ordering is the same for each gene (ref1<2<3<samp1<2<3<4
), but the magnitude of differences are different (10:1, 10:3, 10:5, 10:7, 10:9
). I expect the genes changing at higher magnitudes to be ranked first. This ranking is important for my real world application where genes that are changing more than others will mean something important biologically. Even though the row-wise (gene-wise) BWS computes with no problem, the output isn't informative.
test <- data.frame(rbind(c(10.1, 10.2, 10.3, 10.4, 1.1, 1.2, 1.3), # Should be ranked first (highest difference sample vs ref)
c(10.1, 10.2, 10.3, 10.4, 3.1, 3.2, 3.3), # Should be ranked second
c(10.1, 10.2, 10.3, 10.4, 5.1, 5.2, 5.3), # Should be ranked third
c(10.1, 10.2, 10.3, 10.4, 7.1, 7.2, 7.3), # Should be ranked fourth
c(10.1, 10.2, 10.3, 10.4, 9.1, 9.2, 9.3))) # Should be ranked last (lowest difference sample vs ref)
# Give mock names to the data frame
genes_row <- character()
for(i in 1:5){
add_gene <- paste(sample(LETTERS,4), collapse = "")
genes_row <- c(genes_row, add_gene)
}
rownames(test) <- genes_row
colnames(test) <- c(paste0("sample",1:4), paste0("ref",1:3))
pheatmap::pheatmap(test, cluster_rows = F, cluster_cols = F)
# Compute BWS statistics
require(BWStest)
#> Loading required package: BWStest
bws_ranked <- sapply(seq_len(nrow(test)),function(i) {
samp <- as.numeric(test[i,1:4])
ref <- as.numeric(test[i,5:7])
bws_stat(samp, ref)
})
names(bws_ranked)<-rownames(test)
# All the statistics are the same
bws_ranked
#> CRVX RUAO SNDE CEIV HUNZ
#> 3.233424 3.233424 3.233424 3.233424 3.233424
Created on 2018-11-20 by the reprex package (v0.2.1)
The BWS test is a test of equal probability distributions, not a test of location. Moreover, as noted in the documentation, and in the original paper, the test is invariant with respect to monotonic transforms of the data, so for example
require(BWStest)
set.seed(1234)
x <- 10 + runif(5)
y <- 5 + runif(5)
s1 <- bws_stat(x,y)
s2 <- bws_stat(log(x),log(y))
s3 <- bws_stat(exp(x),exp(y))
dput(c(s1,s2,s3))
I get
c(4.894, 4.894, 4.894)
But now note that a monotonic transform could 'scrunch up' the data between their split:
fnc <- function(x) { 10 - 1 / x }
fnc(x)
fnc(y)
bws_stat(fnc(x),fnc(y))
[1] 4.894
Perhaps you should be using a robust test for location?
I'm trying various methods to see which one is the most appropriate one. For the mock example I gave above, a t-test would work just fine (especially with many observations, normality assumption may not be an issue either). The data I'm trying to analyze comes from single-cell RNA sequencing technology where many reads are zeros (see figure below)
The published paper I included in my previous post found BWS test robust to this type of a problem in a different experimental setup (microarray), that's why I thought it might be appropriate for scSeq. I should dive a bit deeper to have a better understanding I think.
Hello,
Thanks for this helpful package. I'm trying to use this package to analyze for biological gene expression data. BWS method was suggested to be a powerful approach for non-parametric analysis of gene expression data this, this and this articles
My end goal is comparing two groups (
sample
vsreference
) with multiple observations in each case (cell_a1/2/3/4
,cell_b1/2/3
) and rank the gene expression for further analysis. I'm using different metrics to rank the gene expression and ranking the genes based on the statistics output of the relevant approach (i.e.t-value
fort.test
, andB
value forBWS
). Simulated data below helps demonstrate my question. The mockdata.frame
contains gene expression data in rows (each row is a gene) and samples in the columns (4 experimental samples, 3 references).My question has two facets:
1- Why do I have many duplicated BWS statistics value in this mock data frame even though the data were sampled randomly?
2- Any ideas why I have a low correlation between BWS method and the other the comparison methods? As a reference,
S2N
(signal to noise) metric is commonly used for the type of analysis that I'm trying perform here. The problem with using S2N in my actual case is the non-normal distribution of expression data. In reality, I have whole bunch of zero values and some non-zero values. Since the data aren't distributed normally, S2N metric might not be the most appropriate method (S2N is calculated as (mean1-mean2)/(stdev1+stdev2))Your thoughts are greatly appreciated.
Best, Atakan
Created on 2018-11-19 by the reprex package (v0.2.1)