shabbychef / BWStest

R package for the Baumgartner Weiss Schindler 2-sample test of equal probability distributions.
2 stars 1 forks source link

Understanding the BWS results #5

Open atakanekiz opened 5 years ago

atakanekiz commented 5 years ago

Hello,

Thanks for this helpful package. I'm trying to use this package to analyze for biological gene expression data. BWS method was suggested to be a powerful approach for non-parametric analysis of gene expression data this, this and this articles

My end goal is comparing two groups (sample vs reference) with multiple observations in each case (cell_a1/2/3/4, cell_b1/2/3) and rank the gene expression for further analysis. I'm using different metrics to rank the gene expression and ranking the genes based on the statistics output of the relevant approach (i.e. t-value for t.test, and B value for BWS). Simulated data below helps demonstrate my question. The mock data.frame contains gene expression data in rows (each row is a gene) and samples in the columns (4 experimental samples, 3 references).

My question has two facets:

1- Why do I have many duplicated BWS statistics value in this mock data frame even though the data were sampled randomly?

2- Any ideas why I have a low correlation between BWS method and the other the comparison methods? As a reference, S2N (signal to noise) metric is commonly used for the type of analysis that I'm trying perform here. The problem with using S2N in my actual case is the non-normal distribution of expression data. In reality, I have whole bunch of zero values and some non-zero values. Since the data aren't distributed normally, S2N metric might not be the most appropriate method (S2N is calculated as (mean1-mean2)/(stdev1+stdev2))

Your thoughts are greatly appreciated.

Best, Atakan

set.seed(10)

gene_num <- 100

test <- data.frame(sample1=c(rnorm(gene_num/2, mean=50, sd=10), rnorm(gene_num/2, mean=20, sd=10)),
                   sample2=c(rnorm(gene_num/2, mean=50, sd=10), rnorm(gene_num/2, mean=20, sd=10)),
                   sample3=c(rnorm(gene_num/2, mean=50, sd=10), rnorm(gene_num/2, mean=20, sd=10)),
                   sample_a4=c(rnorm(gene_num/2, mean=50, sd=10), rnorm(gene_num/2, mean=20, sd=10)),
                   reference_b1=c(rnorm(gene_num/2, mean=20, sd=10), rnorm(gene_num/2, mean=40, sd=10)),
                   reference_b2=c(rnorm(gene_num/2, mean=20, sd=10), rnorm(gene_num/2, mean=40, sd=10)),
                   reference_b3=c(rnorm(gene_num/2, mean=20, sd=10), rnorm(gene_num/2, mean=40, sd=10)))

genes_row <- character()

for(i in 1:gene_num){

  add_gene <- paste(sample(LETTERS,4), collapse = "")

  genes_row <- c(genes_row, add_gene)

  }

rownames(test) <- genes_row

pheatmap::pheatmap(test)


# S2N ranking
##############################################################################
grp1 <- test[,1:4]
grp2 <- test[,5:7]

grp1_means <- rowMeans(grp1)
grp2_means <- rowMeans(grp2)

grp1_sd <- apply(grp1, 1, sd)
grp2_sd <- apply(grp2, 1, sd)

s2n_ranked <- (grp1_means - grp2_means) / (grp1_sd + grp2_sd)

s2n_ranked <- sort(s2n_ranked, decreasing = T)
# s2n_ranked

# tTest ranking
##############################################################################
grp1 <- test[,1:4]
grp2 <- test[,5:7]

grp1_means <- rowMeans(grp1)
grp2_means <- rowMeans(grp2)

grp1_sd <- apply(grp1, 1, sd)
grp2_sd <- apply(grp2, 1, sd)

grp1_n <- dim(grp1)[2]
grp2_n <- dim(grp2)[2]

ttest_ranked <- (grp1_means - grp2_means) / sqrt((grp1_sd^2/grp1_n)+(grp2_sd^2/grp2_n))
ttest_ranked <- sort(ttest_ranked, decreasing = T)
# ttest_ranked

# Ratio ranking (Recommended for log scale data)
##############################################################################
grp1 <- test[,1:4]
grp2 <- test[,5:7]

grp1_means <- rowMeans(grp1)
grp2_means <- rowMeans(grp2)

ratio_ranked <- grp1_means / grp2_means
ratio_ranked <- sort(ratio_ranked, decreasing = T)
# ratio_ranked

# Difference ranking (Recomended for natural scale data)
##############################################################################
grp1 <- test[,1:4]
grp2 <- test[,5:7]

grp1_means <- rowMeans(grp1)
grp2_means <- rowMeans(grp2)

difference_ranked <- grp1_means - grp2_means
difference_ranked <- sort(difference_ranked, decreasing = T)
# difference_ranked

# log2ratio ranking (Recomended for natural scale data)
##############################################################################
grp1 <- test[,1:4]
grp2 <- test[,5:7]

grp1_means <- rowMeans(grp1)
grp2_means <- rowMeans(grp2)

log2ratio_ranked <- log((grp1_means / grp2_means),2)
log2ratio_ranked <- sort(log2ratio_ranked, decreasing = T)
# log2ratio_ranked

##############################################################################
############### Use GeneSelector package for ranking #########################
##############################################################################

library(GeneSelector)
#> Loading required package: Biobase
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#> 
#>     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, parApply, parCapply, parLapply,
#>     parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind,
#>     colMeans, colnames, colSums, dirname, do.call, duplicated,
#>     eval, evalq, Filter, Find, get, grep, grepl, intersect,
#>     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
#>     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
#>     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
#>     table, tapply, union, unique, unsplit, which, which.max,
#>     which.min
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Warning: Package 'GeneSelector' is deprecated and will be removed from
#>   Bioconductor version 3.9

GeneSelector_tstat <- RankingTstat(x= as.matrix(test), y = factor(c(rep(1, times=4), rep(2, times=3))))
GeneSelector_tstat_res <- sort(GeneSelector_tstat@ranking, decreasing = F)

GeneSelector_welch <- RankingWelchT(x= as.matrix(test), y = factor(c(rep(1, times=4), rep(2, times=3))))
GeneSelector_welch_res <- sort(GeneSelector_welch@ranking, decreasing = F)

GeneSelector_difference <- RankingFC(x= as.matrix(test), y = factor(c(rep(1, times=4), rep(2, times=3))))
GeneSelector_difference_res <- sort(GeneSelector_difference@ranking, decreasing = F)

##############################################################################
############### Use mwt package for ranking ##################################
##############################################################################

library(mwt)
#> Loading required package: limma
#> 
#> Attaching package: 'limma'
#> The following object is masked from 'package:BiocGenerics':
#> 
#>     plotMA
#> Loading required package: OCplus

mwt_res <- mwt(as.matrix(test), grp = factor(c(1,1,1,1,2,2,2)))

##############################################################################
############### Use bws package for ranking ##################################
##############################################################################
library(BWStest)

bws_ranked <- numeric()

for(i in rownames(test)){

  samp <- as.numeric(test[i,1:4])
  ref <- as.numeric(test[i,5:7])

  bws_ranked[[i]] <- bws_stat(samp, ref)

}

##############################################################################
############################# Visualization ##################################
##############################################################################

# Compare results

res_list<-list(GeneSelector_difference = GeneSelector_difference@statistic, 
               GeneSelector_welch = GeneSelector_welch@statistic,
               GeneSelector_tstat = GeneSelector_tstat@statistic, 
               difference_ranked = difference_ranked, 
               ttest_ranked = ttest_ranked, 
               s2n_ranked= s2n_ranked,
               ratio_ranked = ratio_ranked, 
               log2ratio_ranked = log2ratio_ranked,
               mwt_res = mwt_res$MWT,
               bws_ranked = bws_ranked)

combined_res <- do.call(cbind.data.frame, res_list)

library(corrplot)
#> corrplot 0.84 loaded
library(RColorBrewer)
M <-cor(combined_res)
corrplot(M, type="upper", order="hclust",
         col=brewer.pal(n=10, name="RdYlBu"), method = "pie")


# Show ties in BWS statistics

sum(duplicated(bws_ranked))
#> [1] 87

Created on 2018-11-19 by the reprex package (v0.2.1)

shabbybanks commented 5 years ago

(best bug report ever, reprex FTW.)

Towards your first question, the test statistic depends only on the order of the data, and is invariant with respect to any monotone transform. (c.f. the help page for bws_stat https://github.com/shabbychef/BWStest/blob/master/man/bws_stat.Rd#L33-L36 ) If your test data only shows a limited number of arrangements, you will only see a few different values of the statistic. Your larger question will take a little more time to digest.

atakanekiz commented 5 years ago

That makes sense. In my case, it seems like I set up the bws test incorrectly. If I'm understanding right, ties are happening when two genes have the same expression profile as in the following example:

set.seed(10)

gene_num <- 100

test <- data.frame(sample1=c(rnorm(gene_num/2, mean=50, sd=10), rnorm(gene_num/2, mean=20, sd=10)),
                   sample2=c(rnorm(gene_num/2, mean=50, sd=10), rnorm(gene_num/2, mean=20, sd=10)),
                   sample3=c(rnorm(gene_num/2, mean=50, sd=10), rnorm(gene_num/2, mean=20, sd=10)),
                   sample_a4=c(rnorm(gene_num/2, mean=50, sd=10), rnorm(gene_num/2, mean=20, sd=10)),
                   reference_b1=c(rnorm(gene_num/2, mean=20, sd=10), rnorm(gene_num/2, mean=40, sd=10)),
                   reference_b2=c(rnorm(gene_num/2, mean=20, sd=10), rnorm(gene_num/2, mean=40, sd=10)),
                   reference_b3=c(rnorm(gene_num/2, mean=20, sd=10), rnorm(gene_num/2, mean=40, sd=10)))

genes_row <- character()

for(i in 1:gene_num){

  add_gene <- paste(sample(LETTERS,4), collapse = "")

  genes_row <- c(genes_row, add_gene)

}

rownames(test) <- genes_row

##############################################################################
############### Use bws package for ranking ##################################
##############################################################################
library(BWStest)

bws_ranked <- numeric()

for(i in rownames(test)){

  samp <- as.numeric(test[i,1:4])
  ref <- as.numeric(test[i,5:7])

  bws_ranked[[i]] <- bws_stat(samp, ref)

}

##############################################################################

bws_ranked[1:5] # Show first few results to see which ones are tied.
#>     KFYP     SCRH     BIYU     SCBN     KDIA 
#> 3.233424 3.233424 1.371321 3.233424 3.233424

test["KFYP",] # example ties
#>       sample1  sample2  sample3 sample_a4 reference_b1 reference_b2
#> KFYP 50.18746 42.38196 62.15514  65.02545     19.20543     28.69475
#>      reference_b3
#> KFYP      10.9803

test["SCRH",] # example ties
#>       sample1  sample2  sample3 sample_a4 reference_b1 reference_b2
#> SCRH 48.15747 54.19375 53.30876  55.90409     31.81752      13.1999
#>      reference_b3
#> SCRH      19.4525

Created on 2018-11-19 by the reprex package (v0.2.1)

In this example, genes KFYP and SCRH have similar expression profiles (high-high-high-high-low-low-low), resulting in a tie right?

My goal is a bit different though. I'd like to compare two groups and rank the genes ("high" expression in sample vs reference = high ranking). Would it be possible to use bws in that context? The article I posted in my initial post utilizes BWS somehow, but I'm not sure how.

Best, Atakan

atakanekiz commented 5 years ago

Found some relevant information about the paper I referenced in their github page. Unfortunately for me, it is in MATLAB language, which I have no experience with. Now, I just need to find a way of writing a similar piece code in R to make it work in my analytical pipeline. I'm looking forward to hear your thoughts. Thanks so much!

....
case 'BWS'
    ranks = zeros(1,m);
    ind0 = 1:n0;
    ind1 = 1:n1;
    for a=1:m
        rank_tmp = tiedrank(data(a,:));
        R1 = sort(rank_tmp(gr0));
        R2 = sort(rank_tmp(gr1));
        B0 = sum((R1 - ((n1+n0)/n0 * ind0)).^2./((ind0./(n0+1)) .* (1-(ind0/(n0+1))) * ((n1*(n1+n0))/n0)));
        B1 = sum((R2 - ((n1+n0)/n1 * ind1)).^2./((ind1./(n1+1)) .* (1-(ind1/(n1+1))) * ((n0*(n1+n0))/n1)));
        ranks(a) = (B0/n0 + B1/n1)/2;
    end
    tail = 'right';
....
shabbychef commented 5 years ago

The following code computes the "7 choose 4" possible outcomes the BWS test statistic can take for the case of sample sizes 4 and 3:

test <- setNames(data.frame(samples=t(rbind(combn(1:7,4),8 - combn(1:7,3)))),
            c(paste0('sample',1:3), 'sample_4a', paste0('reference_b',1:3)))

require(BWStest)
bws_ranked <- sapply(seq_len(nrow(test)),function(i) {
  samp <- as.numeric(test[i,1:4])
  ref <- as.numeric(test[i,5:7])
  bws_stat(samp, ref)
})

I get these values:

c(2.9526703042328, 1.79592427248677, 1.00227347883598, 0.313781415343915, 
0.921916335978836, 0.88322585978836, 0.512194113756614, 0.313781415343915, 
0.0588210978835979, 0.387194113756614, 0.282035383597884, 0.207630621693122, 
0.157035383597884, 0.231440145502646, 0.249297288359788, 0.532035383597884, 
0.13322585978836, 0.228463955026455, 0.227471891534392, 0.46854332010582, 
0.207630621693122, 0.217551256613757, 0.177868716931217, 0.24136078042328, 
0.726479828042328, 0.619336970899471, 0.227471891534392, 0.38620205026455, 
0.679852843915344, 1.0369957010582, 1.08560681216931, 1.26020998677249, 
1.26020998677249, 2.04592427248677, 3.23342427248677)

Some of these are duplicated as well. The unique values among them are

c(2.9526703042328, 1.79592427248677, 1.00227347883598, 0.31378141534392, 
0.92191633597884, 0.88322585978836, 0.51219411375661, 0.0588210978836, 
0.38719411375661, 0.28203538359788, 0.20763062169312, 0.15703538359788, 
0.23144014550265, 0.24929728835979, 0.53203538359788, 0.13322585978836, 
0.22846395502646, 0.22747189153439, 0.46854332010582, 0.21755125661376, 
0.17786871693122, 0.24136078042328, 0.72647982804233, 0.61933697089947, 
0.38620205026455, 0.67985284391534, 1.0369957010582, 1.08560681216931, 
1.26020998677249, 2.04592427248677, 3.23342427248677)
shabbychef commented 5 years ago

The code you have found in Matlab is computed the BWS tests statistic. The computation of B0 and B1 and ranks(a) is essentially done here in C++ code: https://github.com/shabbychef/BWStest/blob/master/src/bws_stat.cpp#L117-L122

The Matlab code is computing this statistic row-by-row. The only odd Matlab syntax I see here is the use of 'dot operators' for Hadamard operations. That is a .* b is Hadamard times, and a .^ b would be Hadmard (elementwise) a to the b.

To be sure, I took the Matlab code above, and adapted it to the first row of my example output above, where the samples take values 1 through 4 and the reference takes 5 through 7. The code is as follows:

R1 = 1:4;
R2 = 5:7;
n0 = length(R1);
n1 = length(R2);
ind0 = 1:n0;
ind1 = 1:n1;
B0 = sum((R1 - ((n1+n0)/n0 * ind0)).^2./((ind0./(n0+1)) .* (1-(ind0/(n0+1))) * ((n1*(n1+n0))/n0)));
B1 = sum((R2 - ((n1+n0)/n1 * ind1)).^2./((ind1./(n1+1)) .* (1-(ind1/(n1+1))) * ((n0*(n1+n0))/n1)));
ranksa = (B0/n0 + B1/n1)/2;

ranksa

I executed this with the online octave interpreter where it claimed that the value of the statistic was 2.9527, which matches the value computed above.

atakanekiz commented 5 years ago

Thanks for your comments. I distilled my question a bit more by using the following example. I simulated a data frame having 7 observations (4 sample, 3 reference) and 5 genes. Between the ref and the sample, the ordering is the same for each gene (ref1<2<3<samp1<2<3<4), but the magnitude of differences are different (10:1, 10:3, 10:5, 10:7, 10:9). I expect the genes changing at higher magnitudes to be ranked first. This ranking is important for my real world application where genes that are changing more than others will mean something important biologically. Even though the row-wise (gene-wise) BWS computes with no problem, the output isn't informative.

test <- data.frame(rbind(c(10.1, 10.2, 10.3, 10.4, 1.1, 1.2, 1.3),   # Should be ranked first (highest difference sample vs ref)
                         c(10.1, 10.2, 10.3, 10.4, 3.1, 3.2, 3.3),   # Should be ranked second
                         c(10.1, 10.2, 10.3, 10.4, 5.1, 5.2, 5.3),   # Should be ranked third
                         c(10.1, 10.2, 10.3, 10.4, 7.1, 7.2, 7.3),   # Should be ranked fourth
                         c(10.1, 10.2, 10.3, 10.4, 9.1, 9.2, 9.3)))  # Should be ranked last (lowest difference sample vs ref)

# Give mock names to the data frame
genes_row <- character()

for(i in 1:5){
  add_gene <- paste(sample(LETTERS,4), collapse = "")
  genes_row <- c(genes_row, add_gene)
}
rownames(test) <- genes_row

colnames(test) <- c(paste0("sample",1:4), paste0("ref",1:3))

pheatmap::pheatmap(test, cluster_rows = F, cluster_cols = F)


# Compute BWS statistics
require(BWStest)
#> Loading required package: BWStest
bws_ranked <- sapply(seq_len(nrow(test)),function(i) {
  samp <- as.numeric(test[i,1:4])
  ref <- as.numeric(test[i,5:7])
  bws_stat(samp, ref)
})
names(bws_ranked)<-rownames(test)

# All the statistics are the same
bws_ranked
#>     CRVX     RUAO     SNDE     CEIV     HUNZ 
#> 3.233424 3.233424 3.233424 3.233424 3.233424

Created on 2018-11-20 by the reprex package (v0.2.1)

shabbybanks commented 5 years ago

The BWS test is a test of equal probability distributions, not a test of location. Moreover, as noted in the documentation, and in the original paper, the test is invariant with respect to monotonic transforms of the data, so for example

require(BWStest)
set.seed(1234)
x <- 10 + runif(5)
y <- 5 + runif(5)
s1 <- bws_stat(x,y)
s2 <- bws_stat(log(x),log(y))
s3 <- bws_stat(exp(x),exp(y))
dput(c(s1,s2,s3))

I get

c(4.894, 4.894, 4.894)

But now note that a monotonic transform could 'scrunch up' the data between their split:

fnc <- function(x) { 10 - 1 / x }
fnc(x)
fnc(y)
bws_stat(fnc(x),fnc(y))
[1] 4.894

Perhaps you should be using a robust test for location?

atakanekiz commented 5 years ago

I'm trying various methods to see which one is the most appropriate one. For the mock example I gave above, a t-test would work just fine (especially with many observations, normality assumption may not be an issue either). The data I'm trying to analyze comes from single-cell RNA sequencing technology where many reads are zeros (see figure below) image

The published paper I included in my previous post found BWS test robust to this type of a problem in a different experimental setup (microarray), that's why I thought it might be appropriate for scSeq. I should dive a bit deeper to have a better understanding I think.