ocbe-uio / DIscBIO

A user-friendly R pipeline for biomarker discovery in single-cell transcriptomics
Other
12 stars 5 forks source link

Freezing on DEGanalysis #14

Closed wleoncio closed 4 years ago

wleoncio commented 4 years ago

Describe the bug

There are some circumstances that cause the DEGanalysis function to hang.

To Reproduce

  1. Download the pan_indrop_matrix_8000cells_18556genes dataset
  2. From the download directory, run R
  3. Run the following code in R (it may take a few minutes):
    
    library(DIscBIO)
    load("data/pan_indrop_matrix_8000cells_18556genes.rda")

==============================================================================

Determining contants

==============================================================================

n_genes <- 500 K <- 3

==============================================================================

Subsetting and formatting datasets

==============================================================================

sc_dataframe <- pan_indrop_matrix_8000cells_18556genes[, seq_len(n_genes)] sc <- DISCBIO(sc_dataframe)

==============================================================================

Performing operations

==============================================================================

MIinExp <- mean(rowMeans(sc_dataframe, na.rm=TRUE)) MinNumber <- round(length(sc_dataframe[1, ]) / 10) sc <- Normalizedata( sc, mintotal=1000, minexpr=MIinExp, minnumber=MinNumber, maxexpr=Inf, downsample=FALSE, dsn=1, rseed=17000 ) sc <- FinalPreprocessing(sc, GeneFlitering="ExpF", export=FALSE, quiet=TRUE) sc <- Clustexp(sc, cln=K, quiet=TRUE) sc <- comptSNE(sc, rseed=15555, quiet=TRUE)

This is the part that freezes

cdiff <- DEGanalysis( sc, Clustering="K-means", K=K, fdr=0.10, name="all_clusters", export=FALSE, quiet=FALSE, plot=FALSE, nresamp=5, nperms=10 )


4. Observe that the code freezes with the output of `DEGanalysis` being:

The dataset is ready for differential expression analysis[1] "Cl2" "Cl1" "Cl3" Number of comparisons: 6 Estimating sequencing depths... Resampling to get new data matrices... perm= 1 perm= 2 perm= 3 perm= 4 perm= 5 perm= 6 perm= 7 perm= 8 perm= 9 perm= 10 Number of thresholds chosen (all possible thresholds) = 1283 Getting all the cutoffs for the thresholds... Getting number of false positives in the permutation... 'select()' returned 1:many mapping between keys and columns Low-regulated genes in the Cl1 in Cl2 VS Cl1 'select()' returned 1:many mapping between keys and columns Up-regulated genes in the Cl1 in Cl2 VS Cl1 Estimating sequencing depths... Resampling to get new data matrices...


<kbd>Ctrl</kbd>+<kbd>C</kbd> doesn't quit the function, only killing R does the trick.

# Expected behavior
For a simpler set of parameters, for example `n_genes <- 100` and `K <- 2`, the code above ends with the following output:

Up-regulated genes in the Cl1 in Cl2 VS Cl1 Comparisons Target cluster Gene number File name Gene number File name 1 Cl2 VS Cl1 Cl1 477 Up-regulated-all_clustersCl1inCl2VSCl1.csv 941 Low-regulated-all_clustersCl1inCl2VSCl1.csv 2 Cl2 VS Cl1 Cl2 477 Low-regulated-all_clustersCl2inCl2VSCl1.csv 941 Up-regulated-all_clustersCl2inCl2VSCl1.csv

Moreover, the structure of `cdiff` is:

List of 2 $ : chr [1:1422, 1:2] "ENSG00000005022" "ENSG00000006327" "ENSG00000008394" "ENSG00000008517" ... ..- attr(*, "dimnames")=List of 2 .. ..$ : NULL .. ..$ : chr [1:2] "DEGsE" "DEGsS" $ :'data.frame': 2 obs. of 6 variables: ..$ Comparisons : chr [1:2] "Cl2 VS Cl1" "Cl2 VS Cl1" ..$ Target cluster: chr [1:2] "Cl1" "Cl2" ..$ Gene number : int [1:2] 477 477 ..$ File name : chr [1:2] "Up-regulated-all_clustersCl1inCl2VSCl1.csv" "Low-regulated-all_clustersCl2inCl2VSCl1.csv" ..$ Gene number : int [1:2] 941 941 ..$ File name : chr [1:2] "Low-regulated-all_clustersCl1inCl2VSCl1.csv" "Up-regulated-all_clustersCl2inCl2VSCl1.csv"



# Software metainformation
- Operating System: Ubuntu 18.04
- R version: 4.0.0
- DIscBIO version or commit number: 0.99.6
wleoncio commented 4 years ago

Fixed on caa84d9e48e6499ad41e51601aca9a2783b65ba8 (dev branch). Bug was caused by a Fortran-compiled function in the SAMR package. It was replaced by an equivalent (and much faster) R function. Closing.