rhondabacher / SCnorm

Normalization for single cell RNA-seq data

SCnorm function starts new processes across all cores #5

Open JakeHagen opened 7 years ago

JakeHagen commented 7 years ago

Hello, I was trying to use SCnorm to normalize a data set (3005 cells). When I called the SCnorm function, R started using multiple cores. When it crashed, it was using about 70 cores, each with about 2.5 GB of memory. Below is the sessionInfo output:

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] SCnorm_1.1.0        reshape2_1.4.2      moments_0.14       
 [4] cluster_2.0.5       quantreg_5.29       SparseM_1.74       
 [7] scater_1.2.0        ggplot2_2.2.1       Biobase_2.34.0     
[10] BiocGenerics_0.20.0

loaded via a namespace (and not attached):
 [1] tximport_1.2.0       beeswarm_0.2.3       locfit_1.5-9.1      
 [4] lattice_0.20-34      rhdf5_2.18.0         colorspace_1.3-0    
 [7] htmltools_0.3.5      stats4_3.3.1         chron_2.3-47        
[10] XML_3.98-1.5         DBI_0.5-1            matrixStats_0.51.0  
[13] plyr_1.8.4           stringr_1.1.0        zlibbioc_1.20.0     
[16] MatrixModels_0.4-1   munsell_0.4.3        gtable_0.2.0        
[19] IRanges_2.8.1        biomaRt_2.30.0       httpuv_1.3.3        
[22] vipor_0.4.4          AnnotationDbi_1.36.0 Rcpp_0.12.7         
[25] xtable_1.8-2         edgeR_3.16.2         scales_0.4.1        
[28] limma_3.30.3         S4Vectors_0.12.0     mime_0.5            
[31] gridExtra_2.2.1      rjson_0.2.15         digest_0.6.11       
[34] stringi_1.1.2        dplyr_0.5.0          shiny_0.14.2        
[37] grid_3.3.1           tools_3.3.1          bitops_1.0-6        
[40] magrittr_1.5         lazyeval_0.2.0       RCurl_1.95-4.8      
[43] tibble_1.2           RSQLite_1.0.0        Matrix_1.2-7.1      
[46] data.table_1.9.6     ggbeeswarm_0.5.3     shinydashboard_0.5.3
[49] assertthat_0.1       viridis_0.3.4        R6_2.2.0 
rhondabacher commented 7 years ago

Hi Jake,

Thanks for trying out the SCnorm package. Were there any error messages when it crashed? I can take a more detailed look if you don't mind sending me the data so I can reproduce what happened.

JakeHagen commented 7 years ago

Hi Rhonda,

It doesn't actually crash; it just takes up all available resources. For example, when I run it on my laptop, it uses one to eight cores at a time and usually all of the RAM (16 GB) and swap (12 GB). When I ran it on our server, it was using 70 cores and 256 GB of RAM before I had to kill it.

Since I can let it run on the laptop for longer, I noticed it uses the cores and RAM in cycles. Sometimes it maxes out all processors and RAM, and sometimes it uses a single processor and about 6 GB of RAM.

I also tried testing with a cut-down dataset, with no luck.

The link to the dataset is: https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE60361&format=filefile=GSE60361%5FC1%2D3005%2DExpression%2Etxt%2Egz

Sorry I can't provide anything else at this time; let me know if you would like me to test anything.

Thanks, Jake

rhondabacher commented 7 years ago

Hi Jake,

Just sending an update that I found the area causing the memory issue. I'm doing a bit of testing to make sure it's fixed and runs smoothly. I'll send another message when the update is available to download.

Thanks, Rhonda

rhondabacher commented 7 years ago

Hi Jake,

I have updated SCnorm. The memory issue was caused by many ties going into quantreg(). Some minor changes resolved it and sped things up a bit.

I have also added a few suggestions for UMI data to the vignette. The first is that you may consider basing the group regression on a smaller proportion of genes. The default is 25%, but reducing this to 10% can improve the speed; since genes are chosen as those closest to the group mode, the results should not change much.

After normalization, K is chosen based on a threshold on the normalized count-depth relationship. The default is .1, but it can now be lowered if the K evaluation plot is not quite close enough to zero. I tried .05 for your dataset and it worked well.

I've tested these out quite a bit and have not run into any errors; however, please let me know if you encounter any. Thanks again!
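
For illustration, a minimal sketch of a call using these settings for UMI data (the object names counts and conditions are placeholders standing in for your own expression matrix and condition labels):

library(SCnorm)

# counts: genes-by-cells matrix of UMI counts; conditions: one label per cell
# (placeholder names, for illustration only)
DataNorm <- SCnorm(counts, conditions,
                   PropToUse = 0.1,      # base the group regression on the 10% of genes nearest the group mode (default 0.25)
                   Thresh = 0.05,        # threshold on the normalized count-depth relationship used to choose K (default 0.1)
                   ditherCounts = TRUE)  # jitter counts to break the many ties in UMI data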

JakeHagen commented 7 years ago

Hi Rhonda

This is great. I should be able to test the update out sometime this week. Thank you

ShanSabri commented 7 years ago

I'm running into a similar issue. I'm running locally (16 GB) and the command DataNorm <- SCnorm(filtered.data, dbscan.plot.data$Timepoint, OutputName = "SCnorm.data", FilterCellNum = 10, PropToUse = .1, Thresh = .05, ditherCounts = TRUE) maxes out my RAM. Is there a more efficient way to run this line of code?

EDIT: filtered.data is a data.frame with 10k cells and 16k genes. This is UMI-based data.

rhondabacher commented 7 years ago

Hi Shan,

Thanks for using SCnorm. While running, did it get to the point of trying various values of K (there would be messages from SCnorm, e.g. "Trying K = 1")?

One thing I would try is setting the filter on non-zero median expression; you might try FilterExpression = 2. That might give some idea of whether the low expressors, with so many tied counts, are causing the issue. If running locally, you may also want to set the value of NCores explicitly; the default is your total number of cores minus 1.
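
For illustration, a minimal sketch of these two suggestions (counts and conditions again stand in for your own objects; the values are examples, not prescriptions):

library(SCnorm)

# FilterExpression filters genes on their non-zero median expression.
# NCores defaults to the total number of cores minus one, so set it
# explicitly when running locally or on a shared server.
DataNorm <- SCnorm(counts, conditions,
                   FilterExpression = 2,
                   NCores = 4)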

ShanSabri commented 7 years ago

Hi Rhonda,

Yes, SCnorm would output the various values of K that it was trying. I have time-series data, so I've set my categories to the timepoint labels (5 timepoints in this case). It managed to run through the first timepoint up to K = 5, at which point I killed the process because the runtime was already past 2 hours.

I will try your suggestions now using the following command: DataNorm <- SCnorm(filtered.data, dbscan.plot.data$Timepoint, OutputName = "SCnorm.data", FilterCellNum = 10, FilterExpression = 2, PropToUse = .1, Thresh = .05, ditherCounts = TRUE, NCores = 5). I will report back with results.

Thank you, Shan

EDIT/UPDATE: Since posting (~2 hours ago), SCnorm has only processed up to my second timepoint:

Jittering values introduces some randomness, for reproducibility set.seed(1) has been set.
Gene filter is applied within each condition.
4547 genes were not included in the normalization due to having less than 10 non-zero values.
4101 genes were not included in the normalization due to having less than 10 non-zero values.
4028 genes were not included in the normalization due to having less than 10 non-zero values.
3855 genes were not included in the normalization due to having less than 10 non-zero values.
3045 genes were not included in the normalization due to having less than 10 non-zero values.
A list of these genes can be accessed in output, see vignette for example.
Finding K for Condition Day 0
Trying K = 1
Trying K = 2
Trying K = 3
Trying K = 4
Trying K = 5
Trying K = 6
Trying K = 7
Trying K = 8
Finding K for Condition Day 3
Trying K = 1
Trying K = 2
Trying K = 3
Trying K = 4
Trying K = 5
Trying K = 6
Trying K = 7

Would it be more efficient to have fewer conditions? Treat all timepoints as one condition? Or filter more aggressively?

rhondabacher commented 7 years ago

Hi Shan,

Sorry I didn't reply to this! I didn't get an email for the Edit/Update.

SCnorm normalizes each condition separately and then performs across-condition scaling, so the run-time depends on the number of K searches (equal to the number of conditions) and the time for each K search. For comparison, running SCnorm with 4 cores on a dataset of 180 cells, two conditions (two K searches), and about 16k genes takes about 15 minutes.

Reducing the number of conditions might reduce the run-time since there are fewer K searches. Alternatively, to reduce the time for each K search, you might consider increasing the filters, for example the expression filter, although it's hard to give an exact guideline since the value you choose will depend on the distribution of counts in your data.
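
For illustration, a rough sketch of the first option, collapsing all timepoints into a single condition so that only one K search is run (this reuses the filtered.data object and settings from the earlier command; the condition label itself is arbitrary):

# One condition label per cell means a single K search instead of five
one.condition <- rep("AllTimepoints", ncol(filtered.data))
DataNorm <- SCnorm(filtered.data, one.condition,
                   FilterCellNum = 10, FilterExpression = 2,
                   PropToUse = .1, Thresh = .05,
                   ditherCounts = TRUE, NCores = 5)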

I'm currently investigating ways to increase the efficiency and reduce the run-time of SCnorm. I don't have a general solution implemented at this time but will leave this issue open as it is under development and a main priority.

Thanks, Rhonda

rhondabacher commented 7 years ago

Hi Shan,

I improved a number of aspects of the package that should improve the speed, and I have updated it to v1.1.3. You might try this version.

Thanks! Rhonda

guangxujin commented 7 years ago

I ran into the same issue when running SCnorm and solved it by using the new parameters suggested by Shan. The new command is:

n = 2
DataNorm <- SCnorm(Data = ExampleSimSCData, Conditions = Conditions, PrintProgressPlots = TRUE, FilterCellNum = 10, FilterExpression = 1, PropToUse = .1, Thresh = .05, ditherCounts = TRUE, NCores = n)

The issue of trying more Ks may be caused by the low expression. With FilterExpression = 2, 14 Ks were tried, with a warning:

Finding K for Condition 1
Trying K = 1
Trying K = 2
Trying K = 3
Trying K = 4
Trying K = 5
Trying K = 6
Trying K = 7
Trying K = 8
Trying K = 9
Trying K = 10
Trying K = 11
Trying K = 12
Trying K = 13
Trying K = 14
Done!
Warning message:
In SCnorm(Data = ExampleSimSCData, Conditions = Conditions, PrintProgressPlots = TRUE, : At least one cell/sample has less than 10,000 counts total. Check the quality of your data or filtering criteria. SCnorm may not be appropriate for your data (see vignette for details).

whereas with FilterExpression = 1, only 3 Ks were tried:

Finding K for Condition 1
Trying K = 1
Trying K = 2
Trying K = 3
Done!
Warning message:
In SCnorm(Data = ExampleSimSCData, Conditions = Conditions, PrintProgressPlots = TRUE, : At least one cell/sample has less than 10,000 counts total. Check the quality of your data or filtering criteria. SCnorm may not be appropriate for your data (see vignette for details).

It seems clear that the memory issue comes from the samples with lower expression levels.

Thanks for the great help.

Guangxu

rhondabacher commented 7 years ago

Thank you for your feedback on this! I will add these notes to the vignette FAQ so that this information is readily available.

-Rhonda