sandhya212 / BISCUIT_SingleCell_IMM_ICML_2016

R Codebase for BISCUIT: Infinite Mixture Model to cluster and impute single cells.

Calculating the Fiedler vector of the data (Error: cannot allocate vector of size 7.2 Gb) #18

Closed mdurante1 closed 5 years ago

mdurante1 commented 6 years ago

Hello,

I am attempting a run of the BISCUIT algorithm on a gene × cell matrix, obtained from Seurat, of dimensions ~40,000 × ~60,000. I am running this through the BISCUIT Docker installation on Ubuntu 18.04, on a 32-core instance with 256 GB of RAM. After the load-data step runs for ~24 hours, I receive the error below. It seems as if I need more RAM; the test data runs with no issues. Should I just increase the RAM of the instance, or do you have any other suggestions to resolve this error?
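For reference, here is a rough back-of-envelope estimate of the dense-matrix sizes involved (`mat_gb` is a hypothetical helper of mine, assuming 8-byte doubles; actual peak usage will be several times higher once eigendecomposition temporaries are counted):

```r
# Hypothetical helper: size (in GiB) of a dense double-precision matrix.
mat_gb <- function(nrow, ncol) nrow * ncol * 8 / 1024^3

mat_gb(40000, 40000)  # ~11.9 GiB for a 40K x 40K gene-gene covariance matrix
mat_gb(40000, 60000)  # ~17.9 GiB for the raw ~40K x ~60K expression matrix
```
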

Thanks, Michael


Attaching package: 'snow'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, clusterSplit, makeCluster, parApply,
    parCapply, parLapply, parRapply, parSapply, splitIndices,
    stopCluster

Attaching package: 'bayesm'

The following object is masked from 'package:gtools':

    rdirichlet

The following object is masked from 'package:MCMCpack':

    rdirichlet

Attaching package: 'chron'

The following object is masked from 'package:foreach':

    times

[1] "Loading Data"
[1] "Calculating the Fiedler vector of the data"
Error: cannot allocate vector of size 7.2 Gb

UPDATE: I re-ran this matrix with the following settings (mostly default) and was able to get past this initial error:

input_data_tab_delimited <- TRUE; #set to TRUE if the input data is tab-delimited

is_format_genes_cells <-  TRUE; #set to TRUE if input data has rows as genes and columns as cells

choose_cells <- 3000; #comment if you want all the cells to be considered

choose_genes <- 150; #comment if you want all the genes to be considered

gene_batch <- 50; #number of genes per batch, therefore num_batches = choose_genes (or numgenes)/gene_batch. Max value is 150

num_iter <- 20; #number of iterations, choose based on data size.

num_cores <- detectCores() - 4; #number of cores for parallel processing. Ensure that detectCores() > 1 for parallel processing to work, else set num_cores to 1.

z_true_labels_avl <- FALSE; #set this to TRUE if the true labels of cells are available, else set it to FALSE. If TRUE, ensure to populate 'z_true' with the true labels in 'BISCUIT_process_data.R'

num_cells_batch <- 1000; #set this to 1000 if input number of cells is in the 1000s, else set it to 100.

alpha <- 1; #DPMM dispersion parameter. A higher value spins up more clusters whereas a lower value spins up fewer clusters.

output_folder_name <- "output"; #give a name for your output folder.

I then received a different error:

[1] "Loading Data"
[1] "Calculating the Fiedler vector of the data"
[1] "Ensuring entire data is numeric and then log transforming it"
[1] "numcells is 3000"
[1] "numgenes is 150"
[1] "Number of gene batches is 3"
[1] "Number of gene subbatches is 3"
[1] "Ensuring user-specified data is numeric"
[1] "Computing t-sne projection of the data"
[1] "Monitor log.txt and outputs/plots/ folder for outputs"
[1] "floor(num_gene_batches/num_gene_sub_batches): 1"
[1] "MCMC begins"
[1] "Begin parallel processing of gene splits"
[1] "Beginning of batch  1"
Error in serialize(data, node$con) : error writing to connection

Do you have any suggestions to resolve this issue? Do you have any recommendations for settings so that I can include all cells and genes in the matrix (~40,000 X ~60,000) for the analysis?

sandhya212 commented 6 years ago

Hi Michael,

In this updated case, are you running this on a fresh Ubuntu instance? Given that you are only running BISCUIT on 3K cells and 150 genes, you should not be hitting memory issues.

For the 40K-gene run: the first step in BISCUIT is computing the Fiedler vector, which operates on the 40K × 40K covariance matrix. You can give the instance more RAM for just this step and then revert to a smaller instance to continue the run. Otherwise, you can use the standard deviation method to order genes; that code is already in BISCUIT, just commented out. Let me know which option you prefer and I can point you to where it is in the code.
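For intuition, the Fiedler vector is the eigenvector associated with the second-smallest eigenvalue of the graph Laplacian built from a similarity matrix. A minimal sketch (the function name is hypothetical, and BISCUIT's own implementation may construct the similarity matrix differently):

```r
# Illustrative sketch only; BISCUIT's implementation may differ.
fiedler_vector <- function(S) {
  # S: symmetric, non-negative similarity matrix
  L <- diag(rowSums(S)) - S        # unnormalised graph Laplacian
  e <- eigen(L, symmetric = TRUE)  # eigenvalues returned in decreasing order
  e$vectors[, ncol(S) - 1]         # eigenvector of the 2nd-smallest eigenvalue
}
```

Sorting genes by their entries in this vector gives the spectral ordering; for a dense 40K × 40K Laplacian, the `eigen()` call alone is what drives the memory blow-up.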

Which Docker installation is this?

mdurante1 commented 6 years ago

No, I re-ran this on the same Ubuntu instance as before and only changed the number of cells and genes (for reference, this updated analysis used 80-90 GB of RAM). Do you have any suggestions for the "serialize" problem? A brief Google search shows other packages hitting this issue when relying on the 'foreach' and 'parallel' packages. I will try re-running this updated case with 1 core to see if it yields the same error. I would like to resolve this error before running the 40K-gene case, so any other troubleshooting suggestions would be greatly appreciated.
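As a sanity check, I can confirm the parallel backend itself round-trips data on this instance (a minimal diagnostic of mine, not BISCUIT code):

```r
# Minimal diagnostic: check that the cluster can serialize work out to
# workers and results back (the same machinery behind the serialize error).
library(parallel)
cl <- makeCluster(2)
res <- parLapply(cl, 1:4, function(i) i^2)
stopCluster(cl)
unlist(res)  # 1 4 9 16
```
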

I will initially attempt to run the 40K genes using more RAM. It would also be great to run, in tandem, the "standard deviation method" to order genes. Can you please point me to where I can find this code?

I believe I am running the latest Docker installation with the following command

docker run -it stevetsa/biscuit:latest

Thank you for your input; it is greatly appreciated.

sandhya212 commented 6 years ago

You would need to re-run on a fresh instance so that all the used/unused parallel threads can be flushed out. This will definitely help.

For the sd method: In BISCUIT_process_data.R, uncomment lines 92-93 and comment out lines 112-123.
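Conceptually, the sd method amounts to something like the following (an illustrative sketch with a hypothetical function name; the exact code in BISCUIT_process_data.R may differ):

```r
# Illustrative sketch: order genes by per-gene standard deviation,
# most variable first. X is a genes x cells matrix.
order_genes_by_sd <- function(X) {
  gene_sd <- apply(X, 1, sd)
  X[order(gene_sd, decreasing = TRUE), , drop = FALSE]
}
```

This avoids the 40K × 40K eigendecomposition entirely, which is why it is the cheaper route for very large gene counts.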

For the Docker runs, could you please DM me? I will put you in touch with the right person.

mdurante1 commented 6 years ago

I have DM'd you on twitter. Thank you for your continued help in troubleshooting these issues.

sandhya212 commented 6 years ago

Did not get the DM :-)

mdurante1 commented 6 years ago

Hmm, I just tried again; my Twitter handle is @michaeldurante1. If that doesn't work, is there a better way to DM you?

mdurante1 commented 6 years ago

Additionally, I created a fresh instance, did a clean install using Docker, and ran the 40,000 gene × 60,000 cell matrix on a larger-memory instance. It ran for 6 days and then I received the following error:

[1] "Loading Data"
[1] "Calculating the Fiedler vector of the data"
[1] "Ensuring entire data is numeric and then log transforming it"
[1] "numcells is 56615"
[1] "numgenes is 33694"
[1] "Number of gene batches is 224"
[1] "Number of gene subbatches is 8"
[1] "Ensuring user-specified data is numeric"
[1] "Computing t-sne projection of the data"

 *** caught segfault ***
address (nil), cause 'memory not mapped'
Segmentation fault (core dumped)

When running this same 40K gene data set on our local compute cluster using a non-Docker install I received the following error:

 *** caught segfault ***
address 0xffffffffb162bb28, cause 'memory not mapped'
*** glibc detected *** /share/opt/R/3.3.1/lib64/R/bin/exec/R: free(): corrupted unsorted chunks: 0x000000041e26c830 ***

In both of these runs there was plenty of RAM, so that should not be an issue. The run does get through the "Calculating the Fiedler vector" step, so I wouldn't think moving to the standard deviation method would have an impact on resolving this error. Can you provide any further guidance on how to resolve this issue?

FYI, the 3K-cell dataset I mentioned previously is still running (1 core) and I have not received any errors thus far. It made it past the "parallel processing of gene splits" step, so there seems to be an issue allocating cores in my previous run where I used more cores. It has been on this step for ~2 days; is this to be expected considering I'm only running 1 core?

[1] "Loading Data"
[1] "Calculating the Fiedler vector of the data"
[1] "Ensuring entire data is numeric and then log transforming it"
[1] "numcells is 3000"
[1] "numgenes is 150"
[1] "Number of gene batches is 3"
[1] "Number of gene subbatches is 3"
[1] "Ensuring user-specified data is numeric"
[1] "Computing t-sne projection of the data"
[1] "Monitor log.txt and outputs/plots/ folder for outputs"
[1] "floor(num_gene_batches/num_gene_sub_batches): 1"
[1] "MCMC begins"
[1] "Begin parallel processing of gene splits"
[1] "Beginning of batch  1"
sandhya212 commented 6 years ago

a) The 3K-cell dataset running for 2 days should not happen. We have always run on multiple cores so that the parallel processing kicks in.

b) Segmentation faults are generally due to memory-mapping failures, but it is hard to detect exactly where this happened; the error is thrown by the OS. The reason we construct a Fiedler vector or compute standard deviations on the genes is to obtain a gene ordering. In your case, if you want to take in all the genes, you can also consider overriding this gene-selection step completely.

mdurante1 commented 6 years ago

Do you have an Amazon Machine Image (or Azure Virtual Machine Image) or recommended install parameters that you have tested with datasets of 40,000 genes × 60,000 cells or larger? I have access to multiple compute resources for getting my dataset to run successfully. I am happy to help debug and test the issues I have been encountering, but I would like to compare against a setup on which you have had prior success running large datasets.

I also want to make sure my input data and start.file options are correct

I have set:

z_true_labels_avl <- FALSE;

and my data table is formatted as such:

(tab)Cell_1(tab)Cell_2(tab)...
Gene_1(tab)1.5(tab)3.3(tab)...
Gene_2(tab)7.2(tab)7.5(tab)...
...
PulverCyril commented 5 years ago

Hello, I'm posting in this thread because I get the same error as mdurante when running BISCUIT on a subsampled count matrix of 500 cells x 500 genes:

#######################BISCUIT###########################
source("start_file.R")
[1] "Loading Data"
[1] "Calculating the Fiedler vector of the data"
[1] "Ensuring entire data is numeric and then log transforming it"
[1] "numcells is 500"
[1] "numgenes is 150"
[1] "Number of gene batches is 3"
[1] "Number of gene subbatches is 3"
[1] "Ensuring user-specified data is numeric"
[1] "Computing t-sne projection of the data"
[1] "Monitor log.txt and outputs/plots/ folder for outputs"
[1] "floor(num_gene_batches/num_gene_sub_batches): 1"
[1] "MCMC begins"
[1] "Begin parallel processing of gene splits"
[1] "Beginning of batch 1"
Error in unserialize(socklist[[n]]) : error reading from connection

although BISCUIT runs like a charm on a subsampled count matrix of 100 cells × 100 genes, and on one of 500 cells × 150 genes.

It looks like any number of genes significantly above 150 in the subsampled count matrix triggers this error.

sandhya212 commented 5 years ago

This looks like insufficient cores were allocated for your run. Could you please mention where you ran this, how many cores you have, etc.? We have run BISCUIT on more than 150 genes without issues.

PulverCyril commented 5 years ago

I used 7 cores (on my personal computer). If you confirm that is too few, I will update this thread once I've set up BISCUIT on my school's cluster. Thanks!

sandhya212 commented 5 years ago

Were you able to run it on your cluster with more cores?

melvinchin commented 5 years ago

I'm wondering if bopekno and mdurante1 have sorted out the problem; I too am getting the same error. I'm using a 16-core instance with 48 GB of RAM, and the example dataset runs for 5-6 days without producing any output. It does not appear to progress past the "Beginning of batch 1" line, as mdurante described.

sandhya212 commented 5 years ago

I have not heard from bopekno but we are in touch with mdurante1. What are your dataset dimensions?

melvinchin commented 5 years ago

Thanks for the rapid response!

I am using the example dataset mentioned in the package, so 3K cells with approximately 17K genes. From the previous comments above, it shouldn't take that long to process, as you've mentioned.

One thing I have noticed is that when I look in the “Inferred labels per step per batch” folder, some batches do not have all their steps completed after parallelisation has kicked off.


sandhya212 commented 5 years ago

Yes, it should not run that long, nor leave you with unpopulated folders. Where are you running this, and what are the machine/memory specifications?