Matlab crash while glmnetMex.mexmaci64 was running

ElenaMerinoTejero commented 2 years ago

I am trying to run SINGE_Example.m in MATLAB R2020a on macOS Catalina.

ver -support

MATLAB Version: 9.8.0.1873465 (R2020a) Update 8
MATLAB License Number: 40707400
Operating System: Mac OS X Version: 10.15.7 Build: 19H1824
Java Version: Java 1.8.0_202-b08 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode

MATLAB Version 9.8 (R2020a) License 40707400

I get the following Warning message:

Warning: from glmnet Fortran code (error code -5); Convergence for 5th lambda value not reached after maxit=10000 iterations; solutions for larger lambdas returned

In elnet (line 33)
In glmnet (line 443)
In iLasso_for_SINGE (line 111)
In run_iLasso_row (line 27)
In SINGE_GLG_Test (line 79)
In SINGE (line 20)
In SINGE_Example (line 16)

After several iterations MATLAB crashes. According to MathWorks technical support, the crash was detected while the MEX-file glmnetMex.mexmaci64 was running. Do you have any suggestions for solving this issue?

agitter commented 2 years ago

Thanks for letting us know @ElenaMerinoTejero. Can you please tell us which version of glmnet you are using? I see that https://hastie.su.domains/glmnet_matlab/download.html now lists glmnet_matlab.zip as well as glmnet_matlab_new.zip.

atuldeshpande commented 2 years ago

Thanks for bringing this to our notice @ElenaMerinoTejero.

Could you please also share more details about the hyperparameters file you are using to run SINGE, as well as the size of the data matrix in number of genes and cells? If you have access to Docker, can you also try running SINGE from its Docker implementation? There is an ongoing issue with a potential memory leak in the glmnetMex code, which causes crashes for larger datasets. We also noticed that this issue is more frequent with the "poisson" distribution option than with the "gaussian" distribution. The following general strategies may help, but with additional information I could give you more targeted suggestions.

Reducing the number of potential regulators to make the glmnet input size smaller. This can be achieved either by retaining only the important genes in the dataset, or by using the regix argument, where you specify the indices of the genes to be tested as regulators (please see https://github.com/gitter-lab/SINGE/blob/master/data1/X_regix_test.mat for an example).
Increasing the values of --prob-zero-removal 0 --prob-remove-sample 0.2 to further reduce the glmnet input size. Especially for sparse data sets, we observed that prob-zero-removal does not greatly impact SINGE performance.
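
As a rough illustration of the first strategy, here is a minimal Python sketch that picks a subset of genes to use as candidate regulators. Ranking genes by variance is only an assumed stand-in for "important genes", and the function name is hypothetical. The 1-based indices are an assumption based on MATLAB indexing, so the layout SINGE expects for regix should be checked against the X_regix_test.mat example.

```python
from statistics import pvariance


def top_variance_indices(expression, n_regulators):
    """Return 1-based indices of the most variable genes, in ascending order.

    expression: list of per-gene value lists (genes x cells).
    Assumption: variance is used here only as an illustrative importance score.
    """
    variances = [pvariance(row) for row in expression]
    # Rank gene positions from most to least variable.
    ranked = sorted(range(len(variances)), key=lambda i: variances[i], reverse=True)
    # Convert to 1-based indices (MATLAB convention, assumed for regix).
    return sorted(i + 1 for i in ranked[:n_regulators])


if __name__ == "__main__":
    toy = [[1, 1, 1], [0, 5, 10], [2, 3, 2]]  # three genes, three cells
    print(top_variance_indices(toy, 2))
```

The resulting index list would still have to be saved in whatever form the regix argument requires.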

ElenaMerinoTejero commented 2 years ago

Thank you for the fast reply! I am using the updated glmnet version, which I think is glmnet_matlab_new.zip.

ElenaMerinoTejero commented 2 years ago

Thank you too for the fast reply.

For now I am trying to run SINGE_Example.m, which takes its hyperparameters from 'default_hyperparameters.txt', its data from 'data1/X_SCODE_data.mat' (356 cells), and its gene list from 'data1/gene_list.mat' (100 genes). Furthermore, the example uses the "gaussian" distribution.

default_hyperparameters.txt

Nevertheless, I am planning to run SINGE with a larger dataset (33694 genes and 737280 single cells), so I can use your suggestions then, thanks.

atuldeshpande commented 2 years ago

That's actually one of the most stable test cases we have run. Would it be possible for you to test the docker implementation at https://hub.docker.com/r/agitter/singe? This would remove the OS and the Matlab version as variables and help us better diagnose if the problem still persists.

Regarding the larger dataset: I would strongly advise limiting the genes to a much smaller number, and potentially also subsampling the cells at a much higher rate. I understand you are currently trying SINGE out on a personal computing device, but for the larger datasets, you would also want to use a high throughput computing server to speed up the analysis.

ElenaMerinoTejero commented 2 years ago

Thanks for the suggestion! I am running the docker implementation and it doesn't crash now and produces output files. Nevertheless, the Warning message persists. I am unsure if it affects the output. Is there a way to compare my output with the expected one for the SINGE_Example?

In any case, I will take your advice about reducing the data set and trying it out on a server for higher speed.

agitter commented 2 years ago

We have formal test cases you can use to confirm the SINGE_Example output matches the expected output. However, they use a smaller set of hyperparameters so that they run quickly on GitHub Actions. You can change the hyperparameters to tests/example_hyperparameters.txt.

Then, the output files should match those in the directory https://github.com/gitter-lab/SINGE/tree/master/tests/reference/latest. You can start by comparing the SINGE_Gene_Influence.txt and SINGE_Ranked_Edge_List.txt files you generate versus those stored in the repository. If those match, you can trust SINGE_Example.m is running correctly. If you want to test in more detail, I can give you instructions for running our Python code that will compare the entire adjacency matrices.
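
For the first-pass comparison described above, a minimal Python sketch could look like the following. It is not the project's own comparison code. The two file names come from this thread, while the function name and the directory arguments are hypothetical. It assumes an exact line-by-line match is the right criterion, which may be too strict if numeric formatting differs between platforms.

```python
from pathlib import Path

# File names mentioned in this thread; the directories passed in are hypothetical.
OUTPUT_FILES = ("SINGE_Gene_Influence.txt", "SINGE_Ranked_Edge_List.txt")


def compare_outputs(generated_dir, reference_dir, names=OUTPUT_FILES):
    """Return {file name: bool} saying whether each generated file has the
    same lines as its reference copy (line endings are ignored)."""
    results = {}
    for name in names:
        generated = Path(generated_dir) / name
        reference = Path(reference_dir) / name
        if not generated.is_file() or not reference.is_file():
            results[name] = False  # a missing file counts as a mismatch
            continue
        results[name] = (
            generated.read_text().splitlines()
            == reference.read_text().splitlines()
        )
    return results
```

A call such as `compare_outputs("Output", "tests/reference/latest")` would then report, per file, whether the generated copy agrees with the stored reference, assuming those are the directories in use.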

The most relevant test script, which you don't have to run but may be a useful reference, is https://github.com/gitter-lab/SINGE/blob/master/tests/standalone_test.sh

ElenaMerinoTejero commented 2 years ago

I ran SINGE_Example from Docker with the hyperparameters from the tests folder, as @agitter indicated, and the output files indeed match those in /tests/reference/latest. Furthermore, no warning message appeared this time, so I can trust that SINGE is running correctly. Thanks a lot for the help.

agitter commented 2 years ago

That's great! We can keep this issue open if you'd like to discuss strategies for running SINGE in parallel on a cluster as you scale up to your full dataset. That is a larger dataset than any we've tested on previously, so we're happy to help come up with strategies.

@atuldeshpande we should also separately follow up on whether glmnet_matlab_new.zip causes problems with the example dataset.

ElenaMerinoTejero commented 2 years ago

Hi @agitter, I reduced the dataset by selecting a particular cell type. The dataset now has 433 single cells and a mean of 157 genes per cell. When I run this data set with tests/example_hyperparameters.txt, the corresponding adjacency matrices are written, but the ranked edge list and gene influence files are not. Furthermore, SINGE is killed when running run_SINGE_Aggregate.sh:

/usr/local/SINGE/run_SINGE_Aggregate.sh: line 30: 32 Killed "/usr/local/SINGE/SINGE_Aggregate" "GSE142016_RAW/SLE1/X_SCODE_data.mat" "GSE142016_RAW/SLE1/gene_list.mat" "Output"

Any clues as to why this may happen? Could it be a size problem?

BTW: It would also be helpful to discuss how to run SINGE in parallel on a cluster since I would like to run bigger data sets and with default hyperparameters.

ElenaMerinoTejero commented 2 years ago

About the reduced data set size: There are 9082 unique genes, thus the size of the resulting X data matrix is 433x9082.

agitter commented 2 years ago

Are you still running SINGE from the Docker container? If you were able to generate adjacency matrices successfully, you should now be able to run SINGE.sh in Aggregate mode to generate the edge list and gene influence files. I have not previously seen SINGE fail at this stage with the behavior you described, so we'll have to help you debug this problem.

One idea would be to copy a small number, perhaps 2-4, of the adjacency matrices to a new directory for testing. If those can be aggregated successfully, then it may indicate the dataset size is an issue. If that still fails, you could zip those adjacency matrices and the input .mat files so we could try reproducing the issue in Docker.
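
The copy step for this test could be sketched in Python as below. The function name, the `*.mat` pattern, and any directory names passed in are assumptions for illustration, since the thread does not state how the adjacency matrix files are named.

```python
import shutil
from pathlib import Path


def copy_subset(source_dir, test_dir, count=2, pattern="*.mat"):
    """Copy the first `count` files matching `pattern` (sorted by name)
    from source_dir into test_dir and return the copied paths.

    Assumption: the adjacency matrices are the files matching `pattern`.
    """
    test_dir = Path(test_dir)
    test_dir.mkdir(parents=True, exist_ok=True)
    selected = sorted(Path(source_dir).glob(pattern))[:count]
    return [Path(shutil.copy2(path, test_dir / path.name)) for path in selected]
```

The aggregate step can then be pointed at the smaller directory without touching the original outputs.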

We have an example of how we ran SINGE on a cluster using HTCondor in this directory of our supplemental repository. The basic idea is that instead of creating a single hyperparameters file and running SINGE with all hyperparameter combinations in a single batch, each combination is split into a separate job. Those jobs can be parallelized over different nodes in the cluster. Then, after all jobs complete, the SINGE aggregate step can run. We can work through the details with you depending on your cluster setup and whether you will be using Docker or running MATLAB directly.
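
The splitting idea can be sketched in Python as follows. This is not the script from the supplemental repository. It assumes that each non-empty line of the hyperparameters file holds one combination, and the function name and output file naming are hypothetical.

```python
from pathlib import Path


def split_hyperparameters(hyperparameter_file, job_dir):
    """Write each non-empty line of hyperparameter_file to its own file in
    job_dir and return the new paths in order.

    Assumption: one line of the file holds one hyperparameter combination.
    """
    job_dir = Path(job_dir)
    job_dir.mkdir(parents=True, exist_ok=True)
    lines = [
        line.strip()
        for line in Path(hyperparameter_file).read_text().splitlines()
        if line.strip()
    ]
    job_files = []
    for index, line in enumerate(lines, start=1):
        # Hypothetical naming scheme: one small file per cluster job.
        job_file = job_dir / f"hyperparameters_{index:04d}.txt"
        job_file.write_text(line + "\n")
        job_files.append(job_file)
    return job_files
```

Each resulting file could then be handed to a separate cluster job, with the aggregate step submitted only after all of those jobs have finished.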

atuldeshpande commented 2 years ago

In addition, the MATLAB crashes are usually an issue only for the first part of SINGE, which requires glmnet. Since you have already successfully navigated that part, you can try running SINGE aggregate through the MATLAB functions. (I wonder if the Aggregate step trying to load 9000x9000 matrices and perform additions on them may be causing memory issues?)
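
To put a rough number on that memory question, here is a back-of-the-envelope Python sketch. It assumes each adjacency matrix is held as a dense array of 8-byte doubles, which the thread does not confirm, and the function name is hypothetical.

```python
def dense_matrix_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory needed for a dense n_rows x n_cols matrix of fixed-size values.

    Assumption: 8-byte double precision values, with no sparse storage.
    """
    return n_rows * n_cols * bytes_per_value


if __name__ == "__main__":
    n_genes = 9082  # unique genes reported for the reduced dataset
    one_matrix = dense_matrix_bytes(n_genes, n_genes)
    print(f"{one_matrix / 2**20:.1f} MiB per dense {n_genes}x{n_genes} matrix")
```

Under that assumption a single 9082x9082 matrix takes roughly 629 MiB, so loading several of them plus temporaries for the additions could plausibly exceed a container's memory limit. This is only an estimate and does not establish the cause of the failure.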

ElenaMerinoTejero commented 2 years ago

@agitter Yes, I am still running SINGE from Docker. I followed your suggestion and copied 1 of the 4 adjacency matrices to a test output folder to run Aggregate mode. The error persists, and the ranked edge list and gene influence files are not produced. Attached are the zipped files so you can reproduce the issue. gene_list.mat.zip X_SCODE_data.mat.zip AdjMats.zip

With regard to running SINGE in parallel on a cluster: I will be using Docker and the Sonic HPC cluster (https://www.ucd.ie/itservices/ourservices/researchit/researchcomputing/sonichpc/). In short, I will be able to use up to 48 cores, 50GB of file storage and 1.5TB of RAM. It would be nice to hear suggestions on how to adapt the example in SINGE-supplementary to run on Sonic HPC. Is it possible to run several datasets? Should the input data structure (.mat files) be modified? How should the hyperparameters be specified now? Is there an example wrapper script that shows how to run it with Docker?

ElenaMerinoTejero commented 2 years ago

@atuldeshpande I just tried running aggregate mode on the 4 adjacency matrices through the MATLAB code, and it did produce the Gene Influence and Ranked Edge List output files. Thanks for the suggestion.

gitter-lab / SINGE

Matlab crash while glmnetMex.mexmaci64 was running #69
