ventolab / CellphoneDB

CellPhoneDB can be used to search for a particular ligand/receptor, or interrogate your own HUMAN single-cell transcriptomics data.
https://www.cellphonedb.org/

Running statistical_analysis_method gets stuck at Running Real Analysis #140

Closed anemartinezlarrinaga2898 closed 1 year ago

anemartinezlarrinaga2898 commented 1 year ago

Hello, I'm having the same issue as described here: https://github.com/ventolab/CellphoneDB/issues/102

However, I have changed the function that was giving the error and the analysis still gets stuck at this step: [ ][CORE][12/09/23-16:23:55][INFO] Running Real Analysis

I'm not on Windows; I'm on a Mac (Apple M1 Max).

I'm also running the code in Spyder, not in reticulate or Jupyter notebooks as described in the issue linked above.

Thanks in advance!

Best,

ANE

datasome commented 1 year ago

Hi Ane, Could you please run pip install --force-reinstall "git+https://github.com/ventolab/CellphoneDB.git", then set the threads parameter to 1, and let me know if that worked?
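For reference, a minimal call with the thread count pinned to 1 might look like this (a sketch only, assuming the current Python API; all file paths are placeholders):

```python
from cellphonedb.src.core.methods import cpdb_statistical_analysis_method

results = cpdb_statistical_analysis_method.call(
    cpdb_file_path='cellphonedb.zip',           # placeholder: database zip from a CellphoneDB release
    meta_file_path='metadata.tsv',              # placeholder: cell barcode -> cell type table
    counts_file_path='normalised_counts.h5ad',  # placeholder: normalised counts as AnnData
    counts_data='hgnc_symbol',                  # gene identifiers used in the counts matrix
    output_path='out',                          # results directory
    threads=1,                                  # run the statistical analysis single-threaded
)
```

Thanks! Best, Robert.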

anemartinezlarrinaga2898 commented 1 year ago

Hello Robert,

I tried reinstalling the library and it still gets stuck at the same point.

```
Reading user files...
The following user files were loaded successfully:
LigandReceptorAnalysis/Normalized_Log_Count.h5ad
LigandReceptorAnalysis/metadata.tsv
[ ][CORE][13/09/23-10:39:32][INFO] [Cluster Statistical Analysis] Threshold:0.1 Iterations:1000 Debug-seed:42 Threads:1 Precision:3
[ ][CORE][13/09/23-10:39:32][WARNING] Debug random seed enabled. Set to 42
[ ][CORE][13/09/23-10:40:17][INFO] Running Real Analysis
zsh: killed     python3 3.2_CellPhoneDB.py
```

datasome commented 1 year ago

Hi Ane, Hmm, the above is a different error - it looks as if your process is being killed rather than stuck - possibly due to lack of memory. Do you happen to know the maximum memory available to your processes in Spyder? Also, how many cells are you running your analysis for? Best, Robert.

anemartinezlarrinaga2898 commented 1 year ago

It might be. I described my error to a colleague and she pointed me to that issue because the same thing was happening to her. The error also occurs when running from the terminal. I'm on a Mac with 64 GB of RAM, and I don't know how to check the memory available to Spyder. The number of cells is: AnnData object with n_obs × n_vars = 219217 × 48440

datasome commented 1 year ago

Hi Ane,

Based on the 219217 cells, my rough estimate of the RAM required for either the statistical analysis on 1 core or the DEG analysis is ~64GB - which is what you have on your Mac (but not all of it will be available to your process). If you press command-space and type 'activity monitor', you can monitor the amount of CPU and RAM consumed while the analysis is running. If you see RAM usage increase dramatically as the analysis progresses and then the process gets killed, more than likely you don't have enough memory on your Mac to run the analysis with CellphoneDB. If that's the case, I would suggest running it on an HPC cluster if you have one available at your institute.

As an additional test of whether this is a memory-related problem, you could try to analyse an AnnData object that has fewer cells than the original one and see if the analysis completes ok.
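For example (a sketch using scanpy; the file names are placeholders and 20000 cells is just an illustrative size):

```python
import scanpy as sc

adata = sc.read_h5ad('Normalized_Log_Count.h5ad')    # placeholder: your full object
sc.pp.subsample(adata, n_obs=20000, random_state=0)  # keep a random subset of cells, in place
adata.write('Normalized_Log_Count_subset.h5ad')      # use this smaller file for the test run
```

Best, Robert.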

anemartinezlarrinaga2898 commented 1 year ago

It might be a memory issue, as it is almost at the limit: while the analysis is running, used memory increases from 28 GB (my baseline usage) up to 57.80 GB. I will try to run it on an HPC and see if it is feasible. Thanks!

anemartinezlarrinaga2898 commented 1 year ago

I have tried on an HPC cluster and I'm getting this error: Unable to allocate 716. GiB for an array with shape (2, 219217, 219217) and data type object.

These are the requested resources:

```
#SBATCH --job-name=cpdb        # Job name
#SBATCH --cpus-per-task=84
#SBATCH --ntasks=6             # Run on a single CPU
#SBATCH --mem=84gb             # Job memory request
#SBATCH --time=48:00:00        # Time limit hrs:min:sec
#SBATCH --output=cpdb_%j.log   # Standard output and error log
```

datasome commented 1 year ago

Hi Ane,

I'm not familiar with SLURM and somewhat confused by '716. GiB for an array with shape (2, 219217, 219217)', but for now, according to https://hpc-wiki.info/hpc/SLURM, to specify memory you need to use:

#SBATCH --mem=84G

instead of

#SBATCH --mem=84gb

This is unlikely to be a problem with the tool itself - to eliminate that possibility, please first try analysing https://github.com/ventolab/CellphoneDB/tree/master/example_data on your Mac.
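Putting that together with the rest of your header (the other values copied from your script above), it would read:

```
#SBATCH --job-name=cpdb        # Job name
#SBATCH --cpus-per-task=84
#SBATCH --ntasks=6
#SBATCH --mem=84G              # Job memory request (G rather than gb)
#SBATCH --time=48:00:00        # Time limit hrs:min:sec
#SBATCH --output=cpdb_%j.log   # Standard output and error log
```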

Best, Robert.

anemartinezlarrinaga2898 commented 1 year ago

Thank you, I will try with the example first. I have also asked the IT unit to see if they can help me. I will come back to this issue if the example does not work either.

anemartinezlarrinaga2898 commented 1 year ago

I was able to run the example, so I definitely have a memory issue. I was wondering if it would make sense to randomly take 100-200 cells for each cell type in my object and then run the analysis?

ktroule commented 1 year ago

Hi.

Method 2 (Statistical Analysis) includes an argument that lets you subsample your dataset using a geometric sketching approach. You can also downsample your count object to 1/2, 1/3, 1/4 or any other ratio before inputting it to CellPhoneDB - see the sketch below for the per-cell-type version you asked about.
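A minimal sketch of that manual, per-cell-type downsampling (assuming scanpy/pandas; 'LigandReceptor' is the cell-type column from your metadata, and 200 cells per type is just illustrative):

```python
import scanpy as sc

adata = sc.read_h5ad('AnnData_DataTotalSubset.h5ad')  # placeholder: your converted object

# Keep at most 200 randomly chosen cells per cell type
keep = []
for cell_type, df in adata.obs.groupby('LigandReceptor'):
    keep.extend(df.sample(n=min(200, len(df)), random_state=0).index)

adata_small = adata[keep].copy()                      # subset the AnnData by barcode
adata_small.write('AnnData_Downsampled.h5ad')
```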

Regards

datasome commented 1 year ago

Ane,

To add to Kevin's comment, the subsampling will make your analysis faster but in fact requires even more RAM. For the number of cells you're analysing, the memory requirement is then likely to be ~110GB.

Best, Robert.

anemartinezlarrinaga2898 commented 1 year ago

Thanks to both!

I will try strategies to downsample my dataset before the analysis and see if I can manage to run it!

Thanks again!

anemartinezlarrinaga2898 commented 1 year ago

I have updates: I tried the subsampling option of the second method and was unable to run it. I also tried to do the subsampling myself, keeping only 10% of each cell type, and I'm getting this error when it reaches the Running Real Analysis step: /var/spool/slurmd/job3023928/slurm_script: line 50: 86001 Killed python3 3.2_CellPhoneDB.py

datasome commented 1 year ago

Hi Ane,

It looks as if you keep running out of memory. Out of interest, does the job get killed right away, or does it run for a while and then get killed? And how are you now specifying the memory for your job on Slurm - are you using just G instead of gb, as I suggested above? I ask because, according to https://slurm.schedmd.com/cons_res.html, the default value for --mem is 1MB - hence if you're not setting it correctly, 1MB is what your job gets, and if so, it gets killed right away. Could you please confirm?
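One way to check what your job actually requested and consumed (assuming standard Slurm accounting is enabled on your cluster; replace <jobid> with your job's ID):

```
sacct -j <jobid> --format=JobID,ReqMem,MaxRSS,Elapsed,State
```

Thanks, Robert.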

anemartinezlarrinaga2898 commented 1 year ago

Hello,

It runs for approximately 20-23 minutes and then it gets killed.

This is how I'm specifying the resources needed:

#SBATCH --mem=720gb

Best, Ane

datasome commented 1 year ago

Hi Ane,

Hmm, could you please specify #SBATCH --mem=720G instead and try again?

Best,

Robert.

anemartinezlarrinaga2898 commented 1 year ago

Hello,

I have checked with my IT unit: with 720GB the memory usage is at 127%, so it goes over the limit. I'm now trying to request 850GB.

datasome commented 1 year ago

Hi Ane, I'm still confused by this exceedingly large memory requirement. In all the tests I've done so far, for 219217 cells the memory requirement shouldn't far exceed 64GB. Looking at the '(2, 219217, 219217)' shape mentioned above, I'm now wondering if your anndata.obs dataframe is perhaps exceedingly large or 3-dimensional. Typically I would expect that dataframe to be 2-dimensional, with shape (219217, N), where N is the number of characteristics and much smaller than 219217. Could you please confirm whether, and if so why, your anndata.obs might be so big? More than likely the anndata.read_h5ad method crashes due to lack of memory while reading the h5ad file, before it even gets to the analysis part.
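A quick way to check (a sketch, assuming the file loads at all; the file name is a placeholder):

```python
import anndata as ad

adata = ad.read_h5ad('Normalized_Log_Count.h5ad')  # placeholder: the file you pass to CellphoneDB
print(adata.shape)       # expected: (219217, 48440) - cells x genes
print(adata.obs.shape)   # expected: (219217, N) with a small N
print(adata.obs.dtypes)  # object-dtype columns of free text can bloat memory
```

Best, Robert.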

anemartinezlarrinaga2898 commented 1 year ago

Hello Robert,

Are you speaking about the metadata or the count matrix?

My object is Seurat-based, so I'm converting from Seurat to AnnData. Once the object is generated, this is the code I used to generate the input files:

- Count matrix

```python
import numpy as np
import scanpy as sc

adata = sc.read_h5ad('AnnData_DataTotalSubset.h5ad')
adata.layers["log_transformed"] = np.log1p(adata.X)  # store log1p-transformed counts as a layer
count_normalized = adata.layers["log_transformed"]   # NB: assigned but not used below
adata.write("LigandReceptorAnalysis/Normalized_Log_Count.h5ad")  # writes the whole object, including the new layer
```

Count matrix format: <21907x48440 sparse matrix of type '<class 'numpy.float32'>' with 28258808 stored elements in Compressed Sparse Row format>

- Metadata

```python
metadata = adata.obs
cells_barcode = metadata.index  # the cell barcodes are the row names in pandas
metadata_cellbarcode = metadata.assign(barcode_sample=cells_barcode)
# metadata_ligandreceptor = metadata.loc[:, "barcode_sample":"LigandReceptor"]  # to subset a range of columns
metadata_ligandreceptor = metadata_cellbarcode[['barcode_sample', 'LigandReceptor']]
# NB: to_csv writes the DataFrame index as an extra first column by default (index=True)
metadata_ligandreceptor.to_csv('LigandReceptorAnalysis/metadata.tsv', sep='\t')
```

Metadata format: [21907 rows x 2 columns]

However, I have managed to run it with only 2100 cells; my idea is to see how much memory it consumes and then estimate how much I need for 21907 cells. However, it is going slowly: it has been running for almost 2 hours and has only completed about 2%.

datasome commented 1 year ago

Hi Ane,

No, I was talking about adata.obs, not adata.X, in the context of the error you mentioned above: 'Unable to allocate 716. GiB for an array with shape (2, 219217, 219217) and data type object.' - that occurred when you were trying to use the original h5ad file containing 219217 cells, for which your process seemed to be using upwards of 700GB (so 10 times what I would have predicted based on the number of cells). I'm just trying to understand what in your anndata object is 3-dimensional and bloats the memory so much.

Would it be possible for you to share with me the h5ad file (the one with 21907 cells), the metadata.tsv file and the CellphoneDB analysis call you used? I would be interested to test it locally. If so, could you send the details of how I could access them to contact@cellphonedb.org? For simplicity, we could continue the conversation via that email address and just report here once we've found the issue. Thanks!

Best, Robert.

anemartinezlarrinaga2898 commented 1 year ago

Hello Robert,

I will generate a zip file and send them via email!

Thanks,

Best,

ANE

datasome commented 1 year ago

This issue was due to the user using an incorrectly formatted metadata file - for the correct format, see: https://cellphonedb.readthedocs.io/en/latest/RESULTS-DOCUMENTATION.html#meta-file
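For reference, a correctly formatted meta file is a two-column, tab-separated table along these lines (column names as used in the CellphoneDB tutorials; the barcodes and cell types shown are illustrative - check the linked documentation for the exact requirements):

```
barcode_sample	cell_type
AAACCTGAGCTAGTGG-1	Tcell
AAACCTGCACGGTAAA-1	Fibroblast
```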