Running statistical_analysis_method gets stuck at 0%

AClab-sgarcia commented 1 year ago

Good morning,

I am trying to run statistical_analysis_method on a large dataset (more than 35k cells), and the run gets stuck at the Statistical Analysis step:

[ ][CORE][04/04/23-12:21:27][INFO] Running Statistical Analysis
0%| | 0/1000 [00:00<?, ?it/s]

I have tried it with the example in the notebooks, and the same thing happens. Could somebody please give me a hint about what is going on?

Also, when dealing with such a large number of cells, what is the best "subsampling_num_cells" to use?

Thanks!

datasome commented 1 year ago

Hi,

Thank you for using CellphoneDB.

The example dataset in https://github.com/ventolab/CellphoneDB/blob/master/notebooks/data_tutorial.zip contains ~3k cells, hence it is unlikely that the lack of subsampling is the cause of your statistical analysis run getting stuck on 0%. Please could you confirm that you've installed the package in a clean virtual environment? https://stackoverflow.com/questions/67506630/python-tqdm-progress-bar-stuck-at-0 may also help. Please let us know how you got on.

As an aside, https://github.com/ventolab/CellphoneDB/blob/master/notebooks/T01_Method2_with_subsampling.ipynb provides some details on subsampling via geometric sketching. Note that if you don't provide subsampling_num_cells, 1/3 of the cells will be used. This may be a good starting point.
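For a rough sense of scale, the default can be worked out by hand. The sketch below assumes the default is simply the cell count divided by three and rounded down; the function name is hypothetical and the exact rounding used by the package is not confirmed here.

```python
def default_subsampling_num_cells(n_cells: int) -> int:
    """Hypothetical helper: one third of the cells, rounded down (assumed rounding)."""
    return n_cells // 3


# For a dataset of 35,000 cells this gives a starting point of roughly 11.7k cells.
print(default_subsampling_num_cells(35_000))
```

Passing a smaller explicit subsampling_num_cells would trade some resolution for a shorter run.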

Best regards,

Robert.

AClab-sgarcia commented 1 year ago

Thanks for the fast and kind reply @datasome

I created a new virtual environment to run CellPhoneDB:

pip freeze
anndata==0.8.0
asttokens==2.2.1
backcall==0.2.0
biopython==1.81
CellphoneDB==4.0.0
certifi==2022.12.7
charset-normalizer==3.1.0
colorama==0.4.6
comm==0.1.3
contourpy==1.0.7
cycler==0.11.0
debugpy==1.6.6
decorator==5.1.1
executing==1.2.0
fbpca==1.0
fonttools==4.39.2
geosketch==1.2
h5py==3.8.0
idna==3.4
importlib-metadata==6.1.0
importlib-resources==5.12.0
ipykernel==6.22.0
ipython==8.11.0
ipywidgets==8.0.6
jedi==0.18.2
joblib==1.2.0
jupyter_client==8.1.0
jupyter_core==5.3.0
jupyterlab-widgets==3.0.7
kiwisolver==1.4.4
ktplotspy==0.1.8
matplotlib==3.7.1
matplotlib-inline==0.1.6
mizani==0.8.1
natsort==8.3.1
nest-asyncio==1.5.6
numpy==1.24.2
numpy-groupies==0.9.20
packaging==23.0
palettable==3.3.0
pandas==1.5.3
parso==0.8.3
patsy==0.5.3
pickleshare==0.7.5
Pillow==9.4.0
platformdirs==3.2.0
plotnine==0.10.1
prompt-toolkit==3.0.38
psutil==5.9.4
pure-eval==0.2.2
Pygments==2.14.0
pyparsing==3.0.9
python-circos==0.3.0
python-dateutil==2.8.2
pytz==2023.2
pywin32==306
pyzmq==25.0.2
requests==2.28.2
scikit-learn==0.24.0
scipy==1.10.1
seaborn==0.12.2
six==1.16.0
stack-data==0.6.2
statsmodels==0.13.5
threadpoolctl==3.1.0
tornado==6.2
tqdm==4.65.0
traitlets==5.9.0
tzdata==2023.2
urllib3==1.26.15
wcwidth==0.2.6
widgetsnbextension==4.0.7
zipp==3.15.0
(CellPhoneDB)

I tried installing ipywidgets as suggested in https://stackoverflow.com/questions/67506630/python-tqdm-progress-bar-stuck-at-0, but it is still not working.

Saioa

datasome commented 1 year ago

Hi Saioa,

Hmm, I see pandas==1.5.3 above (released on 18 Jan 2023, according to https://pandas.pydata.org/docs/whatsnew/v1.5.3.html, i.e. well before we released CellphoneDB v4.0.0 - on 10 March 2023)?

What I've just now tested is the following:

conda create -n cpdb python=3.9
source activate cpdb
pip install cellphonedb
pip install jupyter

Then I ran jupyter notebook, tested the statistical method, and it worked as expected. Would you mind trying the same (i.e. from scratch) and letting us know how you got on? Thanks,

Robert.

AClab-sgarcia commented 1 year ago

Thanks for the kind reply Robert,

I am using your package through reticulate in RStudio. For this, I created a virtual environment from scratch and installed the package. Because of this setup I cannot use Jupyter; maybe this is what prevents it from working correctly? I have used this approach with other packages and never had any problems, so I don't know if there is a way to solve it.

To be more specific, I used this same approach with the previous version of CellPhoneDB and had no problem using it.

Thanks!

datasome commented 1 year ago

Hi Saioa,

It's hard for me to comment as I'm not familiar with reticulate. It would make sense to ascertain whether there is some conflict between reticulate and tqdm, the Python module we use to implement the progress bar during the statistical analysis run. To test this, could I please ask that you run:

pip install --force-reinstall "git+https://github.com/ventolab/CellphoneDB.git@reticulate"

and in cpdb_statistical_analysis_method.call() provide one additional argument:

progress_bar = False

and then let us know how you got on?

Best regards, Robert.

AClab-sgarcia commented 1 year ago

Hi Robert,

Thank you very much for your efforts to help me.

I have updated the package as you have told me:

$ pip freeze 
anndata==0.8.0 
asttokens==2.2.1 
backcall==0.2.0 
biopython==1.81
cellphonedb @ git+https://github.com/ventolab/CellphoneDB.git@3b306f7e4b369889763a597c85bc1ad7c3b4ecb6
certifi==2022.12.7
charset-normalizer==3.1.0
colorama==0.4.6
comm==0.1.3
contourpy==1.0.7
cycler==0.11.0
debugpy==1.6.6
decorator==5.1.1
executing==1.2.0
fbpca==1.0
fonttools==4.39.3
geosketch==1.2
h5py==3.8.0
idna==3.4
importlib-metadata==6.1.0
importlib-resources==5.12.0
ipykernel==6.22.0
ipython==8.11.0
ipywidgets==8.0.6
jedi==0.18.2
joblib==1.2.0
jupyter_client==8.1.0
jupyter_core==5.3.0
jupyterlab-widgets==3.0.7
kiwisolver==1.4.4
ktplotspy==0.1.9
matplotlib==3.7.1
matplotlib-inline==0.1.6
mizani==0.8.1
natsort==8.3.1
nest-asyncio==1.5.6
numpy==1.24.2
numpy-groupies==0.9.20
packaging==23.0
palettable==3.3.1
pandas==2.0.0
parso==0.8.3
patsy==0.5.3
pickleshare==0.7.5
Pillow==9.5.0
platformdirs==3.2.0
plotnine==0.10.1
prompt-toolkit==3.0.38
psutil==5.9.4
pure-eval==0.2.2
Pygments==2.14.0
pyparsing==3.0.9
python-circos==0.3.0
python-dateutil==2.8.2
pytz==2023.3
pywin32==306
pyzmq==25.0.2
requests==2.28.2
scikit-learn==0.24.0
scipy==1.10.1
seaborn==0.12.2
six==1.16.0
stack-data==0.6.2
statsmodels==0.13.5
threadpoolctl==3.1.0
tornado==6.2
tqdm==4.65.0
traitlets==5.9.0
tzdata==2023.3
urllib3==1.26.15
wcwidth==0.2.6
widgetsnbextension==4.0.7
zipp==3.15.0
(CellPhoneDB)

I have used the notebook example but it still does not work. The progress bar no longer appears, but the analysis still does not finish:

deconvoluted, means, pvalues, significant_means = cpdb_statistical_analysis_method.call(
    cpdb_file_path = cpdb_file_path,                 
    meta_file_path = meta_file_path,                 
    counts_file_path = counts_file_path,             
    counts_data = 'hgnc_symbol',                    
    microenvs_file_path = microenvs_file_path,       
    iterations = 1000,                               
    threshold = 0.1,                                 
    threads = 4,                                     
    debug_seed = 42,                                 
    result_precision = 3,                            
    pvalue = 0.05,                                   
    subsampling = True,                              
    subsampling_log = False,                         
    subsampling_num_pc = 100,                      
    subsampling_num_cells = 3312,                    
    separator = '|',                               
    debug = False,                                   
    output_path = out_path,                          
    output_suffix = None,                           
    progress_bar = False                             
)

Thank you very much for your help! Saioa

datasome commented 1 year ago

Hi Saioa, Could you please try and run the package within Jupyter notebook but outside of reticulate, so that we can establish if the issue is somehow to do with your machine or your reticulate? Best wishes, Robert.

AClab-sgarcia commented 1 year ago

Hi Robert,

I have tried to run it in Jupyter notebook, and both the original version and the version you made for reticulate work, so I assume that it is not a problem with my machine but with the connection to reticulate.

Separately, when the "Building results" step begins, I get the following error with both options (original and reticulate versions):

pip freeze

Note: you may need to restart the kernel to use updated packages.
aiofiles==22.1.0
aiosqlite==0.18.0
anndata==0.9.0
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asttokens==2.2.1
attrs==22.2.0
Babel==2.12.1
backcall==0.2.0
beautifulsoup4==4.12.2
biopython==1.81
bleach==6.0.0
CellphoneDB==4.0.0
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==3.1.0
colorama==0.4.6
comm==0.1.3
contourpy==1.0.7
cycler==0.11.0
debugpy==1.6.7
decorator==5.1.1
defusedxml==0.7.1
executing==1.2.0
fastjsonschema==2.16.3
fbpca==1.0
fonttools==4.39.3
fqdn==1.5.1
geosketch==1.2
h5py==3.8.0
idna==3.4
importlib-metadata==6.3.0
importlib-resources==5.12.0
ipykernel==6.22.0
ipython==8.12.0
ipython-genutils==0.2.0
isoduration==20.11.0
jedi==0.18.2
Jinja2==3.1.2
joblib==1.2.0
json5==0.9.11
jsonpointer==2.3
jsonschema==4.17.3
jupyter-events==0.6.3
jupyter-ydoc==0.2.3
jupyter_client==8.1.0
jupyter_core==5.3.0
jupyter_server==2.5.0
jupyter_server_fileid==0.9.0
jupyter_server_terminals==0.4.4
jupyter_server_ydoc==0.8.0
jupyterlab==3.6.3
jupyterlab-pygments==0.2.2
jupyterlab_server==2.22.0
kiwisolver==1.4.4
ktplotspy==0.1.9
MarkupSafe==2.1.2
matplotlib==3.7.1
matplotlib-inline==0.1.6
mistune==2.0.5
mizani==0.8.1
natsort==8.3.1
nbclassic==0.5.5
nbclient==0.7.3
nbconvert==7.3.1
nbformat==5.8.0
nest-asyncio==1.5.6
notebook==6.5.4
notebook_shim==0.2.2
numpy==1.24.2
numpy-groupies==0.9.20
packaging==23.0
palettable==3.3.1
pandas==1.5.0
pandocfilters==1.5.0
parso==0.8.3
patsy==0.5.3
pickleshare==0.7.5
Pillow==9.5.0
platformdirs==3.2.0
plotnine==0.10.1
prometheus-client==0.16.0
prompt-toolkit==3.0.38
psutil==5.9.4
pure-eval==0.2.2
pycparser==2.21
Pygments==2.15.0
pyparsing==3.0.9
pyrsistent==0.19.3
python-circos==0.3.0
python-dateutil==2.8.2
python-json-logger==2.0.7
pytz==2023.3
pywin32==306
pywinpty==2.0.10
PyYAML==6.0
pyzmq==25.0.2
requests==2.28.2
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
scikit-learn==0.24.0
scipy==1.10.1
seaborn==0.12.2
Send2Trash==1.8.0
six==1.16.0
sniffio==1.3.0
soupsieve==2.4
stack-data==0.6.2
statsmodels==0.13.5
terminado==0.17.1
threadpoolctl==3.1.0
tinycss2==1.2.1
tomli==2.0.1
tornado==6.2
tqdm==4.65.0
traitlets==5.9.0
typing_extensions==4.5.0
tzdata==2023.3
uri-template==1.2.0
urllib3==1.26.15
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.5.1
y-py==0.5.9
ypy-websocket==0.8.2
zipp==3.15.0

import os
import importlib
import warnings
warnings.filterwarnings("ignore")
import glob

import anndata as ad
import pandas as pd
import pickle as pkl
import IPython
import cellphonedb
from cellphonedb.utils import db_utils

# For section 3 & 4
from cellphonedb.src.core.methods import cpdb_statistical_analysis_method
from cellphonedb.utils import search_utils
import ktplotspy as kpy

# Get the current working directory
cwd = os.getcwd()

# Print the current working directory
print("Current working directory: {0}".format(cwd))

Current working directory: C:\Users\sgarcia\Documents

os.chdir("".join([cwd, "/data_tutorial/"]))

os.getcwd()

'C:\\Users\\sgarcia\\Documents\\data_tutorial'

# Inspect input files
cpdb_file_path = 'db/cellphonedb.zip'
meta_file_path = 'data/metadata.tsv'
counts_file_path = 'data/normalised_log_counts.h5ad'
microenvs_file_path = 'data/microenvironment.tsv'
out_path = 'method2_with_subsampling'

metadata = pd.read_csv(meta_file_path, sep = '\t')
metadata.head(3)

	barcode_sample	cell_type
0	AGCGATTAGTCTAACC-1_Pla_HDBR10917733	B_cells
1	ATCCGTGAGGCTAGAA-1_Pla_Camb10714918	B_cells
2	AGTAACCCATTAAAGG-1_Pla_HDBR10917733	B_cells

import anndata

adata = anndata.read_h5ad(counts_file_path)
adata.shape

sorted(adata.obs.index) == sorted(metadata['barcode_sample'])
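A note on this kind of barcode check: list.sort() sorts in place and returns None, so comparing the results of two .sort() calls is always True, whereas sorted() returns the sorted lists and compares their contents. The barcodes below are made up purely for illustration.

```python
# Made-up barcodes; the two lists deliberately differ in one entry.
obs_index = ["cell_b", "cell_a", "cell_c"]
meta_barcodes = ["cell_a", "cell_x", "cell_c"]

# list.sort() returns None, so this is really None == None.
in_place = list(obs_index).sort() == list(meta_barcodes).sort()

# sorted() returns new sorted lists, so the actual contents are compared.
by_value = sorted(obs_index) == sorted(meta_barcodes)

print(in_place)  # True, even though the lists differ
print(by_value)  # False
```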

microenv = pd.read_csv(microenvs_file_path, sep = '\t')
microenv.head(3)

microenv.groupby('microenvironment', group_keys = False)['cell_type'].apply(lambda x : list(x.value_counts().index))

microenvironment
Env1    [PV MMP11, PV MYH11, PV STEAP4, EVT_1, EVT_2, ...
Name: cell_type, dtype: object

from cellphonedb.src.core.methods import cpdb_statistical_analysis_method

deconvoluted, means, pvalues, significant_means = cpdb_statistical_analysis_method.call(
    cpdb_file_path = cpdb_file_path,                 # mandatory: CellPhoneDB database zip file.
    meta_file_path = meta_file_path,                 # mandatory: tsv file defining barcodes to cell label.
    counts_file_path = counts_file_path,             # mandatory: normalized count matrix.
    counts_data = 'hgnc_symbol',                     # defines the gene annotation in counts matrix.
    microenvs_file_path = microenvs_file_path,       # optional (default: None): defines cells per microenvironment.
    iterations = 1000,                               # denotes the number of shufflings performed in the analysis.
    threshold = 0.1,                                 # defines the min % of cells expressing a gene for this to be employed in the analysis.
    threads = 4,                                     # number of threads to use in the analysis.
    debug_seed = 42,                                 # debug random seed. To disable >=0.
    result_precision = 3,                            # Sets the rounding for the mean values in significant_means.
    pvalue = 0.05,                                   # P-value threshold to employ for significance.
    subsampling = True,                              # To enable subsampling the data (geometric sketching).
    subsampling_log = False,                         # (mandatory) enable subsampling log1p for non log-transformed data inputs.
    subsampling_num_pc = 100,                        # Number of components to subsample via geometric sketching (default: 100).
    subsampling_num_cells = 3312,                    # Number of cells to subsample (integer) (default: 1/3 of the dataset).
    separator = '|',                                 # Sets the string to employ to separate cells in the results dataframes "cellA|CellB".
    debug = False,                                   # Saves all intermediate tables employed during the analysis in pkl format.
    output_path = out_path,                          # Path to save results.
    output_suffix = None,                            # Replaces the timestamp in the output files by a user defined string in the  (default: None).
    )

Reading user files...
The following user files were loaded successfully:
data/normalised_log_counts.h5ad
data/metadata.tsv
data/microenvironment.tsv
[ ][CORE][12/04/23-12:34:26][INFO] Subsampling 3312 to 3312
[ ][CORE][12/04/23-12:34:28][INFO] Done subsampling 3312 to 3312
[ ][CORE][12/04/23-12:34:29][INFO] [Cluster Statistical Analysis] Threshold:0.1 Iterations:1000 Debug-seed:42 Threads:4 Precision:3
[ ][CORE][12/04/23-12:34:29][WARNING] Debug random seed enabled. Set to 42
[ ][CORE][12/04/23-12:34:29][INFO] Running Real Analysis
[ ][CORE][12/04/23-12:34:29][INFO] Limiting cluster combinations using microenvironments
[ ][CORE][12/04/23-12:34:29][INFO] Running Statistical Analysis

100%|███████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:55<00:00, 17.90it/s]

[ ][CORE][12/04/23-12:35:25][INFO] Building Pvalues result

[ ][CORE][12/04/23-12:35:25][INFO] Building results

---------------------------------------------------------------------------

OSError                                   Traceback (most recent call last)

Cell In[7], line 3
      1 from cellphonedb.src.core.methods import cpdb_statistical_analysis_method
----> 3 deconvoluted, means, pvalues, significant_means = cpdb_statistical_analysis_method.call(
      4     cpdb_file_path = cpdb_file_path,                 # mandatory: CellPhoneDB database zip file.
      5     meta_file_path = meta_file_path,                 # mandatory: tsv file defining barcodes to cell label.
      6     counts_file_path = counts_file_path,             # mandatory: normalized count matrix.
      7     counts_data = 'hgnc_symbol',                     # defines the gene annotation in counts matrix.
      8     microenvs_file_path = microenvs_file_path,       # optional (default: None): defines cells per microenvironment.
      9     iterations = 1000,                               # denotes the number of shufflings performed in the analysis.
     10     threshold = 0.1,                                 # defines the min % of cells expressing a gene for this to be employed in the analysis.
     11     threads = 4,                                     # number of threads to use in the analysis.
     12     debug_seed = 42,                                 # debug randome seed. To disable >=0.
     13     result_precision = 3,                            # Sets the rounding for the mean values in significan_means.
     14     pvalue = 0.05,                                   # P-value threshold to employ for significance.
     15     subsampling = True,                              # To enable subsampling the data (geometri sketching).
     16     subsampling_log = False,                         # (mandatory) enable subsampling log1p for non log-transformed data inputs.
     17     subsampling_num_pc = 100,                        # Number of componets to subsample via geometric skectching (dafault: 100).
     18     subsampling_num_cells = 3312,                    # Number of cells to subsample (integer) (default: 1/3 of the dataset).
     19     separator = '|',                                 # Sets the string to employ to separate cells in the results dataframes "cellA|CellB".
     20     debug = False,                                   # Saves all intermediate tables employed during the analysis in pkl format.
     21     output_path = out_path,                          # Path to save results.
     22     output_suffix = None,                            # Replaces the timestamp in the output files by a user defined string in the  (default: None).
     23     )

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\cellphonedb\src\core\methods\cpdb_statistical_analysis_method.py:132, in call(cpdb_file_path, meta_file_path, counts_file_path, counts_data, output_path, microenvs_file_path, iterations, threshold, threads, debug_seed, result_precision, pvalue, subsampling, subsampling_log, subsampling_num_pc, subsampling_num_cells, separator, debug, output_suffix)
    129 significant_means['rank'] = significant_means['rank'].apply(lambda rank: rank if rank != 0 else (1 + max_rank))
    130 significant_means.sort_values('rank', inplace=True)
--> 132 file_utils.save_dfs_as_tsv(output_path, output_suffix, "statistical_analysis", \
    133                         {"deconvoluted" : deconvoluted, \
    134                         "means" : means, \
    135                         "pvalues" : pvalues, \
    136                         "significant_means" : significant_means} )
    138 return deconvoluted, means, pvalues, significant_means

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\cellphonedb\utils\file_utils.py:212, in save_dfs_as_tsv(out, suffix, analysis_name, name2df)
    210 for name, df in name2df.items():
    211     file_path = os.path.join(out, "{}_{}_{}.{}".format(analysis_name, name, suffix, "txt"))
--> 212     df.to_csv(file_path, sep = '\t', index=False)
    213     print("Saved {} to {}".format(name, file_path))

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py:211, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    209     else:
    210         kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\core\generic.py:3721, in NDFrame.to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, lineterminator, chunksize, date_format, doublequote, escapechar, decimal, errors, storage_options)
   3710 df = self if isinstance(self, ABCDataFrame) else self.to_frame()
   3712 formatter = DataFrameFormatter(
   3713     frame=df,
   3714     header=header,
   (...)
   3718     decimal=decimal,
   3719 )
-> 3721 return DataFrameRenderer(formatter).to_csv(
   3722     path_or_buf,
   3723     lineterminator=lineterminator,
   3724     sep=sep,
   3725     encoding=encoding,
   3726     errors=errors,
   3727     compression=compression,
   3728     quoting=quoting,
   3729     columns=columns,
   3730     index_label=index_label,
   3731     mode=mode,
   3732     chunksize=chunksize,
   3733     quotechar=quotechar,
   3734     date_format=date_format,
   3735     doublequote=doublequote,
   3736     escapechar=escapechar,
   3737     storage_options=storage_options,
   3738 )

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\util\_decorators.py:211, in deprecate_kwarg.<locals>._deprecate_kwarg.<locals>.wrapper(*args, **kwargs)
    209     else:
    210         kwargs[new_arg_name] = new_arg_value
--> 211 return func(*args, **kwargs)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\formats\format.py:1189, in DataFrameRenderer.to_csv(self, path_or_buf, encoding, sep, columns, index_label, mode, compression, quoting, quotechar, lineterminator, chunksize, date_format, doublequote, escapechar, errors, storage_options)
   1168     created_buffer = False
   1170 csv_formatter = CSVFormatter(
   1171     path_or_buf=path_or_buf,
   1172     lineterminator=lineterminator,
   (...)
   1187     formatter=self.fmt,
   1188 )
-> 1189 csv_formatter.save()
   1191 if created_buffer:
   1192     assert isinstance(path_or_buf, StringIO)

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\formats\csvs.py:241, in CSVFormatter.save(self)
    237 """
    238 Create the writer & save.
    239 """
    240 # apply compression and byte/text conversion
--> 241 with get_handle(
    242     self.filepath_or_buffer,
    243     self.mode,
    244     encoding=self.encoding,
    245     errors=self.errors,
    246     compression=self.compression,
    247     storage_options=self.storage_options,
    248 ) as handles:
    249 
    250     # Note: self.encoding is irrelevant here
    251     self.writer = csvlib.writer(
    252         handles.handle,
    253         lineterminator=self.lineterminator,
   (...)
    258         quotechar=self.quotechar,
    259     )
    261     self._save()

File ~\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\common.py:857, in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    852 elif isinstance(handle, str):
    853     # Check whether the filename is to be opened in binary mode.
    854     # Binary mode does not support 'encoding' and 'newline'.
    855     if ioargs.encoding and "b" not in ioargs.mode:
    856         # Encoding
--> 857         handle = open(
    858             handle,
    859             ioargs.mode,
    860             encoding=ioargs.encoding,
    861             errors=errors,
    862             newline="",
    863         )
    864     else:
    865         # Binary mode
    866         handle = open(handle, ioargs.mode)

OSError: [Errno 22] Invalid argument: 'method2_with_subsampling\\statistical_analysis_deconvoluted_04_12_2023_12:35:26.txt'

Thanks

datasome commented 1 year ago

Hi Saioa,

Thanks for all the info. To summarize, the following is currently the case:

- The original package version never gets past 0%, and the reticulate version never finishes, when you run your analysis in the notebook via reticulate.
- Both the original and reticulate versions appear to run when you run Jupyter notebook outside of reticulate, but the 'Building results' phase fails with the error "OSError: [Errno 22] Invalid argument: 'method2_with_subsampling\statistical_analysis_deconvoluted_04_12_2023_12:35:26.txt'".

Could I please ask you to perform the following tests:

1. As you appear to use Windows, in Jupyter notebook outside of reticulate:
   a. using the original version, try to set out_path to either 'C:/Users/sgarcia/Documents/data_tutorial/method2_with_subsampling' or './method2_with_subsampling' and see if that makes a difference (c.f. https://stackoverflow.com/questions/57673922/writing-a-pandas-dataframe-to-csv)
   b. get the latest reticulate version, leave out_path as 'method2_with_subsampling', and see if this works
2. In the notebook run via reticulate, using the original version, try to run the statistical analysis using the files in https://github.com/ventolab/CellphoneDB/tree/master/example_data . I just want to eliminate the possibility that, say, available memory is an issue when you use reticulate to run the notebook.

Best wishes,

Robert.

AClab-sgarcia commented 1 year ago

Hi Robert,

Regarding what you asked: 1.a. I used jupyter notebook with:

out_path = './method2_with_subsampling'

and

out_path = 'C:/Users/sgarcia/Documents/data_tutorial/method2_with_subsampling'

and neither worked; I keep getting the same error:

OSError: [Errno 22] Invalid argument:

1.b. Using it inside reticulate with those changes does not work either, as the original issue is not fixed: the statistical analysis does not complete.

I tried both the original and the @reticulate versions of your library with the steps in your notebook and with your data, and I keep getting the same errors explained above.

Therefore, and as a summary:

In RStudio via Reticulate:
- CellPhoneDB (original): when using cpdb_statistical_analysis_method in RStudio via reticulate, it does not work because the bar is stuck at 0%.
- CellPhoneDB@reticulate: the option that you created, where the bar display can be removed, does not work either when running it in RStudio.

I have tested both with my data and also with your data in the examples.

Using Jupyter Notebooks: both CellPhoneDB (original) and CellPhoneDB@reticulate run, but there is an error: "OSError: [Errno 22] Invalid argument: 'method2_with_subsampling\statistical_analysis_deconvoluted_04_12_2023_12:35:26.txt'".

I have also tested this option with my own data as well as with yours from the example, and also by changing the output paths as you suggested in the previous message.

Thank you very much Saioa

datasome commented 1 year ago

Hi Saioa, Thanks for the update - 1b needed to be run in jupyter notebook outside of reticulate but using the latest code from reticulate branch of CellphoneDB package. Apologies - I appreciate this is getting confusing.. Could you please try that? Thanks! Robert.

AClab-sgarcia commented 1 year ago

Hello!

I'm sorry for the confusion. I have tried to use it in Jupyter notebook with CellPhoneDB@reticulate in the following ways:

out_path = 'method2_with_subsampling'
out_path = './method2_with_subsampling'
out_path = 'C:/Users/sgarcia/Documents/data_tutorial/method2_with_subsampling'

and I keep getting the same error as with CellPhoneDB (original): "OSError: [Errno 22] Invalid argument: 'method2_with_subsampling\statistical_analysis_deconvoluted_04_12_2023_12:35:26.txt'"

Thanks! Saioa

datasome commented 1 year ago

Hi Saioa, Thanks for the feedback. To progress on running Jupyter notebook with CellPhoneDB@reticulate, I've commented out temporarily the code to save resulting DataFrames to files. Hence when you run the analysis (having pulled the latest CellPhoneDB@reticulate), hopefully the analysis succeeds and you will have the DataFrames available to you in the notebook. Would you mind experimenting with saving the DataFrames using save_dfs_as_tsv function in https://github.com/ventolab/CellphoneDB/blob/reticulate/cellphonedb/utils/file_utils.py as a starting point? Perhaps there's some way of using os.path.join and/or os.path.abspath that works on Windows? It's difficult for me to test this locally as I don't have access to a Windows machine. Good luck and thanks! Robert.

AClab-sgarcia commented 1 year ago

Hi Robert,

In the end I was able to fix it. The problem was not in the paths; I was able to keep:

out_path = 'method2_with_subsampling'

The error comes from the way the timestamp is built in the function get_timestampsuffix(): I changed ("%m%d%Y%Y%H:%M:%S") to ("%m%d%Y%H%M%S"), i.e. removed the colons, which are not allowed in Windows file names, and now the files are saved without problems. See:

https://stackoverflow.com/a/75650000
https://www.pythonpool.com/oserror-errno22-invalid-argument-solved/
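The cause can be illustrated without CellphoneDB. The sketch below builds two file names from the run time seen in the traceback and checks them against the characters Windows forbids in file names. The helper names and the underscore-separated format are assumptions for illustration (the format is inferred from the file name in the error message), not the package's actual code.

```python
from datetime import datetime

# Characters that Windows does not allow in file names.
WINDOWS_FORBIDDEN = set('<>:"/\\|?*')


def is_valid_windows_name(file_name: str) -> bool:
    """Return True if file_name contains no character forbidden on Windows."""
    return not any(ch in WINDOWS_FORBIDDEN for ch in file_name)


def timestamp_suffix(now: datetime) -> str:
    """Hypothetical colon-free suffix; the exact format is an assumption."""
    return now.strftime("%m_%d_%Y_%H%M%S")


# Fixed run time taken from the file name in the error message.
run_time = datetime(2023, 4, 12, 12, 35, 26)
with_colons = run_time.strftime("%m_%d_%Y_%H:%M:%S")
without_colons = timestamp_suffix(run_time)

for suffix in (with_colons, without_colons):
    name = "statistical_analysis_deconvoluted_{}.txt".format(suffix)
    print(name, is_valid_windows_name(name))
```

On Linux and macOS a colon is accepted in a file name, which would explain why the problem only shows up on Windows.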

However, it still does not work in RStudio with the reticulate library.

Thanks Saioa

datasome commented 1 year ago

Hi Saioa,

That's great - well done for finding the cause! I've just made that fix in our master branch. Would you mind running

pip install --force-reinstall "git+https://github.com/ventolab/CellphoneDB.git"

and testing that it works also?

On reticulate, I've just installed RStudio, and did the following:

install.packages("reticulate")
library(reticulate)
use_condaenv(condaenv = 'cpdb_reticulate', required = TRUE)
repl_python()

where cpdb_reticulate is my clean venv with https://github.com/ventolab/CellphoneDB.git@reticulate installed in it. Then I was able to run both basic and statistical analyses and they ran fine (no Jupyter notebook involved). Could you please confirm how you run CellphoneDB from reticulate so that I can try and replicate it locally?

Best wishes,

Robert.

AClab-sgarcia commented 1 year ago

Hello Robert,

I have tried to redo the analysis using the correction in the master branch and now it works perfectly fine.

Regarding running it in R, it still doesn't work for me. Let me tell you what I have done:

1. Make a new virtual env and install https://github.com/ventolab/CellphoneDB.git@reticulate
2. In RStudio -> Tools -> Global Options -> Python, change the virtual environment to the one created in step 1
3. repl_python()

In the cpdb_statistical_analysis_method() function I used progress_bar = False, but it still does not work; it gets stuck at the same point:

Reading user files...
The following user files were loaded successfully:
data/normalised_log_counts.h5ad
data/metadata.tsv
data/microenvironment.tsv
[ ][CORE][14/04/23-09:21:42][INFO] Subsampling 3312 to 3312
[ ][CORE][14/04/23-09:21:44][INFO] Done subsampling 3312 to 3312
[ ][CORE][14/04/23-09:21:45][INFO] [Cluster Statistical Analysis] Threshold:0.1 Iterations:1000 Debug-seed:42 Threads:4 Precision:3
[ ][CORE][14/04/23-09:21:45][WARNING] Debug random seed enabled. Set to 42
[ ][CORE][14/04/23-09:21:45][INFO] Running Real Analysis
[ ][CORE][14/04/23-09:21:45][INFO] Limiting cluster combinations using microenvironments
[ ][CORE][14/04/23-09:21:45][INFO] Running Statistical Analysis

Thanks

datasome commented 1 year ago

Hi Saioa, Thanks for the update - I've used RStudio -> Tools -> Global Options -> Python rather than use_condaenv but it's still working for me. I'm now suspecting running python via reticulate on Windows has a problem with the module we use for parallelising our statistical analysis (see: https://github.com/rstudio/reticulate/issues/1353). I've just made a change to https://github.com/ventolab/CellphoneDB.git@reticulate to not use that module when threads is set to 1. Would you mind pulling the latest from the reticulate branch, setting threads=1 and seeing if it works? I've added a basic progress display to ease the pain of waiting for the result in single-threaded mode.

If the above works, it may be that the trade-off for running via reticulate will be having to wait a little longer for the results..
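The idea of skipping the parallelisation module when threads is 1 can be sketched as follows. This is not the package's code: multiprocessing.Pool stands in for whatever module is actually used, and one_iteration and run_iterations are hypothetical names with a placeholder workload.

```python
import random
from functools import partial
from multiprocessing import Pool


def one_iteration(iteration: int, seed: int) -> float:
    """Placeholder for a single shuffling; returns a reproducible random number."""
    return random.Random(seed + iteration).random()


def run_iterations(iterations: int, threads: int, seed: int = 42) -> list:
    """Run all iterations, creating worker processes only when threads > 1."""
    work = partial(one_iteration, seed=seed)
    if threads == 1:
        # Serial path: no worker processes are created at all.
        results = []
        report_every = max(1, iterations // 10)
        for i in range(iterations):
            results.append(work(i))
            if (i + 1) % report_every == 0:
                # Basic progress display for the single-threaded mode.
                print("Completed {}/{} iterations".format(i + 1, iterations))
        return results
    # Parallel path: distribute the iterations over worker processes.
    with Pool(processes=threads) as pool:
        return pool.map(work, range(iterations))


if __name__ == "__main__":
    print(len(run_iterations(20, threads=1)))
```

The serial path gives the same results as the parallel one for a fixed seed, so the only cost of threads=1 is a longer wait.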

Let me know how you get on.

Best wishes,

Robert.

AClab-sgarcia commented 1 year ago

Hi Robert!

It worked!!

Thank you very much for your help, I really appreciate it.

Saioa

ventolab / CellphoneDB

Running statistical_analysis_method gets stuck at 0% #102