InvalidIndexError: Reindexing only valid with uniquely valued Index objects #97

Closed jinlanzhan closed 7 months ago

jinlanzhan commented 1 year ago
cpdb_file_path = '/group_homes/public_cluster/home/u37189/Project/20230102_smell_loss/cellphoneDB/out_total_repository/db/v4.1.0/'
meta_file_path = '/group_homes/public_cluster/home/u37189/Project/20230102_smell_loss/cellphoneDB/124068cell+33genes_metadata.csv'
#counts_file_path = '/group_homes/public_cluster/home/u37189/Project/20230102_smell_loss/cellphoneDB/COVID_dataset_scvi_integration_129668_35259_addvirus_h5ad.h5ad'
counts_file_path = '/group_homes/public_cluster/home/u37189/Project/20230102_smell_loss/cellphoneDB/124068cell+33genes_log1p_normmatrix.h5ad'

from cellphonedb.src.core.methods import cpdb_statistical_analysis_method

deconvoluted, means, pvalues, significant_means = cpdb_statistical_analysis_method.call(
    cpdb_file_path = cpdb_file_path,                 # mandatory: CellPhoneDB database zip file.
    meta_file_path = meta_file_path,                 # mandatory: tsv file defining barcodes to cell label.
    counts_file_path = counts_file_path,             # mandatory: normalized count matrix.
    counts_data = 'gene_name',                       # defines the gene annotation in counts matrix.
    #microenvs_file_path = microenvs_file_path,      # optional (default: None): defines cells per microenvironment.
    iterations = 100,                                # denotes the number of shufflings performed in the analysis.
    threshold = 0.1,                                 # defines the min % of cells expressing a gene for this to be employed in the analysis.
    threads = 15,                                    # number of threads to use in the analysis.
    debug_seed = 42,                                 # debug random seed; a value >= 0 enables it.
    result_precision = 3,                            # Sets the rounding for the mean values in significant_means.
    pvalue = 1,                                      # P-value threshold to employ for significance.
    subsampling = True,                              # To enable subsampling the data (geometric sketching).
    subsampling_log = False,                         # (mandatory) enable subsampling log1p for non log-transformed data inputs.
    subsampling_num_pc = 100,                        # Number of components to subsample via geometric sketching (default: 100).
    subsampling_num_cells = 10000,                   # Number of cells to subsample (integer) (default: 1/3 of the dataset).
    separator = '|',                                 # Sets the string to employ to separate cells in the results dataframes "cellA|CellB".
    debug = False,                                   # Saves all intermediate tables employed during the analysis in pkl format.
    output_path = "./out_total_repository/result_folder_124068+35287_slurm",  # Path to save results.
    output_suffix = None                             # Replaces the timestamp in the output files with a user-defined string (default: None).
)

Reading user files...
The following user files were loaded successfully:
[ ][CORE][15/03/23-19:03:04][INFO] Subsampling 129668 to 10000
[ ][CORE][15/03/23-19:03:04][WARNING] Subsampling failed: ignored.
[ ][CORE][15/03/23-19:03:04][INFO] [Cluster Statistical Analysis] Threshold:0.1 Iterations:100 Debug-seed:42 Threads:15 Precision:3
[ ][CORE][15/03/23-19:03:04][WARNING] Debug random seed enabled. Set to 42
[ ][CORE][15/03/23-19:03:05][INFO] Running Real Analysis
[ ][CORE][15/03/23-19:03:06][INFO] Running Statistical Analysis
100%|██████████| 100/100 [00:17<00:00,  5.67it/s]
[ ][CORE][15/03/23-19:03:39][INFO] Building Pvalues result
[ ][CORE][15/03/23-19:03:39][INFO] Building results
jinlanzhan commented 1 year ago

Hello. Thank you guys for making CellPhoneDB, this is a great resource.

ktroule commented 1 year ago


I've tried to reproduce your error but haven't been able to. Can you confirm that you installed CellPhoneDB in a new conda environment with pip install cellphonedb?

Kind regards

deeKal commented 1 year ago

Hi, I have the exact same problem. I just installed cellphonedb in a new conda environment, and when I try to run

means, deconvoluted =
    cpdb_file_path = cpdb_file_path,
    meta_file_path = meta_file_path,
    counts_file_path = counts_file_path,
    counts_data = 'ensembl_gene_id',
    output_path = out_path,
    separator = '|',
    threshold = 0.1,
    result_precision = 3,
    debug = False,
    output_suffix = None

I get the following:

[ ][CORE][11/06/23-20:12:14][INFO] [Non Statistical Method] Threshold:0.1 Precision:3
Reading user files...
The following user files were loaded successfully:
[ ][CORE][11/06/23-20:13:24][INFO] Running Real Analysis
[ ][CORE][11/06/23-20:13:24][INFO] Building results
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/userpath/.local/lib/python3.9/site-packages/cellphonedb/src/core/methods/", line 143, in call
    means_result, significant_means, deconvoluted_result = build_results(
  File "userpath/.local/lib/python3.9/site-packages/cellphonedb/src/core/methods/", line 254, in build_results
    means_result = pd.concat([interactions_data_result, mean_analysis], axis=1, join='inner', sort=False)
  File "/userpath/.conda/envs/cpdb/lib/python3.9/site-packages/pandas/core/reshape/", line 385, in concat
    return op.get_result()
  File "/userpath/.conda/envs/cpdb/lib/python3.9/site-packages/pandas/core/reshape/", line 612, in get_result
    indexers[ax] = obj_labels.get_indexer(new_labels)
  File "/userpath/.conda/envs/cpdb/lib/python3.9/site-packages/pandas/core/indexes/", line 3731, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Here, in the previous version of the tool, a similar error was due to the pandas version. I'm running Python 3.9, pandas 2.0.2 (required >1.5.0), and cellphonedb 4.0.0, if that makes any difference.
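
For anyone reporting versions on an issue like this one, a minimal sketch of how the details could be collected is shown below. The helper name `describe_package` is hypothetical and not part of CellPhoneDB. It also prints where each package is found, since the traceback above shows cellphonedb loaded from a `.local` site-packages directory while pandas comes from the conda environment.

```python
import platform
from importlib.metadata import PackageNotFoundError, version
from importlib.util import find_spec


def describe_package(package: str) -> str:
    """Return 'version (location)' for an installed package (hypothetical helper)."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return "not installed"
    # find_spec locates the top-level module without importing it.
    spec = find_spec(package)
    location = spec.origin if spec is not None else "unknown location"
    return f"{installed} ({location})"


# Collect the interpreter version plus the two packages discussed in this thread.
report = {"python": platform.python_version()}
for package in ("pandas", "cellphonedb"):
    report[package] = describe_package(package)

for name, details in report.items():
    print(f"{name}: {details}")
```

If the two packages resolve to different site-packages trees, that may be worth mentioning in the report, although nothing in this thread shows it is the cause here.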


ktroule commented 1 year ago


I've checked, and no error appears with Python 3.8.16 and pandas 2.0.1, or with Python 3.10.9 and pandas 2.0.2. We'll keep an eye on this.

datasome commented 1 year ago

Hi deeKal, is it possible that you have some duplicated columns in your counts file? (see: Kind regards, Robert.
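
A quick way to test for duplicated labels, sketched with pandas on a made-up table (the `counts` frame and its gene and cell names are hypothetical, not taken from the files in this thread):

```python
import pandas as pd

# Hypothetical toy counts table: genes as rows, cell barcodes as columns.
counts = pd.DataFrame(
    [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0]],
    index=["GENE1", "GENE2"],
    columns=["cell_1", "cell_2", "cell_1"],  # "cell_1" is deliberately repeated
)

# Index.duplicated() flags every repeat after the first occurrence of a label.
duplicated_columns = counts.columns[counts.columns.duplicated()].unique().tolist()
duplicated_genes = counts.index[counts.index.duplicated()].unique().tolist()

print("duplicated columns:", duplicated_columns)  # ['cell_1']
print("duplicated genes:", duplicated_genes)      # []
```

For an .h5ad input, the same kind of check could presumably be run on the AnnData object's `obs_names` and `var_names` after loading it.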

deeKal commented 1 year ago

Thank you for your answer!

After debugging the corresponding code, I saw that it crashes in build_results() and more specifically, in means_result = pd.concat([interactions_data_result, mean_analysis], axis=1, join='inner', sort=False).

The interactions_data_result dataframe has duplicate indexes, although the interactions are different. I saved the two dataframes in text files which you can find here.
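
The failure mode described above can be reproduced in isolation. The sketch below reuses the two variable names from build_results(), but the frames are made-up stand-ins, not the real CellPhoneDB tables. It assumes only that one frame carries a repeated index label.

```python
import pandas as pd

# Hypothetical stand-ins for the two frames passed to pd.concat in build_results();
# the values are invented, only the repeated index label matters.
interactions_data_result = pd.DataFrame(
    {"interacting_pair": ["A_B", "C_D", "E_F"]},
    index=[0, 1, 1],  # label 1 appears twice although the rows differ
)
mean_analysis = pd.DataFrame({"cellA|cellB": [0.1, 0.2]}, index=[0, 1])

# List the index labels that occur more than once.
index = interactions_data_result.index
duplicated_labels = index[index.duplicated(keep=False)].unique().tolist()
print("duplicated index labels:", duplicated_labels)  # [1]

# Same call as in the traceback: aligning on a non-unique index fails.
concat_error = None
try:
    pd.concat([interactions_data_result, mean_analysis], axis=1, join="inner", sort=False)
except (pd.errors.InvalidIndexError, ValueError) as exc:
    concat_error = exc

print("concat failed:", concat_error is not None)
```

Checking `index.is_unique` on each frame before the concat is a cheap way to confirm which of the two inputs is responsible.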

I hope this helps.

In the meantime, I'll try to run cellphoneDB using other versions of the packages.

datasome commented 1 year ago

Hi Despina, many thanks for sharing your counts file - you've helped us fix a genuine bug that our existing test data sets had failed to pick up. Please could you do: pip install "git+" and try again? Your CellPhoneDB analysis should complete successfully now. In any case, please let us know of any further issues and best of luck with your research. Best, Robert.

deeKal commented 1 year ago

It ran! Thanks so much Robert!