InvalidIndexError: Reindexing only valid with uniquely valued Index objects

jinlanzhan commented 1 year ago

cpdb_file_path = '/group_homes/public_cluster/home/u37189/Project/20230102_smell_loss/cellphoneDB/out_total_repository/db/v4.1.0/cellphonedb.zip'
meta_file_path = '/group_homes/public_cluster/home/u37189/Project/20230102_smell_loss/cellphoneDB/124068cell+33genes_metadata.csv'
#counts_file_path = '/group_homes/public_cluster/home/u37189/Project/20230102_smell_loss/cellphoneDB/COVID_dataset_scvi_integration_129668_35259_addvirus_h5ad.h5ad'
counts_file_path = '/group_homes/public_cluster/home/u37189/Project/20230102_smell_loss/cellphoneDB/124068cell+33genes_log1p_normmatrix.h5ad'

from cellphonedb.src.core.methods import cpdb_statistical_analysis_method

deconvoluted, means, pvalues, significant_means = cpdb_statistical_analysis_method.call(
    cpdb_file_path = cpdb_file_path,                 # mandatory: CellPhoneDB database zip file.
    meta_file_path = meta_file_path,                 # mandatory: tsv file defining barcodes to cell label.
    counts_file_path = counts_file_path,             # mandatory: normalized count matrix.
    counts_data = 'gene_name',                     # defines the gene annotation in counts matrix.
    #microenvs_file_path = microenvs_file_path,       # optional (default: None): defines cells per microenvironment.
    iterations = 100,                               # denotes the number of shufflings performed in the analysis.
    threshold = 0.1,                                 # defines the min % of cells expressing a gene for this to be employed in the analysis.
    threads = 15,                                     # number of threads to use in the analysis.
    debug_seed = 42,                                 # debug random seed; a value >= 0 enables it.
    result_precision = 3,                            # Sets the rounding for the mean values in significant_means.
    pvalue = 1,                                   # P-value threshold to employ for significance.
    subsampling = True,                              # To enable subsampling the data (geometric sketching).
    subsampling_log = False,                         # (mandatory) enable subsampling log1p for non log-transformed data inputs.
    subsampling_num_pc = 100,                        # Number of components to subsample via geometric sketching (default: 100).
    subsampling_num_cells = 10000,                    # Number of cells to subsample (integer) (default: 1/3 of the dataset).
    separator = '|',                                 # Sets the string to employ to separate cells in the results dataframes "cellA|CellB".
    debug = False,                                   # Saves all intermediate tables employed during the analysis in pkl format.
    output_path = "./out_total_repository/result_folder_124068+35287_slurm",                          # Path to save results.
    output_suffix = None                            # Replaces the timestamp in the output files with a user-defined string (default: None).
    )

Reading user files...
The following user files were loaded successfully:
/group_homes/public_cluster/home/u37189/Project/20230102_smell_loss/cellphoneDB/124068cell+33genes_log1p_normmatrix.h5ad
/group_homes/public_cluster/home/u37189/Project/20230102_smell_loss/cellphoneDB/124068cell+33genes_metadata.csv
[ ][CORE][15/03/23-19:03:04][INFO] Subsampling 129668 to 10000
[ ][CORE][15/03/23-19:03:04][WARNING] Subsampling failed: ignored.
[ ][CORE][15/03/23-19:03:04][INFO] [Cluster Statistical Analysis] Threshold:0.1 Iterations:100 Debug-seed:42 Threads:15 Precision:3
[ ][CORE][15/03/23-19:03:04][WARNING] Debug random seed enabled. Set to 42
[ ][CORE][15/03/23-19:03:05][INFO] Running Real Analysis
[ ][CORE][15/03/23-19:03:06][INFO] Running Statistical Analysis
100%|██████████| 100/100 [00:17<00:00,  5.67it/s]
[ ][CORE][15/03/23-19:03:39][INFO] Building Pvalues result
[ ][CORE][15/03/23-19:03:39][INFO] Building results
---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
Cell In[73], line 3
      1 from cellphonedb.src.core.methods import cpdb_statistical_analysis_method
----> 3 deconvoluted, means, pvalues, significant_means = cpdb_statistical_analysis_method.call(
      4     cpdb_file_path = cpdb_file_path,                 # mandatory: CellPhoneDB database zip file.
      5     meta_file_path = meta_file_path,                 # mandatory: tsv file defining barcodes to cell label.
      6     counts_file_path = counts_file_path,             # mandatory: normalized count matrix.
      7     counts_data = 'gene_name',                     # defines the gene annotation in counts matrix.
      8     #microenvs_file_path = microenvs_file_path,       # optional (default: None): defines cells per microenvironment.
      9     iterations = 100,                               # denotes the number of shufflings performed in the analysis.
     10     threshold = 0.1,                                 # defines the min % of cells expressing a gene for this to be employed in the analysis.
     11     threads = 15,                                     # number of threads to use in the analysis.
     12     debug_seed = 42,                                 # debug randome seed. To disable >=0.
     13     result_precision = 3,                            # Sets the rounding for the mean values in significan_means.
     14     pvalue = 1,                                   # P-value threshold to employ for significance.
     15     subsampling = True,                              # To enable subsampling the data (geometri sketching).
     16     subsampling_log = False,                         # (mandatory) enable subsampling log1p for non log-transformed data inputs.
     17     subsampling_num_pc = 100,                        # Number of componets to subsample via geometric skectching (dafault: 100).
     18     subsampling_num_cells = 10000,                    # Number of cells to subsample (integer) (default: 1/3 of the dataset).
     19     separator = '|',                                 # Sets the string to employ to separate cells in the results dataframes "cellA|CellB".
     20     debug = False,                                   # Saves all intermediate tables employed during the analysis in pkl format.
     21     output_path = "./out_total_repository/result_folder_124068+35287_slurm",                          # Path to save results.
     22     output_suffix = None                            # Replaces the timestamp in the output files by a user defined string in the  (default: None).
     23     )

File ~/miniconda3/envs/cpdb/lib/python3.8/site-packages/cellphonedb/src/core/methods/cpdb_statistical_analysis_method.py:108, in call(cpdb_file_path, meta_file_path, counts_file_path, counts_data, output_path, microenvs_file_path, iterations, threshold, threads, debug_seed, result_precision, pvalue, subsampling, subsampling_log, subsampling_num_pc, subsampling_num_cells, separator, debug, output_suffix)
    104     ss = subsampler.Subsampler(log=subsampling_log, num_pc=subsampling_num_pc, num_cells=subsampling_num_cells, verbose=False, debug_seed=None)
    105     counts = ss.subsample(counts)
    107 pvalues, means, significant_means, deconvoluted = \
--> 108     cpdb_statistical_analysis_complex_method.call(meta.copy(),
    109                                                   counts.copy(),
    110                                                   counts_data,
    111                                                   interactions,
    112                                                   genes,
    113                                                   complex_expanded,
    114                                                   complex_composition,
    115                                                   microenvs,
    116                                                   pvalue,
    117                                                   separator,
    118                                                   iterations,
    119                                                   threshold,
    120                                                   threads,
    121                                                   debug_seed,
    122                                                   result_precision,
    123                                                   debug,
    124                                                   output_path
    125                                                   )
    128 max_rank = significant_means['rank'].max()
    129 significant_means['rank'] = significant_means['rank'].apply(lambda rank: rank if rank != 0 else (1 + max_rank))

File ~/miniconda3/envs/cpdb/lib/python3.8/site-packages/cellphonedb/src/core/methods/cpdb_statistical_analysis_complex_method.py:117, in call(meta, counts, counts_data, interactions, genes, complexes, complex_compositions, microenvs, pvalue, separator, iterations, threshold, threads, debug_seed, result_precision, debug, output_path)
    100     with open(f"{output_path}/debug_intermediate.pkl", "wb") as fh:
    101         pickle.dump({
    102             "genes": genes,
    103             "interactions": interactions,
   (...)
    114             "statistical_mean_analysis": statistical_mean_analysis,
    115             "result_percent": result_percent}, fh)
--> 117 pvalues_result, means_result, significant_means, deconvoluted_result = build_results(
    118     interactions_filtered,
    119     interactions,
    120     counts_relations,
    121     real_mean_analysis,
    122     result_percent,
    123     clusters['means'],
    124     complex_composition_filtered,
    125     counts,
    126     genes,
    127     result_precision,
    128     pvalue,
    129     counts_data
    130 )
    131 return pvalues_result, means_result, significant_means, deconvoluted_result

File ~/miniconda3/envs/cpdb/lib/python3.8/site-packages/cellphonedb/src/core/methods/cpdb_statistical_analysis_complex_method.py:224, in build_results(interactions, interactions_original, counts_relations, real_mean_analysis, result_percent, clusters_means, complex_compositions, counts, genes, result_precision, pvalue, counts_data)
    220 significant_means_result = pd.concat([interactions_data_result, significant_mean_rank, significant_means], axis=1,
    221                                      join='inner', sort=False)
    223 # Document 5
--> 224 deconvoluted_result = deconvoluted_complex_result_build(clusters_means,
    225                                                         interactions,
    226                                                         complex_compositions,
    227                                                         counts,
    228                                                         genes,
    229                                                         counts_data)
    231 return pvalues_result, means_result, significant_means_result, deconvoluted_result

File ~/miniconda3/envs/cpdb/lib/python3.8/site-packages/cellphonedb/src/core/methods/cpdb_statistical_analysis_complex_method.py:261, in deconvoluted_complex_result_build(clusters_means, interactions, complex_compositions, counts, genes, counts_data)
    252 deconvoluted_complex_result_2 = deconvolute_complex_interaction_component(complex_compositions,
    253                                                                           genes_filtered,
    254                                                                           interactions,
    255                                                                           '_2',
    256                                                                           counts_data)
    257 deconvoluted_simple_result_2 = deconvolute_interaction_component(interactions,
    258                                                                  '_2',
    259                                                                  counts_data)
--> 261 deconvoluted_result = pd.concat([deconvoluted_complex_result_1, deconvoluted_simple_result_1, deconvoluted_complex_result_2, deconvoluted_simple_result_2], sort=False)
    263 deconvoluted_result.set_index('multidata_id', inplace=True, drop=True)
    265 deconvoluted_columns = ['gene_name', 'name', 'is_complex', 'protein_name', 'complex_name', 'id_cp_interaction',
    266                         'gene']

File ~/miniconda3/envs/cpdb/lib/python3.8/site-packages/pandas/util/_decorators.py:331, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    325 if len(args) > num_allow_args:
    326     warnings.warn(
    327         msg.format(arguments=_format_argument_list(allow_args)),
    328         FutureWarning,
    329         stacklevel=find_stack_level(),
    330     )
--> 331 return func(*args, **kwargs)

File ~/miniconda3/envs/cpdb/lib/python3.8/site-packages/pandas/core/reshape/concat.py:381, in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    159 """
    160 Concatenate pandas objects along a particular axis.
    161 
   (...)
    366 1   3   4
    367 """
    368 op = _Concatenator(
    369     objs,
    370     axis=axis,
   (...)
    378     sort=sort,
    379 )
--> 381 return op.get_result()

File ~/miniconda3/envs/cpdb/lib/python3.8/site-packages/pandas/core/reshape/concat.py:612, in _Concatenator.get_result(self)
    610         obj_labels = obj.axes[1 - ax]
    611         if not new_labels.equals(obj_labels):
--> 612             indexers[ax] = obj_labels.get_indexer(new_labels)
    614     mgrs_indexers.append((obj._mgr, indexers))
    616 new_data = concatenate_managers(
    617     mgrs_indexers, self.new_axes, concat_axis=self.bm_axis, copy=self.copy
    618 )

File ~/miniconda3/envs/cpdb/lib/python3.8/site-packages/pandas/core/indexes/base.py:3904, in Index.get_indexer(self, target, method, limit, tolerance)
   3901 self._check_indexing_method(method, limit, tolerance)
   3903 if not self._index_as_unique:
-> 3904     raise InvalidIndexError(self._requires_unique_msg)
   3906 if len(target) == 0:
   3907     return np.array([], dtype=np.intp)

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

jinlanzhan commented 1 year ago

Hello. Thank you guys for making CellPhoneDB, this is a great resource.

ktroule commented 1 year ago

Hi.

I've tried to reproduce your error but I've not been able to do it. Can you confirm that you have installed CellPhoneDB in a new conda environment with pip install cellphonedb?

Kind regards

deeKal commented 1 year ago

Hi, I have the exact same problem. I just installed cellphonedb in a new conda environment, and when I try to run

from cellphonedb.src.core.methods import cpdb_analysis_method

means, deconvoluted = cpdb_analysis_method.call(
    cpdb_file_path = cpdb_file_path,
    meta_file_path = meta_file_path,
    counts_file_path = counts_file_path,
    counts_data = 'ensembl_gene_id',
    output_path = out_path,
    separator = '|',
    threshold = 0.1,
    result_precision = 3,
    debug = False,
    output_suffix = None
)

I get the following:

[ ][CORE][11/06/23-20:12:14][INFO] [Non Statistical Method] Threshold:0.1 Precision:3
Reading user files...
The following user files were loaded successfully:
counts.txt
metadata.txt
[ ][CORE][11/06/23-20:13:24][INFO] Running Real Analysis
[ ][CORE][11/06/23-20:13:24][INFO] Building results
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/userpath/.local/lib/python3.9/site-packages/cellphonedb/src/core/methods/cpdb_analysis_method.py", line 143, in call
    means_result, significant_means, deconvoluted_result = build_results(
  File "userpath/.local/lib/python3.9/site-packages/cellphonedb/src/core/methods/cpdb_analysis_method.py", line 254, in build_results
    means_result = pd.concat([interactions_data_result, mean_analysis], axis=1, join='inner', sort=False)
  File "/userpath/.conda/envs/cpdb/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 385, in concat
    return op.get_result()
  File "/userpath/.conda/envs/cpdb/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 612, in get_result
    indexers[ax] = obj_labels.get_indexer(new_labels)
  File "/userpath/.conda/envs/cpdb/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 3731, in get_indexer
    raise InvalidIndexError(self._requires_unique_msg)
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

In the previous version of the tool, a similar error was due to the pandas version. I'm running Python 3.9, pandas 2.0.2 (>1.5.0 is required), and cellphonedb 4.0.0, if that makes any difference.

Thanks!
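
When comparing environments like this, it can help to report the exact versions and import locations in one go. In the traceback above, cellphonedb is imported from a user-level .local directory while pandas comes from the conda environment, so the location of each package may be worth checking as well. The following is only a sketch using the standard library; the helper name describe_package is made up for illustration and is not part of CellphoneDB.

```python
import sys
from importlib import metadata, util


def describe_package(name: str) -> dict:
    """Return the installed version and import location of a package."""
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = "not installed"
    # find_spec returns None when the top-level package cannot be imported.
    spec = util.find_spec(name)
    location = spec.origin if spec is not None else None
    return {"version": version, "location": location}


# Collect the interpreter plus the two packages discussed in this thread.
report = {"python": {"version": sys.version.split()[0], "location": sys.executable}}
for package in ("pandas", "cellphonedb"):
    report[package] = describe_package(package)

for name, info in report.items():
    print(f"{name}: {info['version']} ({info['location']})")
```

If the locations point to different site-packages directories, the environment is mixing a user-level install with the conda one, which makes version reports harder to interpret.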

ktroule commented 1 year ago

Hi.

I've checked, and no error appears with Python 3.8.16 and pandas 2.0.1, or with Python 3.10.9 and pandas 2.0.2. We are going to keep an eye on this.

datasome commented 1 year ago

Hi deeKal, is it possible that you have some duplicated columns in your counts file? (see: https://stackoverflow.com/questions/35084071/concat-dataframe-reindexing-only-valid-with-uniquely-valued-index-objects). Kind regards, Robert.
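
One way to check for this, sketched with pandas. It assumes a tab-separated text counts file with gene identifiers in the first column and cell barcodes in the header row; the helper name find_duplicate_labels and the toy data are illustrative only and do not come from CellphoneDB.

```python
import io

import pandas as pd


def find_duplicate_labels(source, sep="\t"):
    """Return (duplicated gene ids, duplicated cell barcodes) from a counts table."""
    # header=None keeps the raw header row as data. With a normal header,
    # read_csv would rename a repeated "cell1" to "cell1.1" and hide the duplicate.
    raw = pd.read_csv(source, sep=sep, header=None, dtype=str)
    barcodes = raw.iloc[0, 1:]  # first row, minus the corner cell
    genes = raw.iloc[1:, 0]     # first column, minus the corner cell
    dup_barcodes = barcodes[barcodes.duplicated()].unique().tolist()
    dup_genes = genes[genes.duplicated()].unique().tolist()
    return dup_genes, dup_barcodes


# Toy stand-in for a counts file: one repeated barcode and one repeated gene.
toy_counts = io.StringIO(
    "Gene\tcell1\tcell2\tcell1\n"
    "ENSG01\t0.0\t1.2\t0.3\n"
    "ENSG02\t2.1\t0.0\t0.5\n"
    "ENSG01\t0.4\t0.0\t0.0\n"
)
dup_genes, dup_barcodes = find_duplicate_labels(toy_counts)
print("duplicated genes:", dup_genes)
print("duplicated barcodes:", dup_barcodes)
```

Reading everything as strings is wasteful for a large matrix, so for real data it may be preferable to read only the header row and the first column.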

deeKal commented 1 year ago

Thank you for your answer!

After debugging the corresponding code, I saw that it crashes in build_results() and more specifically, in means_result = pd.concat([interactions_data_result, mean_analysis], axis=1, join='inner', sort=False).

The interactions_data_result dataframe has duplicate index values, although the interactions are different. I saved the two dataframes in text files which you can find here.

I hope this helps.

In the meantime, I'll try to run cellphoneDB using other versions of the packages.
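
For readers hitting the same message, the failure mode described above can be reproduced in isolation. This is a minimal sketch with made-up data: the two frames only borrow the variable names from the traceback, and the values are hypothetical. It shows that a column-wise inner concat fails as soon as one input has a repeated index value, and how to list the offending rows.

```python
import pandas as pd
from pandas.errors import InvalidIndexError

# Hypothetical stand-ins for the two frames named in the traceback.
interactions_data_result = pd.DataFrame(
    {"interacting_pair": ["A_B", "C_D", "E_F"]},
    index=[0, 1, 1],  # index value 1 appears twice
)
mean_analysis = pd.DataFrame({"cellA|cellB": [0.5, 0.1, 0.3]}, index=[0, 1, 2])

# Same call shape as in build_results(): align on the index, keep shared rows.
try:
    pd.concat([interactions_data_result, mean_analysis],
              axis=1, join="inner", sort=False)
    raised = False
except InvalidIndexError:
    raised = True

# List every row whose index value occurs more than once.
dup_mask = interactions_data_result.index.duplicated(keep=False)
duplicated_rows = interactions_data_result[dup_mask]

print("concat raised InvalidIndexError:", raised)
print(duplicated_rows)
```

Checking index.is_unique on each input before the concat is a cheap way to find which frame carries the duplicates.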

datasome commented 1 year ago

Hi Despina, Many thanks for sharing your counts file - you've helped us fix a genuine bug that our existing test data sets had failed to pick up. Please could you do: pip install "git+https://github.com/ventolab/CellphoneDB.git" and try again? Your CellphoneDB analysis should complete successfully now. In any case, please let us know of any further issues and best of luck with your research. Best, Robert.

deeKal commented 1 year ago

It ran! Thanks so much Robert!

ventolab / CellphoneDB

InvalidIndexError: Reindexing only valid with uniquely valued Index objects #97