perslab / CELLEX

CELLEX (CELL-type EXpression-specificity)
GNU General Public License v3.0

CELLEX crashes if duplicated cell_ids #8

Closed: pascaltimshel closed this issue 4 years ago

pascaltimshel commented 5 years ago

Problem: CELLEX crashes if the data contains duplicated cell_ids (column labels).

Solution: check for duplicated cell_ids before running the function.

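# Imports, run in an earlier notebook cell
import numpy as np
import pandas as pd
import cellex

# Five cells, but the cell_id "D" occurs twice in the column labels
data = pd.DataFrame(np.random.randint(0,100,size=(100, 5)), columns=list('ABCDD'))
data.head()
metadata = pd.DataFrame(data={"cell_type":["X","X","X", "Y", "Y"]}, index=data.columns)
metadata.head()
eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)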
Preprocessing - running remove_non_expressed ... excluded 0 / 100 genes in 0 min 0 sec
Preprocessing - normalizing data ... data normalized in 0 min 0 sec
---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
<ipython-input-17-537009d482a9> in <module>
      3 metadata = pd.DataFrame(data={"cell_type":["X","X","X", "Y", "Y"]}, index=data.columns)
      4 metadata.head()
----> 5 eso = cellex.ESObject(data=data, annotation=metadata, verbose=True)

/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/cellex/esobject.py in __init__(self, data, annotation, remove_non_expressed, normalize, anova, verbose)
     53 
     54         if type(annotation) is pd.Series:
---> 55             annotation = data.columns.map(annotation, na_action="ignore").values.astype(str)
     56 
     57         if anova:

/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/core/indexes/base.py in map(self, mapper, na_action)
   4872         from .multi import MultiIndex
   4873 
-> 4874         new_values = super()._map_values(mapper, na_action=na_action)
   4875 
   4876         attributes = self._get_attributes_dict()

/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/core/base.py in _map_values(self, mapper, na_action)
   1275                 values = self.values
   1276 
-> 1277             indexer = mapper.index.get_indexer(values)
   1278             new_values = algorithms.take_1d(mapper._values, indexer)
   1279 

/tools/anaconda/envs/djw472/py3_PT/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_indexer(self, target, method, limit, tolerance)
   2976         if not self.is_unique:
   2977             raise InvalidIndexError(
-> 2978                 "Reindexing only valid with uniquely" " valued Index objects"
   2979             )
   2980 

InvalidIndexError: Reindexing only valid with uniquely valued Index objects
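
The check proposed at the top of this comment could look roughly like the sketch below. It uses only pandas. The helper names `find_duplicated_cell_ids` and `check_unique_cell_ids` are hypothetical and are not part of the CELLEX API; the sketch assumes, as in the example above, that cell_ids are the column labels of `data`.

```python
import numpy as np
import pandas as pd


def find_duplicated_cell_ids(data: pd.DataFrame) -> list:
    """Return the cell_ids (column labels) that occur more than once."""
    columns = data.columns
    # Index.duplicated(keep=False) flags every occurrence of a repeated label.
    return columns[columns.duplicated(keep=False)].unique().tolist()


def check_unique_cell_ids(data: pd.DataFrame) -> None:
    """Hypothetical pre-flight check: fail early with a readable message."""
    duplicated = find_duplicated_cell_ids(data)
    if duplicated:
        raise ValueError(f"Duplicated cell_ids found in data columns: {duplicated}")


# Same column layout as the failing example: "D" appears twice.
data = pd.DataFrame(np.random.randint(0, 100, size=(100, 5)), columns=list("ABCDD"))
print(find_duplicated_cell_ids(data))  # ['D']
```

Calling `check_unique_cell_ids(data)` before constructing the `ESObject` would raise a `ValueError` that names the offending cell_ids, instead of the `InvalidIndexError` raised deep inside pandas.
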

pascaltimshel commented 5 years ago

Better solution: automatic renaming of duplicated cell_ids?
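
One way the automatic renaming could work is sketched below. This is not necessarily what the eventual fix does; `make_cell_ids_unique` and the `_1`, `_2`, ... suffix scheme are assumptions made for illustration. The first occurrence of each cell_id keeps its name and later occurrences get a running suffix.

```python
import pandas as pd


def make_cell_ids_unique(cell_ids: pd.Index, sep: str = "_") -> pd.Index:
    """Hypothetical helper: rename repeated labels, e.g. D, D -> D, D_1."""
    labels = pd.Series(cell_ids.astype(str))
    # cumcount numbers the occurrences of each label: 0, 1, 2, ...
    occurrence = labels.groupby(labels).cumcount()
    # Keep first occurrences unchanged, suffix the later ones.
    # Note: this does not guard against a pre-existing label such as "D_1".
    renamed = labels.where(occurrence == 0, labels + sep + occurrence.astype(str))
    return pd.Index(renamed)


data = pd.DataFrame([[1, 2, 3, 4, 5]], columns=list("ABCDD"))
data.columns = make_cell_ids_unique(data.columns)
print(data.columns.tolist())  # ['A', 'B', 'C', 'D', 'D_1']
```

If the annotation is indexed by the same cell_ids, as `metadata` is in the example above, its index would need the same renaming so that the two stay aligned.
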

tstannius commented 4 years ago

Fixed by @Satannius in 47845ba8a3cefdf1319344b96567e654cad9fe25