hnolCol / ComplexFinder

Finds complexes from Blue-Native and SEC Fractionation Complexome Profiling Data. Each fraction is usually analysed by Liquid Chromatography coupled to Mass Spectrometry (LC-MS/MS).
MIT License

noDatabaseForPredictions=True fails with KeyError: 'Cluster Labels' #17

Open aretaon opened 2 months ago

aretaon commented 2 months ago

Error description

ComplexFinder does not run in no-database mode.

How to reproduce

Starting from the provided example files, the following code triggers the error:

from ComplexFinder.src.main import ComplexFinder
X = "ComplexFinder/example-data/D0"
ComplexFinder(analysisName = "ExampleRun_01",
              runName = "ExampleRun_01_noDB",
              idColumn = "Uniprot ID",
              noDatabaseForPredictions=True,
              grouping={'WT': 'D0_aebersold.txt'}).run(X)

fails with

-> 1672 df = df.sort_values(by="Cluster Labels")
KeyError: 'Cluster Labels'

However, running the same code with noDatabaseForPredictions=False leads to no errors.
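
The failure at line 1672 can be reproduced without ComplexFinder. The following is a minimal sketch using only pandas: the entry names and label values are made up for illustration, and `analysisName` stands in for the variable of the same name inside `_clusterInteractions`. It does not explain why the run with noDatabaseForPredictions=False is unaffected.

```python
import pandas as pd

analysisName = "ExampleRun_01"  # illustrative value, taken from the call above
labelColumn = "Cluster Labels({})".format(analysisName)

# Made-up data: the column is created WITH the analysis-name suffix.
df = pd.DataFrame({
    "Entry": ["P1_P2", "P3_P4", "P5_P6"],
    labelColumn: [2, -1, 1],
})

# Sorting by the bare name fails because no such column exists.
try:
    df.sort_values(by="Cluster Labels")
    raised = False
except KeyError as err:
    raised = True
    print("KeyError:", err)

# Sorting by the suffixed name works.
sortedDf = df.sort_values(by=labelColumn)
print(sortedDf)
```

The same mismatch affects any later lookup that uses the bare column name on this DataFrame.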

System properties

OS: Fedora 40
Python: 3.8
Dependencies:
jupyter = ">=1.1.1,<2"
asteval = "<=0.9.19"
certifi = "<=2022.12.7"
cycler = "<=0.10.0"
cython = "<=0.29.21"
future = "<=0.18.2"
hdbscan = "<=0.8.29"
joblib = "<=1.2.0"
kiwisolver = "<=1.3.1"
llvmlite = "<=0.34.0"
lmfit = "<=1.0.1"
matplotlib = "<=3.3.2"
numba = "<=0.51.2"
numpy = "<=1.22.0"
pandas = "<=1.1.4"
pillow = "<=9.3.0"
pyparsing = "<=2.4.7"
python-dateutil = "<=2.8.1"
pytz = "<=2020.4"
scikit-learn = "<=0.23.2"
scipy = "<=1.5.4"
seaborn = "<=0.11.0"
six = "<=1.15.0"
threadpoolctl = "<=2.1.0"
umap-learn = ">=0.5.0"
uncertainties = "<=3.1.4"
imbalanced-learn = ">=0.7.0,<0.8"

Full traceback:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_1245616/3156604555.py in ?()
----> 1 ComplexFinder(analysisName = "ExampleRun_01",
      2               runName = "ExampleRun_01_noDB",
      3               idColumn = "Uniprot ID",
      4               noDatabaseForPredictions=True,

~/nvme_data/Projects/ComplexFinder_debug/ComplexFinder/src/main.py in ?(self, X, maxValueToOne)
   2207 groupFileNames = [groupFileNames]
   2208 self._clusterInteractions(combinedInteractions,groupFiles = groupFileNames,groupName = groupName)
   2209 else:
   2210     print("Info :: Cluster Interactions")
-> 2211     self._clusterInteractions(None)
   2212
   2213
   2214 self.params["runTimes"]["Interaction Clustering and Embedding"] = time.time() - endTrainingTime

~/nvme_data/Projects/ComplexFinder_debug/ComplexFinder/src/main.py in ?(self, predInts, clusterMethod, plotEmbedding, groupFiles, combineProbs, groupName)
   1668 umapKwargs = self.params["umapDefaultKwargs"],
   1669 generateSquareMatrix = True,
   1670 )
   1671 df = pd.DataFrame().from_dict({"Entry":intLabels,"Cluster Labels({})".format(analysisName):clusterLabels,"reachability":reachability,"core_distances":core_distances})
-> 1672 df = df.sort_values(by="Cluster Labels")
   1673 df = df.set_index("Entry")
   1674
   1675 if pooledDistances is not None:

~/nvme_data/Projects/ComplexFinder_debug/.pixi/envs/default/lib/python3.8/site-packages/pandas/core/frame.py in ?(self, by, axis, ascending, inplace, kind, na_position, ignore_index, key)
   5294 else:
   5295     from pandas.core.sorting import nargsort
   5296
   5297     by = by[0]
-> 5298     k = self._get_label_or_level_values(by, axis=axis)
   5299
   5300     # need to rewrap column in Series to apply key function
   5301     if key is not None:

~/nvme_data/Projects/ComplexFinder_debug/.pixi/envs/default/lib/python3.8/site-packages/pandas/core/generic.py in ?(self, key, axis)
   1559     values = self.xs(key, axis=other_axes[0])._values
   1560 elif self._is_level_reference(key, axis=axis):
   1561     values = self.axes[axis].get_level_values(key)._values
   1562 else:
-> 1563     raise KeyError(key)
   1564
   1565 # Check for duplicates
   1566 if values.ndim > 1:

KeyError: 'Cluster Labels'

aretaon commented 1 month ago

Apparently the issue is a mismatch in column naming: the DataFrame is created with the column "Cluster Labels({})".format(analysisName), but a few lines below the same column is accessed as "Cluster Labels" for sorting and filtering, e.g.

-                    noNoiseIndex = df.index[df["Cluster Labels"] > 0]
+                    noNoiseIndex = df.index[df["Cluster Labels({})".format(analysisName)] > 0]

Fixing this in each of the three affected places (see complete diff below) also fixes the error. If this is the intended behaviour, please feel free to patch with the diff.
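
As an alternative to repeating the format call at every use, the column name could be built once and reused. This is only a sketch under assumptions: `sortAndFilterClusters` is a hypothetical helper that does not exist in ComplexFinder, the input values are made up, and the `> 0` filter is copied unchanged from the existing code.

```python
import pandas as pd

def sortAndFilterClusters(intLabels, clusterLabels, analysisName):
    """Hypothetical helper (not part of ComplexFinder): build the
    suffixed column name once and reuse it for sorting and filtering."""
    labelColumn = "Cluster Labels({})".format(analysisName)
    df = pd.DataFrame({"Entry": intLabels, labelColumn: clusterLabels})
    df = df.sort_values(by=labelColumn).set_index("Entry")
    # Same threshold as in the original code; only the column name differs.
    noNoiseIndex = df.index[df[labelColumn] > 0]
    return df, noNoiseIndex

# Made-up example input.
df, noNoiseIndex = sortAndFilterClusters(
    ["A_B", "C_D", "E_F"], [1, -1, 2], "ExampleRun_01"
)
print(df)
print(list(noNoiseIndex))
```

Keeping the name in a single variable means the creation, sort and filter steps cannot drift apart again.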


diff --git a/src/main.py b/src/main.py
index e7be847..33cbf41 100644
--- a/src/main.py
+++ b/src/main.py
@@ -1670,7 +1670,7 @@ class ComplexFinder(object):
                                                                                                     generateSquareMatrix = True,
                                                                                                     )
                     df = pd.DataFrame().from_dict({"Entry":intLabels,"Cluster Labels({})".format(analysisName):clusterLabels,"reachability":reachability,"core_distances":core_distances})
-                    df = df.sort_values(by="Cluster Labels")
+                    df = df.sort_values(by="Cluster Labels({})".format(analysisName))
                     df = df.set_index("Entry")

                     if pooledDistances is not None:
@@ -1679,7 +1679,7 @@ class ComplexFinder(object):
                     squaredDf = pd.DataFrame(matrix,columns=intLabels,index=intLabels).loc[df.index,df.index]
                     squaredDf.to_csv(os.path.join(pathToFolder,"SquaredSorted_{}.txt".format(self.currentAnalysisName)),sep="\t")

-                    noNoiseIndex = df.index[df["Cluster Labels"] > 0]
+                    noNoiseIndex = df.index[df["Cluster Labels({})".format(analysisName)] > 0]

                     squaredDf.loc[noNoiseIndex,noNoiseIndex].to_csv(os.path.join(pathToFolder,"NoNoiseSquaredSorted_{}.txt".format(self.currentAnalysisName)),sep="\t")
                     splitLabels = True
@@ -1691,7 +1691,7 @@ class ComplexFinder(object):
                     dfEmbed["clusterLabels({})".format(analysisName)] = clusterLabels
                     dfEmbed["labels({})".format(analysisName)] = intLabels
                     if splitLabels:
-                        dfEmbed["sLabels"] = dfEmbed["labels"].str.split("_",expand=True).values[:,0]
+                        dfEmbed["sLabels"] = dfEmbed["labels({})".format(analysisName)].str.split("_",expand=True).values[:,0]
                         dfEmbed = dfEmbed.set_index("sLabels")
                     else:
                         dfEmbed = dfEmbed.set_index("labels({})".format(analysisName))
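
The last hunk follows the same pattern for the embedding table. A minimal standalone sketch of the patched line is shown here; the label strings are made up, and the assumption that labels have the form "first_second" is inferred only from the split on "_" in the diff.

```python
import pandas as pd

analysisName = "ExampleRun_01"  # illustrative value
labelsColumn = "labels({})".format(analysisName)

# Made-up labels; the real ones come from intLabels in _clusterInteractions.
dfEmbed = pd.DataFrame({labelsColumn: ["P12345_Q67890", "P11111_Q22222"]})

# Patched line: read from the suffixed column, keep the part before "_".
dfEmbed["sLabels"] = dfEmbed[labelsColumn].str.split("_", expand=True).values[:, 0]
dfEmbed = dfEmbed.set_index("sLabels")
print(dfEmbed)
```

With the unpatched lookup dfEmbed["labels"], this step would raise a KeyError for the same reason as the sort on "Cluster Labels".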

hnolCol commented 1 month ago

Thank you very much and please excuse the late reply. I will double-check tomorrow afternoon and then accept/edit. Thanks again! Cheers Hendrik