mortazavilab / PyWGCNA

PyWGCNA is a Python package designed to do Weighted Gene Correlation Network analysis (WGCNA)
https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad415/7218311
MIT License
209 stars, 48 forks

KeyError: "gene_id" when running "Comparing two PyWGCNA objects" #18

Closed tyaoi closed 1 year ago

tyaoi commented 1 year ago

Hi!

I successfully obtained the modules, Enrichr results, and network for each of my two datasets. Next, I tried to compare the two PyWGCNA objects and got a KeyError for "gene_id". The two geneExpr tables share the same ENSG IDs, but their sample IDs are completely different.

Any advice would be appreciated!

nargesr commented 1 year ago

Hi! Would you mind sending me your comparison pickle file?

tyaoi commented 1 year ago

Hi!

I am on a business trip to a conference and do not have the data at hand, which delayed my reply.

Since each pickle file is too large to attach, I will describe below what I can confirm for now.

First, the expression data used in this study were downloaded from TCGA, and all phenotype data were categorical. The phenotype data were converted to dummy variables with the pandas.get_dummies() function before use.
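As a rough illustration of that conversion, categorical phenotype columns can be expanded into indicator columns as in the sketch below. The sample IDs, column names, and categories are placeholders, not the actual TCGA phenotype table.

```python
import pandas as pd

# Hypothetical phenotype table; sample IDs and categories are placeholders.
pheno = pd.DataFrame(
    {"pathology": ["A", "AB", "B1"], "sex": ["F", "M", "F"]},
    index=["sample_1", "sample_2", "sample_3"],
)

# One indicator column per category, named "<column>_<category>".
pheno_dummies = pd.get_dummies(pheno, columns=["pathology", "sex"], dtype=int)
print(pheno_dummies.columns.tolist())
```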

I split the data into two groups based on the cancer pathology classification labels (5 types), ran PyWGCNA on each group, and saved each result as a pickle file. When I ran the comparePyWGCNAs() function on the two resulting objects, I got KeyError: 'gene_id'.

I think I have found the probable cause of the problem.

I read each of the two pickle files with the readWGCNA() function, inspected the datExpr.var attribute, and checked its shape. The number of rows (that is, gene IDs) differed between the two objects.

This discrepancy is probably the cause of the KeyError.

When I ran each PyWGCNA analysis, the number of gene IDs in the two expression matrices still matched once the pre-processing workflow had finished.

Perhaps the difference in datExpr.var arose during the execution of the analyzeWGCNA() function. However, geneList in both runs uses the same pandas DataFrame, created by extracting only the protein_coding genes.
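The check described above can be sketched with plain pandas. Here var1 and var2 are small stand-ins for the two datExpr.var tables, and the gene IDs and module colors are placeholders. Since the error names 'gene_id' as the missing key, the sketch also checks whether that label exists as a column at all.

```python
import pandas as pd

# Stand-ins for WGCNA1.datExpr.var and WGCNA2.datExpr.var (placeholder IDs).
var1 = pd.DataFrame(
    {"moduleColors": ["black", "white", "black"]},
    index=pd.Index(["ENSG_a", "ENSG_b", "ENSG_c"]),
)
var2 = pd.DataFrame(
    {"moduleColors": ["white", "black"]},
    index=pd.Index(["ENSG_a", "ENSG_c"]),
)

# Compare the shapes first, then look at which genes differ.
print(var1.shape, var2.shape)
only_in_1 = var1.index.difference(var2.index)
only_in_2 = var2.index.difference(var1.index)
shared = var1.index.intersection(var2.index)
print(len(only_in_1), len(only_in_2), len(shared))

# Check whether 'gene_id' is a column, or whether the IDs live only in the index.
has_gene_id = ["gene_id" in v.columns for v in (var1, var2)]
print(has_gene_id)
```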

tyaoi commented 1 year ago

Hi!

I forgot to attach the output from running the comparePyWGCNAs() function.


KeyError                                  Traceback (most recent call last)
File ~/anaconda3/envs/pywgcna_1.15.0_asia/lib/python3.9/site-packages/pandas/core/indexes/base.py:3803, in Index.get_loc(self, key, method, tolerance)
   3802 try:
-> 3803     return self._engine.get_loc(casted_key)
   3804 except KeyError as err:

File ~/anaconda3/envs/pywgcna_1.15.0_asia/lib/python3.9/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~/anaconda3/envs/pywgcna_1.15.0_asia/lib/python3.9/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'gene_id'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Input In [13], in <cell line: 1>()
----> 1 A_AB_vs_B1_B2_B3 = PyWGCNA.comparePyWGCNAs(pyWGCNA_A_AB, pyWGCNA_B1_B2_B3)
      2 A_AB_vs_B1_B2_B3.comparison

File ~/anaconda3/envs/pywgcna_1.15.0_asia/lib/python3.9/site-packages/PyWGCNA/utils.py:55, in comparePyWGCNAs(WGCNA1, WGCNA2)
     42 """
     43 Compare two WGCNAs
     44
   (...)
     51 :rtype: Compare class
     52 """
     53 compare = Comparison(name1=WGCNA1.name, name2=WGCNA2.name,
     54                      geneModule1=WGCNA1.datExpr.var, geneModule2=WGCNA2.datExpr.var)
---> 55 compare.compareNetworks()
     57 return compare

File ~/anaconda3/envs/pywgcna_1.15.0_asia/lib/python3.9/site-packages/PyWGCNA/comparison.py:77, in Comparison.compareNetworks(self)
     75 count = 0
     76 for moduleColor1 in moduleColors1:
---> 77     node1 = self.geneModule1.loc[self.geneModule1.moduleColors == moduleColor1, 'gene_id'].tolist()
     78     genes = genes + node1
     79     for moduleColor2 in moduleColors2:

File ~/anaconda3/envs/pywgcna_1.15.0_asia/lib/python3.9/site-packages/pandas/core/indexing.py:1067, in _LocationIndexer.__getitem__(self, key)
   1065     if self._is_scalar_access(key):
   1066         return self.obj._get_value(*key, takeable=self._takeable)
-> 1067     return self._getitem_tuple(key)
   1068 else:
   1069     # we by definition only have the 0th axis
   1070     axis = self.axis or 0

File ~/anaconda3/envs/pywgcna_1.15.0_asia/lib/python3.9/site-packages/pandas/core/indexing.py:1247, in _LocIndexer._getitem_tuple(self, tup)
   1245 with suppress(IndexingError):
   1246     tup = self._expand_ellipsis(tup)
-> 1247     return self._getitem_lowerdim(tup)
   1249 # no multi-index, so validate all of the indexers
   1250 tup = self._validate_tuple_indexer(tup)

File ~/anaconda3/envs/pywgcna_1.15.0_asia/lib/python3.9/site-packages/pandas/core/indexing.py:967, in _LocationIndexer._getitem_lowerdim(self, tup)
    963 for i, key in enumerate(tup):
    964     if is_label_like(key):
    965         # We don't need to check for tuples here because those are
    966         # caught by the _is_nested_tuple_indexer check above.
--> 967         section = self._getitem_axis(key, axis=i)
    969         # We should never have a scalar section here, because
    970         # _getitem_lowerdim is only called after a check for
    971         # is_scalar_access, which that would be.
    972         if section.ndim == self.ndim:
    973             # we're in the middle of slicing through a MultiIndex
    974             # revise the key wrt to section by inserting an _NS

File ~/anaconda3/envs/pywgcna_1.15.0_asia/lib/python3.9/site-packages/pandas/core/indexing.py:1312, in _LocIndexer._getitem_axis(self, key, axis)
   1310 # fall thru to straight lookup
   1311 self._validate_key(key, axis)
-> 1312 return self._get_label(key, axis=axis)

File ~/anaconda3/envs/pywgcna_1.15.0_asia/lib/python3.9/site-packages/pandas/core/indexing.py:1260, in _LocIndexer._get_label(self, label, axis)
   1258 def _get_label(self, label, axis: int):
   1259     # GH#5567 this will fail if the label is not present in the axis.
-> 1260     return self.obj.xs(label, axis=axis)

File ~/anaconda3/envs/pywgcna_1.15.0_asia/lib/python3.9/site-packages/pandas/core/generic.py:4041, in NDFrame.xs(self, key, axis, level, drop_level)
   4039 if axis == 1:
   4040     if drop_level:
-> 4041         return self[key]
   4042     index = self.columns
   4043 else:

File ~/anaconda3/envs/pywgcna_1.15.0_asia/lib/python3.9/site-packages/pandas/core/frame.py:3804, in DataFrame.__getitem__(self, key)
   3802 if self.columns.nlevels > 1:
   3803     return self._getitem_multilevel(key)
-> 3804 indexer = self.columns.get_loc(key)
   3805 if is_integer(indexer):
   3806     indexer = [indexer]

File ~/anaconda3/envs/pywgcna_1.15.0_asia/lib/python3.9/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key, method, tolerance)
   3803     return self._engine.get_loc(casted_key)
   3804 except KeyError as err:
-> 3805     raise KeyError(key) from err
   3806 except TypeError:
   3807     # If we have a listlike key, _check_indexing_error will raise
   3808     # InvalidIndexError. Otherwise we fall through and re-raise
   3809     # the TypeError.
   3810     self._check_indexing_error(key)

KeyError: 'gene_id'

anfoss commented 1 year ago

Same here. The compareNetworks function in comparison.py contains the following line:

node1 = self.geneModule1.loc[self.geneModule1.moduleColors == moduleColor1, 'gene_id'].tolist()

In my case, gene_id is the index rather than a column. I am not sure why this happens, but a quick fix is to add the following before the loop:

self.geneModule1['gene_id'] = self.geneModule1.index
self.geneModule2['gene_id'] = self.geneModule2.index

With that change, everything works.
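For anyone who would rather not edit the installed package, the same idea can be sketched with plain pandas. The helper name ensure_gene_id_column is hypothetical and not part of PyWGCNA, and gene_module is a small stand-in for a datExpr.var table with placeholder IDs. The sketch first reproduces the KeyError, then shows the lookup succeeding once the index has been copied into a 'gene_id' column.

```python
import pandas as pd


def ensure_gene_id_column(var: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `var` with a 'gene_id' column, filled from the index if missing.

    Hypothetical helper for illustration; it is not part of PyWGCNA.
    """
    var = var.copy()
    if "gene_id" not in var.columns:
        var["gene_id"] = var.index
    return var


# Stand-in for datExpr.var where the gene IDs live only in the index.
gene_module = pd.DataFrame(
    {"moduleColors": ["black", "white", "black"]},
    index=["ENSG_a", "ENSG_b", "ENSG_c"],
)

# Same lookup pattern as in compareNetworks: fails while 'gene_id' is not a column.
try:
    gene_module.loc[gene_module.moduleColors == "black", "gene_id"]
except KeyError as err:
    print("before the fix:", repr(err))

# After copying the index into a column, the lookup succeeds.
fixed = ensure_gene_id_column(gene_module)
node1 = fixed.loc[fixed.moduleColors == "black", "gene_id"].tolist()
print(node1)
```

Because the helper returns a copy, the original table is left untouched. Whether the result can simply be assigned back to datExpr.var on a loaded PyWGCNA object before calling comparePyWGCNAs() is an assumption that has not been verified here.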

tyaoi commented 1 year ago

Hi anfoss,

Your code works well!

Thank you so much!

I will close this issue now.