pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License
141 stars 23 forks source link

Tutorial question on filtering out by mitochondrial content #212

Closed ScienceComputing closed 11 months ago

ScienceComputing commented 11 months ago

Describe the issue Hello. In my python programming end, the code to filter out the cells with high mitochondrial contents cannot run successfully. Look forward to learning from your insights in this regard. Thanks!

What is the exact command that was run?

# for each cell compute fraction of counts in mito genes vs. all genes
# the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)
adata.obs['percent_mito'] = np.sum(
    adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1

Command output (with --verbose flag)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[60], line 3
      1 # For each cell, compute fraction of counts in mito genes vs. all genes
      2 # the `.A1` is only necessary as X is sparse (to transform to a dense array after summing)
----> 3 adata.obs['percent_mito'] = np.sum(adata[:, mito_genes].X, axis=1).A1 / np.sum(adata.X, axis=1).A1

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/anndata/_core/anndata.py:1100, in AnnData.__getitem__(self, index)
   1098 def __getitem__(self, index: Index) -> "AnnData":
   1099     """Returns a sliced view of the object."""
-> 1100     oidx, vidx = self._normalize_indices(index)
   1101     return AnnData(self, oidx=oidx, vidx=vidx, asview=True)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/anndata/_core/anndata.py:1081, in AnnData._normalize_indices(self, index)
   1080 def _normalize_indices(self, index: Optional[Index]) -> Tuple[slice, slice]:
-> 1081     return _normalize_indices(index, self.obs_names, self.var_names)

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/anndata/_core/index.py:33, in _normalize_indices(index, names0, names1)
     31 ax0, ax1 = unpack_index(index)
     32 ax0 = _normalize_index(ax0, names0)
---> 33 ax1 = _normalize_index(ax1, names1)
     34 return ax0, ax1

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/anndata/_core/index.py:98, in _normalize_index(indexer, index)
     96         if np.any(positions < 0):
     97             not_found = indexer[positions < 0]
---> 98             raise KeyError(
     99                 f"Values {list(not_found)}, from {list(indexer)}, "
    100                 "are not valid obs/ var names or indices."
    101             )
    102         return positions  # np.ndarray[int]
    103 else:

KeyError: "Values ['ENSMUSG00000064336', 'ENSMUSG00000064337', 'ENSMUSG00000064338', 'ENSMUSG00000064339', 'ENSMUSG00000064340', 'ENSMUSG00000064341', 'ENSMUSG00000064342', 'ENSMUSG00000064343', 'ENSMUSG00000064344', 'ENSMUSG00000064345', 'ENSMUSG00000064346', 'ENSMUSG00000064347', 'ENSMUSG00000064348', 'ENSMUSG00000064349', 'ENSMUSG00000064350', 'ENSMUSG00000064351', 'ENSMUSG00000064352', 'ENSMUSG00000064353', 'ENSMUSG00000064354', 'ENSMUSG00000064355', 'ENSMUSG00000064356', 'ENSMUSG00000064357', 'ENSMUSG00000064358', 'ENSMUSG00000064359', 'ENSMUSG00000064360', 'ENSMUSG00000064361', 'ENSMUSG00000065947', 'ENSMUSG00000064363', 'ENSMUSG00000064364', 'ENSMUSG00000064365', 'ENSMUSG00000064366', 'ENSMUSG00000064367', 'ENSMUSG00000064368', 'ENSMUSG00000064369', 'ENSMUSG00000064370', 'ENSMUSG00000064371', 'ENSMUSG00000064372'], from ['ENSMUSG00000064336', 'ENSMUSG00000064337', 'ENSMUSG00000064338', 'ENSMUSG00000064339', 'ENSMUSG00000064340', 'ENSMUSG00000064341', 'ENSMUSG00000064342', 'ENSMUSG00000064343', 'ENSMUSG00000064344', 'ENSMUSG00000064345', 'ENSMUSG00000064346', 'ENSMUSG00000064347', 'ENSMUSG00000064348', 'ENSMUSG00000064349', 'ENSMUSG00000064350', 'ENSMUSG00000064351', 'ENSMUSG00000064352', 'ENSMUSG00000064353', 'ENSMUSG00000064354', 'ENSMUSG00000064355', 'ENSMUSG00000064356', 'ENSMUSG00000064357', 'ENSMUSG00000064358', 'ENSMUSG00000064359', 'ENSMUSG00000064360', 'ENSMUSG00000064361', 'ENSMUSG00000065947', 'ENSMUSG00000064363', 'ENSMUSG00000064364', 'ENSMUSG00000064365', 'ENSMUSG00000064366', 'ENSMUSG00000064367', 'ENSMUSG00000064368', 'ENSMUSG00000064369', 'ENSMUSG00000064370', 'ENSMUSG00000064371', 'ENSMUSG00000064372'], are not valid obs/ var names or indices."
Yenaled commented 11 months ago

What are the mito_genes that you're using?

Print out the mito_genes

ScienceComputing commented 11 months ago

What are the mito_genes that you're using?

Print out the mito_genes

mito_ensembl_ids = sc.queries.mitochondrial_genes("mmusculus", attrname="ensembl_gene_id")
mito_genes = mito_ensembl_ids["ensembl_gene_id"].values
mito_genes

array(['ENSMUSG00000064336', 'ENSMUSG00000064337', 'ENSMUSG00000064338', 'ENSMUSG00000064339', 'ENSMUSG00000064340', 'ENSMUSG00000064341', 'ENSMUSG00000064342', 'ENSMUSG00000064343', 'ENSMUSG00000064344', 'ENSMUSG00000064345', 'ENSMUSG00000064346', 'ENSMUSG00000064347', 'ENSMUSG00000064348', 'ENSMUSG00000064349', 'ENSMUSG00000064350', 'ENSMUSG00000064351', 'ENSMUSG00000064352', 'ENSMUSG00000064353', 'ENSMUSG00000064354', 'ENSMUSG00000064355', 'ENSMUSG00000064356', 'ENSMUSG00000064357', 'ENSMUSG00000064358', 'ENSMUSG00000064359', 'ENSMUSG00000064360', 'ENSMUSG00000064361', 'ENSMUSG00000065947', 'ENSMUSG00000064363', 'ENSMUSG00000064364', 'ENSMUSG00000064365', 'ENSMUSG00000064366', 'ENSMUSG00000064367', 'ENSMUSG00000064368', 'ENSMUSG00000064369', 'ENSMUSG00000064370', 'ENSMUSG00000064371', 'ENSMUSG00000064372'], dtype=object)

Appreciate it!

Yenaled commented 11 months ago

OK, but those IDs don't exist in your anndata object. You'll need to convert them to whatever form is in your anndata object (maybe they have a different id or maybe gene names are used).

ScienceComputing commented 11 months ago

OK, but those IDs don't exist in your anndata object. You'll need to convert them to whatever form is in your anndata object (maybe they have a different id or maybe gene names are used).

Many thanks for your insights!

You are right. I write the codes below and no error occurs.

import re
strings = adata.var_names
adata.var_names = [re.sub(r"\.\d+", "", x) for x in strings]
print("\nDisplay 30 modified gene names.")
print(adata.var_names[:30])

Display 30 modified gene names. Index(['ENSMUSG00000102693', 'ENSMUSG00000064842', 'ENSMUSG00000051951', 'ENSMUSG00000102851', 'ENSMUSG00000103377', 'ENSMUSG00000104017', 'ENSMUSG00000103025', 'ENSMUSG00000089699', 'ENSMUSG00000103201', 'ENSMUSG00000103147', 'ENSMUSG00000103161', 'ENSMUSG00000102331', 'ENSMUSG00000102348', 'ENSMUSG00000102592', 'ENSMUSG00000088333', 'ENSMUSG00000102343', 'ENSMUSG00000025900', 'ENSMUSG00000102948', 'ENSMUSG00000104123', 'ENSMUSG00000025902', 'ENSMUSG00000104238', 'ENSMUSG00000102269', 'ENSMUSG00000096126', 'ENSMUSG00000103003', 'ENSMUSG00000104328', 'ENSMUSG00000102735', 'ENSMUSG00000098104', 'ENSMUSG00000102175', 'ENSMUSG00000088000', 'ENSMUSG00000103265'], dtype='object')

It would be nice if the tutorial will add an ID conversion step, so that when learning and following your python codes, we won't meet any mismatch between the IDs in AnnData object and the queried IDs in the database.

Yenaled commented 11 months ago

Great! And unfortunately, those tutorials haven't been updated in a long time but, because we just released the new version of kallisto (0.50.0) and bustools (0.43.0), they will undergo complete reconstruction soon.