pinellolab / STREAM

STREAM: Single-cell Trajectories Reconstruction, Exploration And Mapping of single-cell data
http://stream.pinellolab.org
GNU Affero General Public License v3.0

st.map_new_data use subset of var_genes? #105

Closed ccshao closed 3 years ago

ccshao commented 3 years ago

In the mapping step, some of the var_genes in adata_ref are not found in adata_new. Would it be possible to add an option to set the var_genes to be used? Thanks!

huidongchen commented 3 years ago

Yes, you can try something like:

adata.obsm['var_genes'] = adata[:,your_gene_list].X.copy()
ccshao commented 3 years ago

I got some additional errors after changing var_genes in both uns and obsm. Might other values need to be updated as well?

#- wt and ko are two AnnData objects; I want to map ko onto wt
#- Fgf17, one of the wt var_genes, is not found in ko
vgene = adata_wt.uns['var_genes'].intersection(adata_ko.var.index)
adata_wt.uns['var_genes'] = vgene
adata_wt.obsm['var_genes'] = adata_wt[:, vgene].X.copy()

adata_wt.obsm['var_genes'].shape
# (487, 749)

len(adata_wt.uns['var_genes'])
# 749

adata_wt.obsm['var_genes']

#- however, the mapping below still complains about 750 features
adata_combined = st.map_new_data(adata_wt, adata_ko)

Top variable genes are being used for mapping ...
method 'umap' is being used for mapping ...
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/miniconda3/envs/env_stream/lib/python3.7/site-packages/umap/umap_.py in transform(self, X)
   2055                 dmat = pairwise_distances(
-> 2056                     X, self._raw_data, metric=_m, **self._metric_kwds
   2057                 )

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, force_all_finite, **kwds)
   1778 
-> 1779     return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1780 

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1359     if effective_n_jobs(n_jobs) == 1:
-> 1360         return func(X, Y, **kwds)
   1361 

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in _pairwise_callable(X, Y, metric, force_all_finite, **kwds)
   1379     """
-> 1380     X, Y = check_pairwise_arrays(X, Y, force_all_finite=force_all_finite)
   1381 

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype, accept_sparse, force_all_finite, copy)
    160                          "X.shape[1] == %d while Y.shape[1] == %d" % (
--> 161                              X.shape[1], Y.shape[1]))
    162 

ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 749 while Y.shape[1] == 750

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-5-1d6046fd4e78> in <module>
----> 1 adata_combined = st.map_new_data(adata_wt, adata_ko)

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/stream/core.py in map_new_data(adata_ref, adata_new, use_radius)
   4101         if('trans_umap' in adata_ref.uns_keys()):
   4102             trans = adata_ref.uns['trans_umap']
-> 4103             adata_new.obsm['X_umap_mapping'] = trans.transform(input_data)
   4104             adata_new.obsm['X_dr'] = adata_new.obsm['X_umap_mapping'].copy()
   4105         else:

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/umap/umap_.py in transform(self, X)
   2061                     self._raw_data,
   2062                     metric=self._input_distance_func,
-> 2063                     kwds=self._metric_kwds,
   2064                 )
   2065             indices = np.argpartition(dmat, self._n_neighbors)[:, : self._n_neighbors]

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/umap/distances.py in pairwise_special_metric(X, Y, metric, kwds)
   1258             return metric(_X, _Y, *kwd_vals)
   1259 
-> 1260         return pairwise_distances(X, Y, metric=_partial_metric)
   1261     else:
   1262         special_metric_func = named_distances[metric]

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, force_all_finite, **kwds)
   1777         func = partial(distance.cdist, metric=metric, **kwds)
   1778 
-> 1779     return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1780 
   1781 

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1358 
   1359     if effective_n_jobs(n_jobs) == 1:
-> 1360         return func(X, Y, **kwds)
   1361 
   1362     # enforce a threading backend to prevent data communication overhead

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in _pairwise_callable(X, Y, metric, force_all_finite, **kwds)
   1378     """Handle the callable case for pairwise_{distances,kernels}
   1379     """
-> 1380     X, Y = check_pairwise_arrays(X, Y, force_all_finite=force_all_finite)
   1381 
   1382     if X is Y:

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70                           FutureWarning)
     71         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72         return f(**kwargs)
     73     return inner_f
     74 

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype, accept_sparse, force_all_finite, copy)
    159         raise ValueError("Incompatible dimension for X and Y matrices: "
    160                          "X.shape[1] == %d while Y.shape[1] == %d" % (
--> 161                              X.shape[1], Y.shape[1]))
    162 
    163     return X, Y

ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 749 while Y.shape[1] == 750
huidongchen commented 3 years ago

Yeah, you are right.

The current mapping procedure only works when the query dataset contains all the variable genes of the reference dataset, since that set of variable genes was used to define the original manifold. To be projected into the same manifold, the cells from the query data need to have the same set of genes.

In your case, Fgf17 was used to learn the WT (reference) space but does not exist in the KO (query) dataset, so theoretically the KO cells cannot be projected onto WT since they are missing one feature/gene.

The easiest solution here is to remove 'Fgf17' from the variable gene set in your WT data after the step st.select_variable_genes(), and then perform st.dimension_reduction() and the rest of the steps again. Removing one gene probably won't change the WT result much, but it will allow the mapping procedure to work for KO.
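
Something along these lines should do it (a rough sketch; it assumes 'Fgf17' is the only missing gene and that adata_wt.uns['var_genes'] is a pandas Index of gene names, as in your snippet above):

import pandas as pd

st.select_variable_genes(adata_wt)

# drop the gene that is absent from the KO data from both the stored gene list
# and the matching expression matrix in .obsm
keep = [g for g in adata_wt.uns['var_genes'] if g != 'Fgf17']
adata_wt.uns['var_genes'] = pd.Index(keep)
adata_wt.obsm['var_genes'] = adata_wt[:, keep].X.copy()

# re-learn the reference manifold on the reduced gene set, then redo the
# downstream trajectory steps before calling st.map_new_data()
st.dimension_reduction(adata_wt)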

I hope the above is useful to you.

ccshao commented 3 years ago

Here is another strange error for me.

As suggested, wt and ko now have the same genes. However, there is an error in st.map_new_data:

KeyError: "None of [Index(['n_counts', 'n_cells', 'pct_cells'], dtype='object')] are in the [columns]"

Both the wt and ko data have all three columns in var. Below is the code. Thanks!

1. Preprocess data

st.__version__
st.set_figure_params(dpi=300,
                     style='white',
                     figsize=[6.4, 5.8],
                     rc={'image.cmap': 'viridis'})

adata = st.read(file_name='../../01_data/wt_count.tsv.gz',
                workdir='./stream_result')
st.add_metadata(adata, file_name='../../01_data/wt_meta.tsv')
st.cal_qc(adata, assay='rna')
st.filter_features(adata, min_n_cells=3)

adata_ko = st.read(file_name='../../01_data/ko_count.tsv.gz',
                   workdir='./mapping/')
st.add_metadata(adata_ko, file_name='../../01_data/ko_meta.tsv')
st.cal_qc(adata_ko, assay='rna')
st.filter_features(adata_ko, min_n_cells=3)

tgene    = adata.var.index.intersection(adata_ko.var.index)
adata    = adata[:, tgene].copy()
adata_ko = adata_ko[:, tgene].copy()

###Normalize gene expression based on library size
st.normalize(adata, method='lib_size')
###Logarithmize gene expression
st.log_transform(adata)
###Remove mitochondrial genes
st.remove_mt_genes(adata)

###Normalize gene expression based on library size
st.normalize(adata_ko, method='lib_size')
###Logarithmize gene expression
st.log_transform(adata_ko)
###Remove mitochondrial genes
st.remove_mt_genes(adata_ko)

st.write(adata, file_name="adata_basic.pkl")
st.write(adata_ko, file_name="adata_ko_basic.pkl")

#- !! then perform stream on adata and save as stream_result.pkl

2. Mapping

adata_ko = st.read(file_name='adata_ko_basic.pkl')
adata_wt = st.read(file_name='stream_result.pkl')

adata_ko
Out[16]:
AnnData object with n_obs × n_vars = 596 × 14481
    obs: 'label', 'label_color', 'n_counts', 'n_genes', 'pct_genes', 'pct_mt'
    var: 'n_counts', 'n_cells', 'pct_cells'
    uns: 'workdir', 'label_color', 'assay'

adata_wt
Out[14]:
AnnData object with n_obs × n_vars = 487 × 14481
    obs: 'label', 'label_color', 'n_counts', 'n_genes', 'pct_genes', 'pct_mt', 'kmeans', 'node', 'branch_id', 'branch_id_alias', 'branch_lam', 'branch_dist', 'S0_pseudotime', 'S1_pseudotime'
    var: 'n_counts', 'n_cells', 'pct_cells'
    uns: 'workdir', 'label_color', 'assay', 'var_genes', 'trans_umap', 'params', 'epg', 'flat_tree', 'seed_epg', 'seed_flat_tree', 'ori_epg', 'epg_obj', 'ori_epg_obj', 'stream_S1', 'scaled_marker_expr', 'transition_markers'
    obsm: 'var_genes', 'X_umap', 'X_dr', 'X_spring', 'X_stream_S1'

#- !! I could not understand the error message.
adata_all = st.map_new_data(adata_wt, adata_ko)
Top variable genes are being used for mapping ...
method 'umap' is being used for mapping ...
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-17-d60b8ec659ab> in <module>
----> 1 adata_all = st.map_new_data(adata_wt, adata_ko)

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/stream/core.py in map_new_data(adata_ref, adata_new, use_radius)
   4129     shared_var_key = [x for x in adata_new.var_keys() if x in adata_ref.var_keys()]
   4130     adata_combined.obs = adata_combined.obs[shared_obs_key+['batch']]
-> 4131     adata_combined.var = adata_combined.var[shared_var_key]
   4132     adata_combined.uns['workdir'] = adata_new.uns['workdir']
   4133     for key in adata_ref.uns_keys():

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2910             if is_iterator(key):
   2911                 key = list(key)
-> 2912             indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
   2913
   2914         # take() does not accept boolean indexers

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1252             keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
   1253
-> 1254         self._validate_read_indexer(keyarr, indexer, axis, raise_missing=raise_missing)
   1255         return keyarr, indexer
   1256

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1296             if missing == len(indexer):
   1297                 axis_name = self.obj._get_axis_name(axis)
-> 1298                 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
   1299
   1300             # We (temporarily) allow for some missing keys with .loc, except in

KeyError: "None of [Index(['n_counts', 'n_cells', 'pct_cells'], dtype='object')] are in the [columns]"
huidongchen commented 3 years ago

Thanks for the feedback. This is indeed a bug in the function map_new_data().

The easiest workaround is to rename the columns of adata_ko.var so that they are distinct from those of adata_wt.var before calling adata_all = st.map_new_data(adata_wt, adata_ko):

adata_ko.var.columns = 'ko_'+adata_ko.var.columns
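
For reference, this is how the workaround fits into the mapping step (a small sketch; the 'ko_' prefix is arbitrary, and any prefix that avoids a clash with the columns of adata_wt.var should work):

# rename the KO .var columns so they no longer collide with the WT .var columns
# when map_new_data() combines the two AnnData objects internally
adata_ko.var.columns = 'ko_' + adata_ko.var.columns
adata_all = st.map_new_data(adata_wt, adata_ko)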
ccshao commented 3 years ago

Sorry, I came back with another small issue. I have now successfully generated the combined adata, but plotting the batch variable in obs fails with "TypeError: Categorical is not ordered for operation min, you can use .as_ordered() to change the Categorical to an ordered one".

Plotting with the label variable works.

adata_all = st.map_new_data(adata_wt, adata_ko)

adata_all
AnnData object with n_obs × n_vars = 1083 × 14481
    obs: 'label', 'label_color', 'n_counts', 'n_genes', 'pct_genes', 'pct_mt', 'node', 'branch_id', 'branch_id_alias', 'branch_lam', 'branch_dist', 'S0_pseudotime', 'S1_pseudotime', 'batch', 'batch2'
    uns: 'workdir', 'epg', 'flat_tree', 'label_color', 'var_genes'
    obsm: 'var_genes', 'X_dr'

st.plot_dimension_reduction(adata_all,
                            color=['batch'],
                            show_graph=True,
                            show_text=False,
                            fig_name='dimension_reduction.png')

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-f0fa6b6ec554> in <module>
      3                             show_graph=True,
      4                             show_text=False,
----> 5                             fig_name='dimension_reduction.png')

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/stream/core.py in plot_dimension_reduction(adata, n_components, comp1, comp2, comp3, color, key_graph, fig_size, fig_ncol, fig_legend_ncol, fig_legend_order, vmin, vmax, alpha, pad, w_pad, h_pad, show_text, show_graph, save_fig, fig_path, fig_name, plotly)
   1477                     # ax_i.get_legend().texts[0].set_text("")
   1478                 else:
-> 1479                     vmin_i = df_plot[ann].min() if vmin is None else vmin
   1480                     vmax_i = df_plot[ann].max() if vmax is None else vmax
   1481                     sc_i = ax_i.scatter(df_plot_shuf['Dim'+str(comp1+1)], df_plot_shuf['Dim'+str(comp2+1)],

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/core/generic.py in stat_func(self, axis, skipna, level, numeric_only, **kwargs)
  11467             return self._agg_by_level(name, axis=axis, level=level, skipna=skipna)
  11468         return self._reduce(
> 11469             func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only
  11470         )
  11471

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/core/series.py in _reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
   4237         if isinstance(delegate, ExtensionArray):
   4238             # dispatch to ExtensionArray interface
-> 4239             return delegate._reduce(name, skipna=skipna, **kwds)
   4240
   4241         else:

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/core/arrays/categorical.py in _reduce(self, name, skipna, **kwargs)
   2081         if func is None:
   2082             raise TypeError(f"Categorical cannot perform the operation {name}")
-> 2083         return func(skipna=skipna, **kwargs)
   2084
   2085     @deprecate_kwarg(old_arg_name="numeric_only", new_arg_name="skipna")

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    197                 else:
    198                     kwargs[new_arg_name] = new_arg_value
--> 199             return func(*args, **kwargs)
    200
    201         return cast(F, wrapper)

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/core/arrays/categorical.py in min(self, skipna, **kwargs)
   2104         """
   2105         nv.validate_min((), kwargs)
-> 2106         self.check_for_ordered("min")
   2107
   2108         if not len(self._codes):

~/miniconda3/envs/env_stream/lib/python3.7/site-packages/pandas/core/arrays/categorical.py in check_for_ordered(self, op)
   1444         if not self.ordered:
   1445             raise TypeError(
-> 1446                 f"Categorical is not ordered for operation {op}\n"
   1447                 "you can use .as_ordered() to change the "
   1448                 "Categorical to an ordered one\n"

TypeError: Categorical is not ordered for operation min
you can use .as_ordered() to change the Categorical to an ordered one

#- I have tried ordering the category, but it did not help
adata_all.obs['batch2'] = adata_all.obs['batch'].cat.as_ordered()
st.plot_dimension_reduction(adata_all,
                            color=['batch2'],
                            show_graph=True,
                            show_text=False,
                            fig_name='dimension_reduction.png')

ValueError: 'c' argument must be a color, a sequence of colors, or a sequence of numbers, not WT_CATTCTAAGCGACAGT-1-ref
huidongchen commented 3 years ago

This error is seemingly caused by the latest version of pandas. You can either

1) change the data type of adata_all.obs['batch'] to string (see the sketch below):

adata_all.obs['batch'] = adata_all.obs['batch'].astype(str)

or 2) downgrade your pandas to '1.0.5'; the error should then be gone.
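
For example, with option 1 the plotting call from your snippet should then work (a quick sketch reusing the same arguments):

# cast the categorical batch labels to plain strings before plotting
adata_all.obs['batch'] = adata_all.obs['batch'].astype(str)

st.plot_dimension_reduction(adata_all,
                            color=['batch'],
                            show_graph=True,
                            show_text=False,
                            fig_name='dimension_reduction.png')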

We did notice that some functions are broken due to recent updates of several dependencies. We will try to fix these issues in the next release.

ccshao commented 3 years ago

Thanks very much for all the tips! The first option works well.