tabdelaal / SpaGE

Enhancing spatial transcriptomics data by predicting the expression of unmeasured genes from dissociated scRNA-seq data
MIT License

NaN problem in fitting the model #2

Open HelloWorldLTY opened 10 months ago

HelloWorldLTY commented 10 months ago

Hi, in the model fitting step, I notice that there is a problem:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[26], line 8
      6 for i in Gene_set:
      7     print(i)
----> 8     Imp_Genes = SpaGE(osmFISH_data.T.drop(i,axis=1),RNA_data.T,n_pv=30,
      9                            genes_to_predict = [i])
     10     print(Imp_Genes.shape)
     11     Correlations[i] = st.spearmanr(osmFISH_data[i],Imp_Genes[i])[0]

File /gpfs/gibbs/pi/zhao/tl688/tangram/SpaGE/SpaGE/main.py:74, in SpaGE(Spatial_data, RNA_data, n_pv, genes_to_predict)
     64 Imp_Genes = pd.DataFrame(np.zeros((Spatial_data.shape[0],len(genes_to_predict))),
     65                              columns=genes_to_predict)
     67 pv_Spatial_RNA = PVComputation(
     68         n_factors = n_pv,
     69         n_pv = n_pv,
     70         dim_reduction = 'pca',
     71         dim_reduction_target = 'pca'
     72 )
---> 74 pv_Spatial_RNA.fit(Common_data,Spatial_data_scaled[Common_data.columns])
     76 S = pv_Spatial_RNA.source_components_.T
     78 Effective_n_pv = sum(np.diag(pv_Spatial_RNA.cosine_similarity_matrix_) > 0.3)

File /gpfs/gibbs/pi/zhao/tl688/tangram/SpaGE/SpaGE/principal_vectors.py:119, in PVComputation.fit(self, X_source, X_target, y_source)
    103     """
    104     Compute the common factors between two set of data.
    105     IMPORTANT: Same genes have to be given for source and target, and in same order
   (...)
    116     self: returns an instance of self.
    117     """
    118     # Compute factors independently for source and target. Orthogonalize the basis
--> 119     Ps = self.dim_reduction_source.fit(X_source, y_source).components_
    120     Ps = scipy.linalg.orth(Ps.transpose()).transpose()
    122     Pt = self.dim_reduction_target.fit(X_target, y_source).components_

File /gpfs/gibbs/project/zhao/tl688/conda_envs/tangram-env/lib/python3.8/site-packages/sklearn/base.py:1151, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1144     estimator._validate_params()
   1146 with config_context(
   1147     skip_parameter_validation=(
   1148         prefer_skip_nested_validation or global_skip_validation
   1149     )
   1150 ):
-> 1151     return fit_method(estimator, *args, **kwargs)

File /gpfs/gibbs/project/zhao/tl688/conda_envs/tangram-env/lib/python3.8/site-packages/sklearn/decomposition/_pca.py:434, in PCA.fit(self, X, y)
    416 @_fit_context(prefer_skip_nested_validation=True)
    417 def fit(self, X, y=None):
    418     """Fit the model with X.
    419 
    420     Parameters
   (...)
    432         Returns the instance itself.
    433     """
--> 434     self._fit(X)
    435     return self

File /gpfs/gibbs/project/zhao/tl688/conda_envs/tangram-env/lib/python3.8/site-packages/sklearn/decomposition/_pca.py:483, in PCA._fit(self, X)
    477 if issparse(X):
    478     raise TypeError(
    479         "PCA does not support sparse input. See "
    480         "TruncatedSVD for a possible alternative."
    481     )
--> 483 X = self._validate_data(
    484     X, dtype=[np.float64, np.float32], ensure_2d=True, copy=self.copy
    485 )
    487 # Handle n_components==None
    488 if self.n_components is None:

File /gpfs/gibbs/project/zhao/tl688/conda_envs/tangram-env/lib/python3.8/site-packages/sklearn/base.py:604, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, cast_to_ndarray, **check_params)
    602         out = X, y
    603 elif not no_val_X and no_val_y:
--> 604     out = check_array(X, input_name="X", **check_params)
    605 elif no_val_X and not no_val_y:
    606     out = _check_y(y, **check_params)

File /gpfs/gibbs/project/zhao/tl688/conda_envs/tangram-env/lib/python3.8/site-packages/sklearn/utils/validation.py:959, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    953         raise ValueError(
    954             "Found array with dim %d. %s expected <= 2."
    955             % (array.ndim, estimator_name)
    956         )
    958     if force_all_finite:
--> 959         _assert_all_finite(
    960             array,
    961             input_name=input_name,
    962             estimator_name=estimator_name,
    963             allow_nan=force_all_finite == "allow-nan",
    964         )
    966 if ensure_min_samples > 0:
    967     n_samples = _num_samples(array)

File /gpfs/gibbs/project/zhao/tl688/conda_envs/tangram-env/lib/python3.8/site-packages/sklearn/utils/validation.py:124, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    121 if first_pass_isfinite:
    122     return
--> 124 _assert_all_finite_element_wise(
    125     X,
    126     xp=xp,
    127     allow_nan=allow_nan,
    128     msg_dtype=msg_dtype,
    129     estimator_name=estimator_name,
    130     input_name=input_name,
    131 )

File /gpfs/gibbs/project/zhao/tl688/conda_envs/tangram-env/lib/python3.8/site-packages/sklearn/utils/validation.py:173, in _assert_all_finite_element_wise(X, xp, allow_nan, msg_dtype, estimator_name, input_name)
    156 if estimator_name and input_name == "X" and has_nan_error:
    157     # Improve the error message on how to handle missing values in
    158     # scikit-learn.
    159     msg_err += (
    160         f"\n{estimator_name} does not accept missing values"
    161         " encoded as NaN natively. For supervised learning, you might want"
   (...)
    171         "#estimators-that-handle-nan-values"
    172     )
--> 173 raise ValueError(msg_err)

ValueError: Input X contains NaN.
PCA does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Could you please help me address this problem? Thanks a lot.

tabdelaal commented 10 months ago

Hi, as far as I can see, the problem is that your RNA data input contains NaN values.

Can you check whether that's the case? If so, I believe replacing these NaN values with zeros should solve it.
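A minimal sketch of that check, using pandas on a toy table. The name `expr` and its values are made up for illustration; in practice the same calls would be applied to both the scRNA-seq and the spatial inputs.

```python
import numpy as np
import pandas as pd

# Toy stand-in for an expression table (cells x genes); values are made up.
expr = pd.DataFrame(
    {"GeneA": [1.0, np.nan, 3.0], "GeneB": [0.0, 2.0, 5.0]}
)

# Count missing values per gene, then in total.
nan_per_gene = expr.isna().sum()
total_nan = int(nan_per_gene.sum())
print(nan_per_gene)
print("total NaN values:", total_nan)

# Replace missing values with zeros, as suggested above.
expr_filled = expr.fillna(0)
```

Note that a clean result here only says the raw input has no NaN values; it does not rule out NaN values that appear later during preprocessing.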

Cheers

HelloWorldLTY commented 9 months ago

Hi, I have checked. My RNA input does not contain NaN values, since I can run gimVI and Tangram on it, and

np.sum(np.isnan(RNA_data.T.values)*1)

returns 0.

I think I have figured out the reason: I need to filter out lowly expressed genes. However, if my target gene is lowly expressed, what should I do? Thanks.

Neermita18 commented 3 weeks ago


Yes, you are correct. The problem lies with genes that have zero expression in all cells. Because SpaGE also applies z-score normalization, the standard deviation of such a gene is 0, and dividing by it produces NaN values. To filter out genes that are all zeros while keeping lowly expressed genes intact, you can use:

Genes_count = np.sum(RNA_data > 0, axis=1)
RNA_data = RNA_data.loc[Genes_count > 0, :]
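The effect can be reproduced on a toy matrix. The sketch below assumes a genes x cells layout like `RNA_data` in this thread and uses a hand-rolled z-score (the helper `zscore_genes` is hypothetical, not SpaGE's own code) to show that an all-zero gene becomes NaN while a lowly expressed gene survives the filter.

```python
import numpy as np
import pandas as pd

# Toy genes x cells matrix; values are made up for illustration.
rna = pd.DataFrame(
    [[0.0, 0.0, 0.0, 0.0],   # never detected
     [0.0, 0.0, 1.0, 0.0],   # lowly expressed: detected in one cell
     [2.0, 5.0, 3.0, 4.0]],
    index=["GeneZero", "GeneLow", "GeneHigh"],
    columns=["cell1", "cell2", "cell3", "cell4"],
)

def zscore_genes(df):
    """Z-score each gene (row) across cells. Hypothetical helper, not SpaGE's code."""
    mean = df.mean(axis=1)
    std = df.std(axis=1, ddof=0)
    # A gene with std == 0 yields 0/0, which pandas reports as NaN.
    return df.sub(mean, axis=0).div(std, axis=0)

scaled = zscore_genes(rna)
genes_with_nan = scaled.index[scaled.isna().any(axis=1)].tolist()
print("genes with NaN after z-scoring:", genes_with_nan)

# The filter from the comment above: keep genes detected in at least one cell.
genes_count = np.sum(rna > 0, axis=1)
rna_filtered = rna.loc[genes_count > 0, :]
scaled_filtered = zscore_genes(rna_filtered)
print("genes kept:", rna_filtered.index.tolist())
```

Any gene with a constant value across all cells has zero standard deviation, so the same NaN would appear for it even when the constant is not zero.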

I've created a pull request in which Common_data itself drops the genes that are NaN in all cells after z-scoring.
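A rough sketch of that idea, not the pull request's actual code: after z-scoring, drop every gene column that is NaN in all cells, then restrict the other table to the surviving genes so both keep the same genes in the same order. The names `common_scaled` and `spatial_scaled` and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy cells x genes tables after z-scoring; values are made up.
common_scaled = pd.DataFrame({
    "GeneZero": [np.nan, np.nan, np.nan],   # constant gene: all NaN
    "GeneLow": [-0.7, 1.4, -0.7],
    "GeneHigh": [1.2, -1.2, 0.0],
})
spatial_scaled = pd.DataFrame({
    "GeneZero": [0.5, -0.5],
    "GeneLow": [1.0, -1.0],
    "GeneHigh": [-1.0, 1.0],
})

# Drop genes (columns) that are NaN in every cell.
common_clean = common_scaled.dropna(axis=1, how="all")

# Keep the same genes, in the same order, in the other table.
spatial_clean = spatial_scaled[common_clean.columns]
```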