scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

PCA fails with batch highly-variable gene correction #1032

Closed davidhbrann closed 4 years ago

davidhbrann commented 4 years ago

With the new batch_key option in highly_variable_genes, downstream functions like PCA can fail with the old defaults, and without an informative error. The same is true for sc.pl.highly_variable_genes(adata), which currently doesn't recognize that the output key in adata.var is highly_variable_intersection rather than highly_variable.

sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=10, min_disp=0.1, batch_key="source")
adata_hvg = adata[:, adata.var.highly_variable_intersection].copy()
sc.tl.pca(adata_hvg, svd_solver='arpack', n_comps = 30, use_highly_variable=True) # both the default None and True will error; see below
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-125-322839e541fd> in <module>
----> 1 sc.tl.pca(adata_hvg, svd_solver='arpack', n_comps = 30, use_highly_variable=True)

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/scanpy/preprocessing/_simple.py in pca(data, n_comps, zero_center, svd_solver, random_state, return_info, use_highly_variable, dtype, copy, chunked, chunk_size)
    529             pca_ = TruncatedSVD(n_components=n_comps, random_state=random_state)
    530             X = adata_comp.X
--> 531         X_pca = pca_.fit_transform(X)
    532 
    533     if X_pca.dtype.descr != np.dtype(dtype).descr: X_pca = X_pca.astype(dtype)

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/sklearn/decomposition/pca.py in fit_transform(self, X, y)
    358 
    359         """
--> 360         U, S, V = self._fit(X)
    361         U = U[:, :self.n_components_]
    362 

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/sklearn/decomposition/pca.py in _fit(self, X)
    380 
    381         X = check_array(X, dtype=[np.float64, np.float32], ensure_2d=True,
--> 382                         copy=self.copy)
    383 
    384         # Handle n_components==None

~/anaconda2/envs/scanpy/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    556                              " a minimum of %d is required%s."
    557                              % (n_features, array.shape, ensure_min_features,
--> 558                                 context))
    559 
    560     if warn_on_dtype and dtype_orig is not None and array.dtype != dtype_orig:

ValueError: Found array with 0 feature(s) (shape=(44495, 0)) while a minimum of 1 is required.

The pca code doesn't raise its own, more informative error here, because the highly_variable column is still written alongside highly_variable_intersection, so 'highly_variable' in adata.var.keys() evaluates to True:

    if use_highly_variable is True and 'highly_variable' not in adata.var.keys():
        raise ValueError('Did not find adata.var[\'highly_variable\']. '
                         'Either your data already only consists of highly-variable genes '
                         'or consider running `pp.highly_variable_genes` first.')
    if use_highly_variable is None:
        use_highly_variable = True if 'highly_variable' in adata.var.keys() else False
    if use_highly_variable:
        logg.info('    on highly variable genes')
    adata_comp = adata[:, adata.var['highly_variable']] if use_highly_variable else adata
adata.var.keys()
Index(['mito', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts',
       'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts',
       'n_cells', 'highly_variable', 'means', 'dispersions',
       'dispersions_norm', 'highly_variable_nbatches',
       'highly_variable_intersection'],
      dtype='object')

Versions:

scanpy==1.4.5.post2 anndata==0.7.1 umap==0.3.10 numpy==1.17.0 scipy==1.3.0 pandas==0.24.2 scikit-learn==0.21.3 statsmodels==0.11.0dev0+630.g4565348 python-igraph==0.7.1 louvain==0.6.1

ivirshup commented 4 years ago

I think the solution here is to be able to specify which key is used to filter features on.
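
In the meantime, a manual workaround is to subset on whichever boolean column you prefer and skip PCA's own filtering. A rough sketch (the choice of key and the up-front copy are illustrative assumptions, not an existing scanpy option):

import scanpy as sc
# assumes `adata` already carries the highly_variable_* columns from pp.highly_variable_genes
key = "highly_variable_intersection"  # or "highly_variable"
adata_sub = adata[:, adata.var[key]].copy()  # subset genes before PCA
sc.tl.pca(adata_sub, svd_solver='arpack', use_highly_variable=False)  # PCA on all remaining genes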

@gokceneraslan, what is adata.var["highly_variable"] supposed to mean if a batch key was specified when highly_variable_genes was run? I've checked with a few datasets and each time all the values were False.

LuckyMD commented 4 years ago

Maybe a solution would be to set highly_variable equal to highly_variable_intersection when using the batch_key. I think highly_variable is a remnant of using highly_variable_genes_single_batch() (or whatever the function is called) to get the individual per-batch HVGs for intersection calculation. @gokceneraslan will be able to correct me here though.

mssher07 commented 4 years ago

Maybe a solution would be to set highly_variable equal to highly_variable_intersection when using the batch_key. I think highly_variable is a remnant of using highly_variable_genes_single_batch() (or whatever the function is called) to get the individual per-batch HVGs for intersection calculation. @gokceneraslan will be able to correct me here though.

Encountered this exact issue today. In my example, highly_variable_intersection only contains 17 genes across 30 datasets, which I imagine might silently give unexpected results downstream. Another option might be to allow the user to define a minimum number of batches, so that highly_variable is derived from highly_variable_nbatches > NUMBER. This is the approach used here, FWIW: https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/scanpy/scanpy_03_integration.html
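
That could already be approximated by hand with something along these lines (a sketch; the threshold of 3 batches is an arbitrary assumption):

# keep genes flagged as highly variable in at least min_batches of the batches
min_batches = 3
adata.var["highly_variable"] = adata.var["highly_variable_nbatches"] >= min_batches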

gokceneraslan commented 4 years ago

adata.var["highly_variable"] and adata.var["highly_variable_intersection"] have very different meanings and it's good to have them separate, I think. Considering that PCA looks for the genes marked True in adata.var["highly_variable"] (regardless of the value of the batch_key option), using adata.var["highly_variable_intersection"] for filtering is not a good idea.

In case there is confusion between adata.var["highly_variable"] and adata.var["highly_variable_intersection"]:

If the user specifies n_top_genes, adata.var["highly_variable"] contains the top variable genes from the list of genes sorted by the number of batches in which they are detected as variable (ties broken by dispersion). If mean/dispersion filters are provided, these cutoffs are applied to the mean and dispersion values averaged across batches to construct a unified adata.var["highly_variable"].

adata.var["highly_variable_intersection"] is a very strict definition that I personally avoid using at all, but it also depends on the experimental setting and batch_key itself.
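
Put differently, the intersection column marks genes that were called highly variable in every single batch, which on the example above can be sanity-checked with something like (a sketch, assuming batch_key="source" as in the original snippet):

n_batches = adata.obs["source"].nunique()
assert (adata.var["highly_variable_intersection"] == (adata.var["highly_variable_nbatches"] == n_batches)).all()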

Therefore, there is a mistake in the following code:

sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=10, min_disp=0.1, batch_key="source")
adata_hvg = adata[:, adata.var.highly_variable_intersection].copy()
sc.tl.pca(adata_hvg, svd_solver='arpack', n_comps = 30, use_highly_variable=True) # both the default None and True will error; see below

This possibly removes many genes that are identified as highly variable in adata.var.highly_variable because adata_hvg = adata[:, adata.var.highly_variable_intersection] keeps only a subset of highly variable genes (see the definitions above).

If one wants to use the strict definition, the correct usage would be:

sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=10, min_disp=0.1, batch_key="source")
adata.var.highly_variable = adata.var.highly_variable_intersection
sc.tl.pca(adata, svd_solver='arpack', n_comps=30, use_highly_variable=True)

which is what @LuckyMD proposes, IIUC.

I think what we should do here is to print a more informative error in PCA, something like "HVGs identified by sc.pp.highly_variable_genes cannot be found in adata."
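
A minimal sketch of such a check inside the pca code (hypothetical wording and placement, not the actual fix):

    # hypothetical guard: the column exists but no gene is actually flagged
    if use_highly_variable and adata.var['highly_variable'].sum() == 0:
        raise ValueError(
            "adata.var['highly_variable'] contains no True entries. "
            "Rerun `pp.highly_variable_genes` or pass `use_highly_variable=False`."
        )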

gokceneraslan commented 4 years ago

I think the solution here is to be able to specify which key is used to filter features on.

Yeah, that might also work, but might also be too much flexibility, I'm not sure.

@gokceneraslan, what is adata.var["highly_variable"] supposed to mean if a batch key was specified when highly_variable_genes was run? I've checked with a few datasets and each time all the values were False.

Hmm, can you make a reproducible example? This sounds like a bug. How do the other fields, like adata.var["highly_variable_nbatches"] and adata.var["highly_variable_intersection"], look? Maybe a separate issue would be a better place to discuss.

LuckyMD commented 4 years ago

Hey @gokceneraslan,

I'm surprised at how you describe the contents of adata.var['highly_variable'] when batch_key is set. I wrote a function that does pretty much exactly the same thing, building on the batch_key output, for our data integration benchmarking, as I thought this wasn't available in scanpy. I recall looking through the code and thinking this was missing. Maybe we can compare functions to see whether we're doing exactly the same thing or not?

gokceneraslan commented 4 years ago

Oh interesting, I thought it was clear :) I mean you even contributed to the function, no?

I think we also discussed why not to use intersection by default in the PR: https://github.com/theislab/scanpy/pull/614#issuecomment-485875031

If intersection is not used by default, why would we write in the documentation that it acts as a lightweight batch correction method? I'm as surprised as you are :)

Edit: adata.var["highly_variable_intersection"] wasn't even implemented in the beginning of the PR.

ivirshup commented 4 years ago

@gokceneraslan here's a quick example:

import scanpy as sc
pbmc = sc.datasets.pbmc3k_processed().raw.to_adata()
sc.pp.highly_variable_genes(pbmc, batch_key="louvain")

assert not pbmc.var["highly_variable"].any()

Alternatively:

pbmc = sc.datasets.pbmc3k_processed().raw.to_adata()
pbmc.obs["batch"] = "a"
sc.pp.highly_variable_genes(pbmc, batch_key="batch")
assert not pbmc.var["highly_variable"].any()

pbmc.obs["batch"] = "a"
pbmc.obs["batch"][::2] = "b"
sc.pp.highly_variable_genes(pbmc, batch_key="batch")
assert not pbmc.var["highly_variable"].any()

LuckyMD commented 4 years ago

Oh interesting, I thought it was clear :) I mean you even contributed to the function, no?

I think we also discussed why not to use intersection by default in the PR: #614 (comment)

If intersection is not used by default, why would we write in the documentation that it acts as a lightweight batch correction method? I'm as surprised as you are :)

Yes, I fixed something and reorganized a bit. I also recall our discussion on highly_variable_intersection. However, I thought your organization of HVGs was only for the ranking in highly_variable_nbatches. I didn't see that it's also the default for highly_variable. I never really looked at the docs... that would have given a hint... I still feel as though I have something slightly different, though, if I recall. Will look more carefully once this benchmarking data integration thing is out.

gokceneraslan commented 4 years ago

@gokceneraslan here's a quick example:

Oh man, just noticed a horrible bug which leads to zero HVGs if batch_key is given but n_top_genes is not 😓 Somehow, highly_variable_genes with batch_key but without n_top_genes (which is the option I always use :) ) is never tested :/ Fixing now.

gokceneraslan commented 4 years ago

Fixed in #1180.