Closed davidhbrann closed 4 years ago
I think the solution here is to be able to specify which key is used to filter features on.
@gokceneraslan, what is adata.var["highly_variable"] supposed to mean once highly_variable_genes has been run with a batch key specified? I've checked a few datasets, and each time all the values were False.
Maybe a solution would be to set highly_variable equal to highly_variable_intersection when using the batch_key. I think highly_variable is a remnant of using highly_variable_genes_single_batch() (or whatever the function is called) to get the individual per-batch HVGs for intersection calculation. @gokceneraslan will be able to correct me here though.
Encountered this exact issue today. In my example, highly_variable_intersection only contains 17 genes across 30 datasets, which I imagine might silently give unexpected results downstream. In addition to that option, another option might be to allow the user to define a minimum number of highly_variable_nbatches, so that highly_variable is derived from highly_variable_nbatches > NUMBER. This is an approach used here, FWIW: https://nbisweden.github.io/workshop-scRNAseq/labs/compiled/scanpy/scanpy_03_integration.html
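The thresholding idea above can be sketched in plain Python. This is only an illustration: the helper name select_by_min_batches and the toy gene counts are made up, and the comparison uses >= min_batches (equivalent to > NUMBER with NUMBER = min_batches - 1).

```python
from typing import Dict, Mapping


def select_by_min_batches(nbatches: Mapping[str, int], min_batches: int) -> Dict[str, bool]:
    """Flag a gene as highly variable if it is an HVG in at least `min_batches` batches.

    `nbatches` is a hypothetical stand-in for adata.var["highly_variable_nbatches"].
    """
    if min_batches < 1:
        raise ValueError("min_batches must be at least 1")
    return {gene: count >= min_batches for gene, count in nbatches.items()}


# Toy per-gene counts for a dataset with 3 batches (invented values).
nbatches = {"GeneA": 3, "GeneB": 2, "GeneC": 0, "GeneD": 1}

# Relaxed definition: HVG in at least 2 of the 3 batches.
flags = select_by_min_batches(nbatches, min_batches=2)

# Setting min_batches to the number of batches reproduces the strict intersection.
intersection = select_by_min_batches(nbatches, min_batches=3)
```

On an AnnData object the same idea would presumably reduce to a single comparison on adata.var["highly_variable_nbatches"], with the result assigned to adata.var["highly_variable"].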
adata.var["highly_variable"] and adata.var["highly_variable_intersection"] have very different meanings and it's good to have them separate, I think. Considering that PCA looks for the genes marked True in adata.var["highly_variable"] (regardless of the value of the batch_key option), using adata.var["highly_variable_intersection"] for filtering is not a good idea.
If there is confusion between adata.var["highly_variable"] and adata.var["highly_variable_intersection"]:
If the user specifies n_top_genes, adata.var["highly_variable"] contains the top variable genes from the list of genes sorted by the number of batches in which they are detected as variable (ties broken using dispersion). If mean/dispersion filters are provided, we apply these cutoffs to the mean and dispersion averaged across batches to construct a unified adata.var["highly_variable"].
adata.var["highly_variable_intersection"] is a very strict definition that I personally avoid using at all, but it also depends on the experimental setting and batch_key itself.
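To make the n_top_genes case concrete, here is a rough sketch of the ranking described above: sort genes by the number of batches in which they are variable, break ties by dispersion, and keep the top n. It is not the scanpy implementation; the function name, the single per-gene dispersion value, and the toy numbers are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def top_genes_across_batches(stats: Dict[str, Tuple[int, float]], n_top_genes: int) -> List[str]:
    """Return the n_top_genes best-ranked genes.

    `stats` maps gene -> (number of batches in which it is an HVG, dispersion).
    Genes are ranked by batch count first; dispersion only breaks ties.
    """
    ranked = sorted(stats, key=lambda gene: (stats[gene][0], stats[gene][1]), reverse=True)
    return ranked[:n_top_genes]


# Invented values: GeneA and GeneC tie on batch count, so dispersion decides their order.
stats = {
    "GeneA": (3, 0.8),
    "GeneB": (2, 2.5),
    "GeneC": (3, 1.9),
    "GeneD": (1, 4.0),
}
top = top_genes_across_batches(stats, n_top_genes=3)
```

Note that GeneD has the highest dispersion but is dropped, because it is variable in only one batch.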
Therefore, there is a mistake in the following code:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=10, min_disp=0.1, batch_key="source")
adata_hvg = adata[:, adata.var.highly_variable_intersection].copy()
sc.tl.pca(adata_hvg, svd_solver='arpack', n_comps = 30, use_highly_variable=True) # both the default None and True will error; see below
This possibly removes many genes that are identified as highly variable in adata.var.highly_variable because adata_hvg = adata[:, adata.var.highly_variable_intersection] keeps only a subset of highly variable genes (see the definitions above).
If one wants to use the strict definition, correct usage would be:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=10, min_disp=0.1, batch_key="source")
adata.var.highly_variable = adata.var.highly_variable_intersection
sc.tl.pca(adata, svd_solver='arpack', n_comps=30, use_highly_variable=True)  # PCA on adata itself; adata_hvg is not defined in this variant
which is what @LuckyMD proposes, IIUC.
I think what we should do here is print a more informative error in PCA, something like "HVGs identified by sc.pp.highly_variable_genes cannot be found in adata".
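A check along these lines might look like the sketch below. It is only an illustration of the idea, not what scanpy does: check_hvg_mask is a hypothetical helper, and the mask is a plain list standing in for adata.var["highly_variable"].

```python
from typing import Sequence


def check_hvg_mask(mask: Sequence[bool], key: str = "highly_variable") -> int:
    """Return the number of selected genes, or raise a descriptive error if there are none."""
    n_selected = sum(bool(flag) for flag in mask)
    if n_selected == 0:
        raise ValueError(
            f"No genes are marked True in adata.var[{key!r}]; HVGs identified by "
            "sc.pp.highly_variable_genes cannot be found in adata."
        )
    return n_selected


# Two of three genes are flagged, so the check passes and reports the count.
n_selected = check_hvg_mask([True, False, True])
```

Failing early with a message that names the column would make the all-False case easier to diagnose than an error raised later inside the solver.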
I think the solution here is to be able to specify which key is used to filter features on.
Yeah, that might also work, but might also be too much flexibility, I'm not sure.
@gokceneraslan, what is adata.var["highly_variable"] supposed to mean once highly_variable_genes has been run with a batch key specified? I've checked a few datasets, and each time all the values were False.
Hmm, can you make a reproducible example? This should be a bug. How do the other fields, like adata.var["highly_variable_nbatches"] and adata.var["highly_variable_intersection"], look? Maybe a separate issue would be a better place to discuss.
Hey @gokceneraslan,
I'm surprised at how you describe the contents of adata.var['highly_variable'] when batch_key is set. I wrote a function that does pretty much exactly the same thing, building upon use of batch_key, for our data integration benchmarking, as I thought this wasn't available in scanpy. I recall looking through the code and thinking this was missing. Maybe we can compare functions for that to see if we're doing exactly the same thing or not?
Oh interesting, I thought it was clear :) I mean you even contributed to the function, no?
I think we also discussed why not to use intersection by default in the PR: https://github.com/theislab/scanpy/pull/614#issuecomment-485875031
If intersection is not used by default, why would we write in the documentation that it acts as a lightweight batch correction method? I'm as surprised as you are :)
Edit: adata.var["highly_variable_intersection"] wasn't even implemented in the beginning of the PR.
@gokceneraslan here's a quick example:
import scanpy as sc
pbmc = sc.datasets.pbmc3k_processed().raw.to_adata()
sc.pp.highly_variable_genes(pbmc, batch_key="louvain")
assert not pbmc.var["highly_variable"].any()
Alternatively:
pbmc = sc.datasets.pbmc3k_processed().raw.to_adata()
pbmc.obs["batch"] = "a"
sc.pp.highly_variable_genes(pbmc, batch_key="batch")
assert not pbmc.var["highly_variable"].any()
pbmc.obs["batch"] = "a"
pbmc.obs.loc[pbmc.obs_names[::2], "batch"] = "b"  # label every other cell "b" without chained assignment
sc.pp.highly_variable_genes(pbmc, batch_key="batch")
assert not pbmc.var["highly_variable"].any()
Oh interesting, I thought it was clear :) I mean you even contributed to the function, no?
I think we also discussed why not to use intersection by default in the PR: #614 (comment)
If intersection is not used by default, why would we write in the documentation that it acts as a lightweight batch correction method? I'm as surprised as you are :)
Yes, I fixed something and reorganized a bit. I also recall our discussion on highly_variable_intersection. However, I thought your organization of HVGs was only for the ranking in highly_variable_nbatches. Didn't see it's also the default for highly_variable. I never really looked at the docs... that would have given a hint... I still feel as though I have something slightly different though, if I recall. Will look more carefully once this benchmarking data integration thing is out.
@gokceneraslan here's a quick example:
Oh man, just noticed a horrible bug which leads to zero HVGs if batch_key is given but n_top_genes is not 😓 Somehow, highly_variable_genes with batch_key but without n_top_genes (which is the option I always use :) ) is never tested :/ Fixing now.
Fixed in #1180.
With the new batch_key option in highly_variable_genes, downstream functions like PCA can fail silently with the old defaults. The same is true for sc.pl.highly_variable_genes(adata), which currently doesn't recognize that the output key in adata.var is highly_variable_intersection rather than highly_variable.
The pca code doesn't error here, because highly_variable_intersection makes 'highly_variable' in adata.var.keys() evaluate to True:
Versions: