scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.93k stars 604 forks source link

sc.pp.filter_genes_dispersion() produces NaNs #172

Open LuckyMD opened 6 years ago

LuckyMD commented 6 years ago

I am trying to get the highly variable genes for a data set. The data set was normalized by fitting a negative binomial model and using the residuals as expression levels. This gives mean gene expression values that can be negative and are very close to 0.

When I use the command: disp_filter = sc.pp.filter_genes_dispersion(adata.X, min_mean=0, min_disp=0.5) I get very few differentially expressed genes.

Looking at the dispersions via disp_filter['dispersions'] shows that many dispersions appear to be NaN. And superficial inspection shows that the genes with negative means have NaN dispersions. This feels like it shouldn't be the case. It is possible to calculate the variance for the genes that have NaN dispersions. Are all negative dispersion values cast to NaN?

Changing the 'mean_mean' parameter to a negative value changes nothing.

falexwolf commented 6 years ago

This makes sense. Negative means lead to negative dispersion values and hence NaNs upon taking a log. https://github.com/theislab/scanpy/blob/457d3aa8b7dd7344914edc7991618067cab34dde/scanpy/preprocessing/simple.py#L299-L312

One simply shouldn't take a logarithm in this case, so, I suggest you pass log=False to the function.

Should we clarify the documentation: https://github.com/theislab/scanpy/blob/457d3aa8b7dd7344914edc7991618067cab34dde/scanpy/preprocessing/simple.py#L241-L243? Maybe put this right at the top so that it's not left unnoticed anymore?

LuckyMD commented 6 years ago

Ahh... this makes sense. And this makes it a bit dangerous as well. I would generally compute HVGs after batch correction.. and batch correction generally takes log-normalized data, so the data you have before performing this function will be log-normalized most of the time. I assume this is true not just for me.

Maybe log=False should be the default? Or at least a warning should be output I feel.

falexwolf commented 6 years ago

Yes, you're of course right. It's not the best default and has been written like that for historic reasons (figuring out the difference and benchmarking vs seurat and vs cell ranger). But simply changing something here messes up everything people have done.

I have to think of a quick way of checking whether data is logarithmized or not...

Maybe it's enough to simply check whether the data matrix still contains integers - then log should be True. Otherwise, a warning should be raised if log is True. Makes sense?

At some point, one might rethink the structure of filter_genes_dispersion... one could deprecate it at some point in favor of a new function extract_highly_variable_genes or something like this. Or one indeed changes the default behavior and prints a lot of warnings... Not very desirable though.

LuckyMD commented 6 years ago

Makes sense that backwards compatibility has to take priority.

Integer check will work for most cases (at least it checks if it's still count data). The only exception I can think of is for CPM/size factor normalized data where it would fail (although usually you'd do log-transformation after you normalize otherwise). I can't think of a better method atm either way.