scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

HVG by cell_ranger flavor, n_top_genes not working #662

Open mona-lit opened 5 years ago

mona-lit commented 5 years ago

Hi,

I'm using scanpy 1.4.2 to analyze my data with the following command:

sc.pp.highly_variable_genes(heart_cmc, flavor = 'cell_ranger', n_top_genes = 1000)

However, instead of 1000 HVGs, it reports 1488. Something similar happens with larger values (e.g. n_top_genes = 2000 returns 1999).

Scaling then fails with the following error: ValueError: The first guess on the deviance function returned a nan. This could be a boundary problem and should be reported.

Any suggestions on how to fix this? When I don't specify n_top_genes, everything runs without problems. Thanks!

cartal commented 5 years ago

Any updates on this one @flying-sheep? I keep having the same issue.

flying-sheep commented 5 years ago

n_top_genes is used here:

https://github.com/theislab/scanpy/blob/6ac6440f154922027e7b416affc53be6d4a9978d/scanpy/preprocessing/_highly_variable_genes.py#L140-L148

I would assume that this only happens if there are several genes with the exact same dispersion. Is that possible?

We need a reproducible example, else we can’t help you further: Please give me some lines of code that I can paste into a notebook unchanged that will demonstrate the problem.

LuckyMD commented 5 years ago

The number of HVGs not being exactly 1000 or 2000 is quite normal as dispersions can be exactly the same. 1488 is surprisingly high though. Maybe your dataset is very sparse so that you have a lot of dispersion ties for low count genes.
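
To make the tie argument concrete, here is a minimal sketch in plain Python. It assumes the selection works by taking the n-th largest dispersion as a cutoff and keeping every gene at or above it; that is my reading of the discussion above, not a copy of the scanpy code. The function name `select_top_by_cutoff` and the dispersion values are made up for illustration. It only shows how ties can push the count above n_top_genes, not why a count could come out lower.

```python
def select_top_by_cutoff(dispersions, n_top):
    """Keep every gene whose dispersion is >= the n_top-th largest value.

    Hypothetical helper: illustrates cutoff-based selection, not scanpy's code.
    """
    ranked = sorted(dispersions, reverse=True)
    cutoff = ranked[n_top - 1]  # the n_top-th largest dispersion
    # Every gene tied with the cutoff passes, so the result can exceed n_top.
    return [i for i, d in enumerate(dispersions) if d >= cutoff]


# Made-up normalized dispersions for 8 genes; four genes tie at 0.5.
dispersions = [2.0, 1.5, 0.5, 0.5, 0.5, 0.5, 0.1, 0.0]

selected = select_top_by_cutoff(dispersions, n_top=3)
print(f"requested 3 genes, got {len(selected)}")
```

With many low-count genes sharing one dispersion value, the same effect could plausibly account for a large overshoot like 1488.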

I'm not sure what your issue with scaling is about though. Have you filtered out genes with zero counts in every cell using sc.pp.filter_genes()? Such genes could be causing problems.
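
For reference, the scanpy call I have in mind would be something like sc.pp.filter_genes(heart_cmc, min_cells=1), run before the HVG step. Whether that resolves the ValueError is a guess on my part. The sketch below only illustrates, in plain Python on a made-up count matrix, what such a filter removes; `genes_detected` is a hypothetical helper, not a scanpy function.

```python
def genes_detected(counts, min_cells=1):
    """Return indices of genes with a nonzero count in at least min_cells cells.

    Hypothetical helper mimicking the idea of a min_cells gene filter.
    `counts` is a list of cells, each a list of per-gene counts.
    """
    n_genes = len(counts[0])
    keep = []
    for g in range(n_genes):
        n_cells = sum(1 for cell in counts if cell[g] > 0)
        if n_cells >= min_cells:
            keep.append(g)
    return keep


# Made-up 3 cells x 3 genes matrix; the middle gene is zero everywhere.
counts = [
    [3, 0, 1],
    [0, 0, 2],
    [5, 0, 0],
]

kept = genes_detected(counts)
filtered = [[cell[g] for g in kept] for cell in counts]
print(kept)
print(filtered)
```

A gene that is zero in every cell has zero variance, so it is a natural suspect when a later step produces NaNs.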