owkin / PyDESeq2

A Python implementation of the DESeq2 pipeline for bulk RNA-seq DEA.
https://pydeseq2.readthedocs.io/en/latest/
MIT License

Handling NaN counts #25

Open BorisMuzellec opened 1 year ago

BorisMuzellec commented 1 year ago

Currently, PyDESeq2 throws an error when trying to initialise a DeseqDataSet with a count matrix that contains NaNs – this is to reproduce DESeq2's behaviour.

As pointed out by @arthurPignetOwkin, it seems like it would make sense to simply raise a warning instead and carry on with the analysis, and return NaNs for dispersions, LFCs, and p-values of genes that have NaN counts (as we already do for genes whose counts are all-zero).
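To make the proposal concrete, here is a minimal sketch of the "warn and carry on" behaviour, written with plain NumPy. The helper name `find_nan_genes` is hypothetical and not part of PyDESeq2, and the per-gene mean is only a stand-in for the real dispersion, LFC and p-value computations. It assumes a dense samples x genes count array.

```python
import warnings

import numpy as np


def find_nan_genes(counts):
    """Return a boolean mask of genes (columns) with at least one NaN count.

    Hypothetical helper for illustration; not part of PyDESeq2.
    """
    counts = np.asarray(counts, dtype=float)
    nan_genes = np.isnan(counts).any(axis=0)
    if nan_genes.any():
        # Warn instead of raising, so the analysis can continue
        warnings.warn(
            f"{int(nan_genes.sum())} gene(s) contain NaN counts; "
            "their results will be NaN.",
            UserWarning,
        )
    return nan_genes


# Toy samples x genes matrix; the second gene has a missing count
counts = np.array(
    [
        [10.0, 0.0, 3.0],
        [12.0, np.nan, 5.0],
        [8.0, 1.0, 4.0],
    ]
)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the warning in this demo
    nan_genes = find_nan_genes(counts)

# Stand-in statistic: computed on valid genes only, NaN everywhere else
stat = np.full(counts.shape[1], np.nan)
stat[~nan_genes] = counts[:, ~nan_genes].mean(axis=0)
```

Genes flagged in `nan_genes` keep their NaN placeholder in `stat`, which mirrors how the comment above describes all-zero genes being handled.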

fairliereese commented 1 year ago

Just wanted to add that I encountered this error when trying to run on a sparse matrix. When I densify it, it runs fine, but I am sure that users with large data matrices will appreciate being able to run without having to densify them.

koh-joshua commented 1 year ago

Hopefully this gets implemented soon.

mortonjt commented 11 months ago

Hi, I'm also noticing that the default functionality breaks when the input data isn't densified ahead of time -- the internal validation functions assume that the input counts are dense numpy / pandas objects, despite the default AnnData behavior recasting these inputs into sparse matrices.
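As an illustration of what sparse-aware validation could look like, here is a rough sketch. The function `check_counts` is hypothetical and is not PyDESeq2's actual validation code; it assumes the input is either a dense array or a SciPy sparse matrix. For sparse input only the explicitly stored values are inspected, since implicit zeros are always valid counts, so nothing is densified.

```python
import numpy as np
from scipy import sparse


def check_counts(X):
    """Validate a count matrix without densifying sparse input.

    Hypothetical helper for illustration; not PyDESeq2's actual code.
    """
    if sparse.issparse(X):
        # Only stored entries can be invalid; implicit zeros are fine
        values = X.tocsr().data
    else:
        values = np.asarray(X)
    values = np.asarray(values, dtype=float)

    if np.isnan(values).any():
        raise ValueError("NaNs are not allowed in the count matrix.")
    if (values < 0).any():
        raise ValueError("Counts must be non-negative.")
    if (values % 1 != 0).any():
        raise ValueError("Counts must be integers.")


dense = np.array([[1, 0, 3], [0, 0, 2]])
check_counts(dense)  # dense input passes
check_counts(sparse.csr_matrix(dense))  # sparse input passes too
```

Because the checks run on the stored values only, their cost scales with the number of non-zero entries rather than with the full matrix size.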

I'm not exactly sure how the tutorials on the main website are able to run in the first place -- I have not been able to run any of these tutorials (with new data) without first recasting the count matrix to a dense array via

import numpy as np
from pydeseq2.dds import DeseqDataSet

dds = DeseqDataSet(counts=df, metadata=md)
# Workaround: convert the sparse count matrix to a dense NumPy array
dds.X = np.array(dds.X.todense())