ENH Remove count DataFrame from calculate_cooks

owkin / PyDESeq2

A Python implementation of the DESeq2 pipeline for bulk RNA-seq DEA.

https://pydeseq2.readthedocs.io/en/latest/

MIT License

573 stars 60 forks source link

ENH Remove count DataFrame from calculate_cooks #292

Closed asistradition closed 2 months ago

asistradition commented 3 months ago

What does your PR implement? Be specific.

calculate_cooks casts normed_counts into a pandas DataFrame for robust_method_of_moments_disp. This is memory inefficient for large data.

robust_method_of_moments_disp has been refactored to accept an ndarray directly and the DataFrame has been removed. There is no numerical change as a result.

asistradition commented 3 months ago

The main advantage to using .obsm is that column slicing a row-major array requires a copy, and there is a considerable amount of overhead when calling .layers[key][:, filtered_genes] repeatedly.

As those keys are only used in cooks_distances it is a considerable optimization (for large data, e.g. 50k x 30k) to move them to .obsm remove those copies.

asistradition commented 2 months ago

Also includes an optional control_genes argument to fit_size_factors which has the same behavior as the controlGenes argument to estimateSizeFactors and the associated unit test

BorisMuzellec commented 2 months ago

Thanks @asistradition for this PR!

I agree with @umarteauowkin that on principle storing _mu_LFC and _hat_diagonals in the obsm and not in the layers is a bit awkward, but if this leads to memory gains I'm fine with it, given that (as you pointed out) they are only used in cooks_distances.