theislab / diffxpy

Differential expression analysis for single-cell RNA-seq data.
https://diffxpy.rtfd.io
BSD 3-Clause "New" or "Revised" License

Diffxpy returns log2fc and coef_mle values as dask array #223

Open lippspatrick opened 1 year ago

lippspatrick commented 1 year ago

Hello, thanks a lot for your package.

I'm currently facing an issue where, after running the Wald test, the values for log2fc and coef_mle are returned as a dask array rather than as numeric values.

I'm using the following versions:

- anndata 0.7.8
- diffxpy 0.7.4+103.gceb79aa (ig/batchglm_api_update)
- batchglm 0.7.4.dev_0
- dask 2021.4.1
- sparse 0.9.1

I'm using batchglm and diffxpy from those branches because of issue #93.

I'm running the following code:

    adata.X = adata.layers["raw_counts"]
    res = de.test.wald(data=adata, formula_loc="~ 1 + GROUP", factor_loc_totest="GROUP")

This prints:

    The provided param_names are ignored as the design matrix is of type <class 'patsy.design_info.DesignMatrix'>.
    The provided param_names are ignored as the design matrix is of type <class 'patsy.design_info.DesignMatrix'>.
    /lippat00/python/diffxpy_packages/batchglm/batchglm/models/base_glm/utils.py:102: PerformanceWarning: Slicing is producing a large chunk. To accept the large chunk and silence this warning, set the option
        >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
        ...     array[indexer]
    To avoid creating the large chunks, set the option
        >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
        ...     array[indexer]
      np.vstack([np.mean(densify(x[np.where(grouping == g)[0], :]), axis=0) for g in np.unique(grouping)])

The same PerformanceWarning is emitted again from utils.py:151 (the same np.vstack(...) expression) and from utils.py:168 ([np.mean(np.square(densify(x[np.where(grouping == g)[0], :])), axis=0) for g in np.unique(grouping)]).
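As an aside, the repeated PerformanceWarning can be suppressed by setting the dask option it mentions before calling the test. A minimal sketch, assuming adata and the de.test.wald call from above (this only changes dask's chunking heuristics, not the test results):

```python
import dask

# Accept the large chunks and silence the repeated PerformanceWarning.
with dask.config.set({"array.slicing.split_large_chunks": False}):
    res = de.test.wald(
        data=adata,
        formula_loc="~ 1 + GROUP",
        factor_loc_totest="GROUP",
    )
```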

training location model: False
training scale model: True
iter   0: ll=70663625.678866
iter   1: ll=70663625.678866, converged: 0.00% (loc: 100.00%, scale update: False), in 0.00sec
iter   2: ll=50303650.242279, converged: 11.08% (loc: 11.08%, scale update: True), in 177.80sec
iter   3: ll=50303650.242279, converged: 11.08% (loc: 100.00%, scale update: False), in 0.00sec
iter   4: ll=50076239.645603, converged: 79.02% (loc: 79.02%, scale update: True), in 149.27sec
iter   5: ll=50076239.645603, converged: 79.02% (loc: 100.00%, scale update: False), in 0.00sec
iter   6: ll=50039459.932892, converged: 92.41% (loc: 92.41%, scale update: True), in 59.11sec
iter   7: ll=50039459.932892, converged: 92.41% (loc: 100.00%, scale update: False), in 0.00sec
iter   8: ll=50031801.602276, converged: 98.12% (loc: 98.12%, scale update: True), in 44.64sec
iter   9: ll=50031801.602276, converged: 98.12% (loc: 100.00%, scale update: False), in 0.00sec
iter  10: ll=50030709.007629, converged: 99.70% (loc: 99.70%, scale update: True), in 41.18sec
iter  11: ll=50030709.007629, converged: 99.70% (loc: 100.00%, scale update: False), in 0.00sec
iter  12: ll=50030695.725333, converged: 99.97% (loc: 99.97%, scale update: True), in 36.95sec
iter  13: ll=50030695.725333, converged: 99.97% (loc: 100.00%, scale update: False), in 0.00sec
iter  14: ll=50030652.029279, converged: 99.99% (loc: 99.99%, scale update: True), in 23.52sec
iter  15: ll=50030652.029279, converged: 99.99% (loc: 100.00%, scale update: False), in 0.00sec
iter  16: ll=50030652.029279, converged: 100.00% (loc: 100.00%, scale update: True), in 155.19sec
/mambaforge/envs/diffxpy/lib/python3.9/site-packages/dask/array/core.py:2912: RuntimeWarning: divide by zero encountered in true_divide
  size = (limit / dtype.itemsize / largest_block) ** (1 / len(autos))

However res.summary() then returns:

gene        pval      qval      log2fc                                              mean      zero_mean  grad          coef_mle                                            coef_sd       ll
AL627309.1  0.080515  0.237762  [dask.array<getitem, shape=(), dtype=float64, ...  0.001634  False      3.163008e-03  [dask.array<getitem, shape=(), dtype=float64, ...  6.267832e-01  0.000000

I also tried densifying the matrix first with adata.X = adata.X.toarray(), which returns the same result.
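Until this is fixed upstream, one workaround on the caller side is to materialize the lazy entries of the summary DataFrame after the fact. A minimal sketch; the materialize helper is my own, and it assumes only that lazy values expose a dask-style .compute():

```python
import numpy as np

def materialize(value):
    """Turn a possibly-lazy (dask) scalar or array into a plain numeric value."""
    if hasattr(value, "compute"):   # dask arrays evaluate via .compute()
        value = value.compute()
    value = np.asarray(value)
    # Unwrap 0-d arrays to plain Python scalars.
    return value.item() if value.ndim == 0 else value
```

With res_df = res.summary(), something like res_df["log2fc"] = res_df["log2fc"].map(materialize) should then yield plain floats in the affected columns.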

Looking forward to your response, in case you need additional information, please let me know.

Best, Patrick

c-westhoven commented 10 months ago

I have the same issue with summary() returning dask arrays.

I'm using:

- anndata 0.9.2
- diffxpy 0.7.4+103.gceb79aa (ig/batchglm_api_update)
- batchglm 0.7.4.dev
- dask 2021.4.1
- sparse 0.9.1

My adata.X is raw data.

Running de.test.wald prints:

The provided param_names are ignored as the design matrix is of type <class 'patsy.design_info.DesignMatrix'>.

The provided param_names are ignored as the design matrix is of type <class 'patsy.design_info.DesignMatrix'>.
AnnData object with n_obs × n_vars = 300 × 500
    obs: 'sample_x', 'dataset', 'condition'
training location model: False
training scale model: True
iter   0: ll=142879.916890
iter   1: ll=142879.916890, converged: 0.00% (loc: 100.00%, scale update: False), in 0.00sec
iter   2: ll=32021.056719, converged: 41.20% (loc: 41.20%, scale update: True), in 3.86sec
iter   3: ll=32021.056719, converged: 41.20% (loc: 100.00%, scale update: False), in 0.00sec
iter   4: ll=30460.923372, converged: 87.40% (loc: 87.40%, scale update: True), in 2.47sec
iter   5: ll=30460.923372, converged: 87.40% (loc: 100.00%, scale update: False), in 0.00sec
iter   6: ll=30281.772796, converged: 96.00% (loc: 96.00%, scale update: True), in 1.26sec
iter   7: ll=30281.772796, converged: 96.00% (loc: 100.00%, scale update: False), in 0.00sec
iter   8: ll=30258.987554, converged: 99.20% (loc: 99.20%, scale update: True), in 1.01sec
iter   9: ll=30258.987554, converged: 99.20% (loc: 100.00%, scale update: False), in 0.00sec
iter  10: ll=30250.794861, converged: 99.80% (loc: 99.80%, scale update: True), in 0.73sec
iter  11: ll=30250.794861, converged: 99.80% (loc: 100.00%, scale update: False), in 0.00sec
iter  12: ll=30250.794861, converged: 100.00% (loc: 100.00%, scale update: True), in 0.53sec
/home/c490n/.local/lib/python3.8/site-packages/dask/array/core.py:2912: RuntimeWarning: divide by zero encountered in true_divide
  size = (limit / dtype.itemsize / largest_block) ** (1 / len(autos))

My .summary output does not include coef_mle or coef_sd.

c-westhoven commented 10 months ago

I'm able to get numeric values for coef_mle by converting self.theta_mle to an np.array in _test in diffxpy/testing/det.py. Without that conversion, the dask array containing the MLE information for all elements is saved into the summary DataFrame.

def _test(self):
        """
        Returns the p-value for differential expression for each gene

        :return: np.ndarray
        """
        # Check whether single or multiple parameters are tested.
        # For a single parameter, the Wald statistic distribution is approximated
        # with a normal distribution; for multiple parameters, a chi-square distribution is used.
        self.theta_mle = self.model_estim.model_container.theta_location[self.coef_loc_totest]
        dir_path = os.path.dirname(os.path.realpath(__file__))
        if len(self.coef_loc_totest) == 1:
            print("_test, == 1", dir_path, self.theta_mle)
            # fix: materialize the dask scalar as a NumPy array
            self.theta_mle = np.array(self.theta_mle[0])

            print("_test, == 1, np.array", dir_path, self.theta_mle)
...
            )
c-westhoven commented 10 months ago

I'm not sure whether these quick fixes address the underlying issue.

But for log2fc I also added an np.array() call to the return of the log2_fold_change function in det.py. I think adding np.array to the log_fold_change function should also work.

    def log2_fold_change(self, **kwargs):
        """
        Calculates the pairwise log_2 fold change(s) for this DifferentialExpressionTest.
        """
        # return self.log_fold_change(base=2, **kwargs)
        return np.array(self.log_fold_change(base=2, **kwargs))

With the change to log2_fold_change, my error in test.plot_volcano ("int is not iterable") was resolved as well.
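The same conversion can also be applied from the caller side without editing det.py, by wrapping the result object. A sketch; the helper name as_numpy_log2fc is hypothetical, not part of diffxpy, and relies on np.asarray forcing evaluation of a lazy dask result via its __array__ protocol:

```python
import numpy as np

def as_numpy_log2fc(res, **kwargs):
    """Hypothetical helper: return res.log2_fold_change(...) as a dense
    NumPy array, forcing evaluation of any lazy dask result."""
    return np.asarray(res.log2_fold_change(**kwargs))
```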