owkin / PyDESeq2

A Python implementation of the DESeq2 pipeline for bulk RNA-seq DEA.
https://pydeseq2.readthedocs.io/en/latest/
MIT License
588 stars 63 forks source link

[BUG] stuck on dds.deseq2() #338

Open Erikado4 opened 3 days ago

Erikado4 commented 3 days ago

Everytime I DeseqDataSet using my own data (as counts/metadata dataframes or straight from my anndata object) or using the test data on the GettingStarted docs, I then try to run dds.deseq2() and the RAM shoots up and the kernel crashes.

The parameters for DeseqDataSet in the tutorials and the current version do not match.

Reallly would've liked to use this tool :(

BorisMuzellec commented 2 days ago

Hi @Erikado4, I'm going to need a bit more information to be able to help you.

Could you fill in the bug template below?

Describe the bug A clear and concise description of what the bug is. NB: for questions about pydeseq2 that are not related to a bug, please open a topic on the scverse ecosystem Discourse forum.

To Reproduce Provide snippets of code and steps on how to reproduce the behavior. Please also specify the version you are using.

Expected behavior A clear and concise description of what you expected to happen.

Screenshots If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

Additional context Add any other context about the problem here.

Erikado4 commented 2 days ago

I use the same code from the tutorial: [https://pydeseq2.readthedocs.io/en/latest/auto_examples/plot_pandas_io_example.html#sphx-glr-auto-examples-plot-pandas-io-example-py] Loading data and saving results with pandas and pickle

Except design="~condition" is no longer a recognized parameter so I changed it too design_factors="condition".

Then with the final dds.deseq2() it is stuck here indefinitely.

image

import os
import pickle as pkl

import pandas as pd

from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats

DATA_PATH = "https://raw.githubusercontent.com/owkin/PyDESeq2/main/datasets/synthetic/"
counts_df = pd.read_csv(os.path.join(DATA_PATH, "test_counts.csv"), index_col=0)
print(counts_df)

counts_df = counts_df.T
print(counts_df)

metadata = pd.read_csv(os.path.join(DATA_PATH, "test_metadata.csv"), index_col=0)
print(metadata)

genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]

inference = DefaultInference(n_cpus=8)
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design_factors="condition",
    refit_cooks=True,
    inference=inference,
)

dds.deseq2()

pydeseq2 0.4.12 Model: Precision 7960 Tower OS: Arch Linux x86_64 Kernel: 6.11.5-arch1-1 Shell: zsh

Erikado4 commented 2 days ago

Update: I tried to run each step individually from the Step-by-Step tutorial (again with the synthetic data) and it seems to be getting stuck on dds.fit_genewise_dispersions()

image

BorisMuzellec commented 1 day ago

Hi @Erikado4,

Thanks for providing the details.

I'm a bit lost here because I can't reproduce the issue on my machine (Mac OS). I'm assuming this bug has something to do with the fact you're using Arch Linux.

Given where the code is stuck, I think it's either the joblib parallelization that is the problem, or fit_lin_mu, which calls the linear regression from scikit learn (in which case the problem could come from low-level linear algebra librairies like BLAS, but this is a wild guess).

Could you try the code below and tell me if anything different happens?

import os
import pickle as pkl

import pandas as pd

from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats

DATA_PATH = "https://raw.githubusercontent.com/owkin/PyDESeq2/main/datasets/synthetic/"
counts_df = pd.read_csv(os.path.join(DATA_PATH, "test_counts.csv"), index_col=0)
print(counts_df)

counts_df = counts_df.T
print(counts_df)

metadata = pd.read_csv(os.path.join(DATA_PATH, "test_metadata.csv"), index_col=0)
print(metadata)

genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10]
counts_df = counts_df[genes_to_keep]

inference = DefaultInference(n_cpus=8, backend="threading")
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design_factors="condition",
    refit_cooks=True,
    inference=inference,
)

dds.deseq2()

It's the same thing with a different joblib backend (inference = DefaultInference(n_cpus=8, backend="threading")).

Erikado4 commented 1 day ago

image

Erikado4 commented 1 day ago

image

Erikado4 commented 1 day ago

Thanks for trying to help out! I get this error now