owkin / PyDESeq2

A Python implementation of the DESeq2 pipeline for bulk RNA-seq DEA.
https://pydeseq2.readthedocs.io/en/latest/
MIT License
573 stars 60 forks source link

[BUG] Overflow in `.deseq2()` and and `lfc_shrink()` #169

Open BeyondTheProof opened 1 year ago

BeyondTheProof commented 1 year ago

To Reproduce

dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design_factors=['condition', "RNA_Qubit_actual", 'treatment'],
    continuous_factors=["RNA_Qubit_actual"],
    refit_cooks=True,
    n_cpus=12,
)

dds.deseq2()

Output:

... done in 0.09 seconds.

/home/jupyter/env/mambaforge/envs/rome/lib/python3.11/site-packages/pydeseq2/utils.py:553: RuntimeWarning: overflow encountered in exp
  mu_ = np.maximum(size_factors * np.exp(X @ beta), min_mu)
/home/jupyter/env/mambaforge/envs/rome/lib/python3.11/site-packages/pydeseq2/utils.py:404: RuntimeWarning: invalid value encountered in subtract
  -logbinom
/home/jupyter/env/mambaforge/envs/rome/lib/python3.11/site-packages/pydeseq2/utils.py:557: RuntimeWarning: overflow encountered in exp
  mu_ = np.maximum(size_factors * np.exp(X @ beta), min_mu)
/home/jupyter/env/mambaforge/envs/rome/lib/python3.11/site-packages/pydeseq2/utils.py:560: RuntimeWarning: invalid value encountered in divide
  + ((1 / disp + counts) * mu_ / (1 / disp + mu_)) @ X
...
Fitting dispersions...
... done in 5.68 seconds.

Fitting dispersion trend curve...
... done in 10.39 seconds.

Fitting MAP dispersions...
... done in 6.41 seconds.

Fitting LFCs...
/home/jupyter/env/mambaforge/envs/rome/lib/python3.11/site-packages/pydeseq2/utils.py:406: RuntimeWarning: invalid value encountered in multiply
  - counts * np.log(mu)
... done in 6.06 seconds.

Refitting 0 outliers.

This part finishes, but gives runtime warnings. Will this make the output incorrect? Why would this happen? Also, I'm getting something similar with stat_res.lfc_shrink():

cooks_filter=True

stat_res = DeseqStats(
    dds,
    contrast=["RNA-Qubit-actual", "", ""],
    alpha=0.05,
    cooks_filter=True,
    independent_filter=True,
)
stat_res.run_wald_test()

# Filter based on Cook's outliers
if cooks_filter:
    stat_res._cooks_filtering()

# Does either independent filtering, where genes are filtered out based on low counts
if stat_res.independent_filter:
    stat_res._independent_filtering()
# or uses the Benjamini-Hochberg procedure to adjust p-values
else:
    stat_res._p_value_adjustment()

# Shrink log fold changes toward 0
stat_res.lfc_shrink()

Output:

Fitting MAP LFCs...
/home/jupyter/env/mambaforge/envs/rome/lib/python3.11/site-packages/pydeseq2/utils.py:1185: RuntimeWarning: overflow encountered in exp
  counts - (counts + size) / (1 + size * np.exp(-xbeta - offset))
/home/jupyter/env/mambaforge/envs/rome/lib/python3.11/site-packages/pydeseq2/utils.py:1185: RuntimeWarning: overflow encountered in exp
  counts - (counts + size) / (1 + size * np.exp(-xbeta - offset))
/home/jupyter/env/mambaforge/envs/rome/lib/python3.11/site-packages/pydeseq2/utils.py:1185: RuntimeWarning: overflow encountered in exp
...
... done in 27.00 seconds.

Using version 0.4.0

Expected behavior Not getting overflows. Maybe due to precision? Maybe need long floats? My suspicion is that this is happening when certain combinations of design factors yield 0 counts, and it is impossible to determine a dispersion for that gene. What is the best approach here?

Desktop (please complete the following information):

BorisMuzellec commented 1 year ago

Hi @BeyondTheProof, it's a bit hard for me to guess what's going wrong here... Apparently the overflow starts in fit_genewise_dispersions when calling irls_solver to fit initial mu values, but I'm not sure why. Does your data have very large counts values?

sermare commented 11 months ago

A silly solution @BeyondTheProof Is to limit the number of genes to < 500. Here:

n = 499  # Change this to the desired number of columns not greater than 500
random_genes = np.random.choice(genes_to_keep, size=n, replace=False)

# Create a copy of the DataFrame with only the randomly selected columns
counts_unstranded_random = gbm_counts_unstranded[random_genes]

`Fitting size factors... ... done in 0.01 seconds.

Fitting dispersions... ... done in 0.32 seconds.

Fitting dispersion trend curve... ... done in 0.25 seconds.

Fitting MAP dispersions... ... done in 0.36 seconds.

Fitting LFCs... ... done in 0.15 seconds.

Refitting 30 outliers.

Fitting dispersions... ... done in 0.03 seconds.

Fitting MAP dispersions... ... done in 0.03 seconds.

Fitting LFCs... ... done in 0.02 seconds. `

sermare commented 11 months ago

Let me correct myself just do:


def rename_duplicate_columns(df):
    cols = pd.Series(df.columns)
    for dup in cols[cols.duplicated()].unique():
        cols[cols[cols == dup].index.values.tolist()] = [dup + '_' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
    df.columns = cols

avg_counts = gbm_counts_unstranded.mean()

# Filter columns based on the average
filtered_columns = avg_counts[(avg_counts > 10)].index # & (avg_counts < 10000)

print(len(filtered_columns))

# Subset the data frame based on the filtered columns
filtered_df = gbm_counts_unstranded[filtered_columns]

It should work after it.