Open BeyondTheProof opened 1 year ago
Hi @BeyondTheProof, it's a bit hard for me to guess what's going wrong here...
Apparently the overflow starts in `fit_genewise_dispersions` when calling `irls_solver` to fit initial mu values, but I'm not sure why.
Does your data have very large count values?
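One quick way to answer that question is to summarize the magnitude of the counts table before fitting. The following is a minimal sketch, assuming a samples-by-genes pandas DataFrame; `summarize_counts` and the toy data are hypothetical helpers for illustration and are not part of PyDESeq2.

```python
import numpy as np
import pandas as pd


def summarize_counts(counts: pd.DataFrame) -> dict:
    """Return basic magnitude statistics for a samples x genes counts table."""
    values = counts.to_numpy(dtype=float)
    return {
        "max": float(values.max()),
        "median": float(np.median(values)),
        # Genes (columns) whose counts are zero in every sample
        "n_all_zero_genes": int((values.sum(axis=0) == 0).sum()),
        # True if any entry is NaN or infinite
        "has_non_finite": not bool(np.isfinite(values).all()),
    }


# Toy data for illustration only (hypothetical gene and sample names)
toy = pd.DataFrame(
    {"geneA": [0, 5, 12], "geneB": [0, 0, 0], "geneC": [3, 2_000_000, 7]},
    index=["s1", "s2", "s3"],
)
summary = summarize_counts(toy)
print(summary)
```

A very large maximum relative to the median, or any non-finite entry, would be worth reporting back in the thread.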
A silly solution, @BeyondTheProof, is to limit the number of genes to fewer than 500. Here:
import numpy as np

n = 499  # Change this to the desired number of columns, not greater than 500
# Randomly pick n genes without replacement
random_genes = np.random.choice(genes_to_keep, size=n, replace=False)
# Create a copy of the DataFrame with only the randomly selected columns
counts_unstranded_random = gbm_counts_unstranded[random_genes]
`Fitting size factors... ... done in 0.01 seconds.
Fitting dispersions... ... done in 0.32 seconds.
Fitting dispersion trend curve... ... done in 0.25 seconds.
Fitting MAP dispersions... ... done in 0.36 seconds.
Fitting LFCs... ... done in 0.15 seconds.
Refitting 30 outliers.
Fitting dispersions... ... done in 0.03 seconds.
Fitting MAP dispersions... ... done in 0.03 seconds.
Fitting LFCs... ... done in 0.02 seconds. `
Let me correct myself. Just do:
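import pandas as pd

def rename_duplicate_columns(df):
    """Make column names unique in place by suffixing repeats with _1, _2, ..."""
    cols = pd.Series(df.columns)
    for dup in cols[cols.duplicated()].unique():
        dup_idx = cols[cols == dup].index
        # The first occurrence keeps its name; later ones get a numeric suffix
        cols[dup_idx] = [dup if i == 0 else f"{dup}_{i}" for i in range(len(dup_idx))]
    df.columns = cols

# Assumed usage: the original snippet defines the function but does not show the call
rename_duplicate_columns(gbm_counts_unstranded)

# Average count of each gene (column) across samples
avg_counts = gbm_counts_unstranded.mean()
# Filter columns based on the average
filtered_columns = avg_counts[avg_counts > 10].index  # optionally also: & (avg_counts < 10000)
print(len(filtered_columns))
# Subset the data frame based on the filtered columns
filtered_df = gbm_counts_unstranded[filtered_columns]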
It should work after that.
To Reproduce
Output:
This part finishes, but gives runtime warnings. Will this make the output incorrect? Why would this happen? Also, I'm getting something similar with `stat_res.lfc_shrink()`:
Output:
Using version 0.4.0
Expected behavior
Not getting overflows. Maybe due to precision? Maybe need long floats? My suspicion is that this is happening when certain combinations of design factors yield 0 counts, and it is impossible to determine a dispersion for that gene. What is the best approach here?
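The suspicion about design-factor combinations with zero counts can be checked directly on the data. The sketch below assumes a samples-by-genes counts DataFrame and a single condition label per sample; `genes_with_all_zero_group` and the toy data are hypothetical and not part of PyDESeq2. It only flags candidate genes and does not confirm that they cause the overflow.

```python
import pandas as pd


def genes_with_all_zero_group(counts: pd.DataFrame, condition: pd.Series) -> list:
    """List genes whose counts are all zero in at least one condition level.

    counts: samples x genes; condition: per-sample labels on the same index.
    """
    group_sums = counts.groupby(condition).sum()  # levels x genes
    zero_in_some_group = (group_sums == 0).any(axis=0)
    return zero_in_some_group[zero_in_some_group].index.tolist()


# Toy data for illustration only (hypothetical gene and sample names)
counts = pd.DataFrame(
    {"geneA": [0, 0, 4, 9], "geneB": [1, 2, 3, 4], "geneC": [0, 0, 0, 0]},
    index=["s1", "s2", "s3", "s4"],
)
condition = pd.Series(
    ["ctrl", "ctrl", "treated", "treated"], index=counts.index, name="condition"
)
flagged = genes_with_all_zero_group(counts, condition)
print(flagged)
```

For a multi-factor design, the same idea applies after combining the factor columns into one label per sample. If the warnings disappear once the flagged genes are dropped, that would support the suspicion.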
Desktop (please complete the following information):