hadasvolk / CompLabNGS

Computational Lab in Next Generation Sequencing and Genomics Data Analysis - TAU 0411358701
MIT License

How to convert DESeq results to a DataFrame? #11

Closed ranelmalka100 closed 4 months ago

ranelmalka100 commented 4 months ago

Hello :)

After dealing with a lot of errors, I've finally got the expected table out of the deseq script:

Log2 fold change & Wald test p-value: strain S288C vs RM11
      baseMean  log2FoldChange     lfcSE      stat    pvalue  padj
0     0.000000             NaN       NaN       NaN       NaN   NaN
1     0.000000             NaN       NaN       NaN       NaN   NaN
2     0.000000             NaN       NaN       NaN       NaN   NaN
3     0.160284        0.879404  2.528475  0.347800  0.727990   NaN
4     2.424527        4.707459  2.135109  2.204786  0.027469   NaN
...        ...             ...       ...       ...       ...   ...
6692  0.000000             NaN       NaN       NaN       NaN   NaN
6693  1.950290       -0.904463  1.170501 -0.772715  0.439691   NaN
6694  0.641902        1.432367  1.793678  0.798564  0.424543   NaN
6695  2.444357        0.447478  1.025930  0.436168  0.662715   NaN
6696  1.110214        0.328981  1.382793  0.237910  0.811951   NaN

[6697 rows x 6 columns]

I'm now trying to move forward and filter the table by log2FoldChange > 1 and pvalue < 0.05, but the results DataFrame I'm trying to get seems to be empty. This is my code:

import pandas as pd
from pydeseq2.default_inference import DefaultInference
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# Read sample info into DataFrame with 'sample' column as index
sample_info_df = pd.read_csv('sample_info.tsv', sep='\t', index_col='sample')

# Read count data into DataFrame (assuming counts.tsv is in the same directory)
count_data_df = pd.read_csv('counts.tsv', sep='\t', skiprows=[0])

# Drop irrelevant columns
count_data_df = count_data_df.drop(columns=['Geneid', 'Chr', 'Start', 'End', 'Strand', 'Length'])

# Transpose the count data
transposed_df = count_data_df.transpose()

# Create an instance of DefaultInference
inference = DefaultInference(n_cpus=8)

# Create a DESeqDataSet object
dds = DeseqDataSet(
    counts=transposed_df,
    metadata=sample_info_df,
    design_factors=['batch', 'strain'],
    refit_cooks=True,
    inference=inference
)

# Run DESeq2 analysis
dds.deseq2()
results = DeseqStats(dds, inference=inference)

print(results.summary())

summary_df = pd.DataFrame(results.summary())

filtered_results = summary_df[summary_df['pvalue'] < 0.05]

And these are the errors I'm getting (also with log2FoldChange filtering; ignore the padj I tried to filter by, my mistake):

AttributeError: 'NoneType' object has no attribute 'summary'

File "/home/nofar/new.py", line 40, in <module>
    filtered_results = results.summary()[(results.summary()['log2FoldChange'] > 1) & (results.summary()['padj'] < 0.05)]
TypeError: 'NoneType' object is not subscriptable

I would be happy to get some help.
Thanks in advance, Nofar :-]

ranelmalka100 commented 4 months ago

I didn't mean to upload this with these big headlines, but it's cool ><

hadasvolk commented 4 months ago

After producing the DESeq statistics (DeseqStats), one of the attributes of the returned object is the results pandas DataFrame you are after:

results = DeseqStats(dds, inference=inference)
print(results.results_df) # Will print the dataframe you are after
print(type(results.results_df)) # To validate that this is indeed a pd.DataFrame object

https://pydeseq2.readthedocs.io/en/latest/api/docstrings/pydeseq2.ds.DeseqStats.html#pydeseq2.ds.DeseqStats.results_df

ranelmalka100 commented 4 months ago

I still get an error, this time:

Traceback (most recent call last):
  File "/home/nofar/new.py", line 37, in <module>
    print(results.results_df) # Will print the dataframe you are after
          ^^^^^^^^^^^^^^^^^^
AttributeError: 'DeseqStats' object has no attribute 'results_df'

I tried changing the name of the results variable, or accessing it through another attribute, such as results_df = results.results, but I still cannot access the DESeq DataFrame.

hadasvolk commented 4 months ago

Did you run results.summary()?

results = DeseqStats(dds, inference=inference)
results.summary()
print(results.results_df) # Will print the dataframe you are after
print(type(results.results_df)) # To validate that this is indeed a pd.DataFrame object
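
Once `results.summary()` has run and `results_df` exists, the filtering from the original question is ordinary pandas boolean indexing. The errors above come from `summary()` returning `None`: it prints the table and stores it on the object rather than returning it. Below is a minimal sketch of the filter. To keep it runnable without the count files, it assumes a small stand-in DataFrame built from a few rows of the table posted above; in the real script, `results_df` would simply be `results.results_df`.

```python
import numpy as np
import pandas as pd

# Stand-in for results.results_df (an assumption for illustration):
# a few rows copied from the table posted earlier in this thread.
# In the real script, use `results_df = results.results_df` after `results.summary()`.
results_df = pd.DataFrame(
    {
        "baseMean": [0.000000, 0.160284, 2.424527, 1.950290, 2.444357],
        "log2FoldChange": [np.nan, 0.879404, 4.707459, -0.904463, 0.447478],
        "lfcSE": [np.nan, 2.528475, 2.135109, 1.170501, 1.025930],
        "stat": [np.nan, 0.347800, 2.204786, -0.772715, 0.436168],
        "pvalue": [np.nan, 0.727990, 0.027469, 0.439691, 0.662715],
        "padj": [np.nan] * 5,
    },
    index=[0, 3, 4, 6693, 6695],
)

# Comparisons against NaN evaluate to False, so rows without a test result drop out.
mask = (results_df["log2FoldChange"] > 1) & (results_df["pvalue"] < 0.05)
filtered_results = results_df[mask]

print(filtered_results)
```

To keep strongly changed genes in both directions, `results_df["log2FoldChange"].abs() > 1` can replace the first condition. Note that `padj` is NaN in every row shown in the posted table, so a filter on `padj < 0.05` would return nothing for those rows.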

hadasvolk commented 4 months ago

Assuming this issue is resolved.