mortazavilab / PyWGCNA

PyWGCNA is a Python package designed to do Weighted Gene Correlation Network analysis (WGCNA)
https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad415/7218311
MIT License
192 stars 46 forks source link

varianceStabilizingTransformation and input data #80

Closed agolicz closed 6 months ago

agolicz commented 6 months ago

Hi, Thanks for the package! I have one question. From the documentation it is not entirely clear to me in which form counts should be passed to PyWGCNA.WGCNA. Ideally I would like to use varianceStabilizingTransformation of DESeq2.

Can I pass vst.csv from code below directly to PyWGCNA.WGCNA as geneExp matrix? From TPMcutoff paramater it looks like it expects TPM data? If that is the case what is your recommended process to get TPM from raw counts?

library(tximport)
library(rhdf5)
library(ggplot2)
library(DESeq2)

#Read data

files <- file.path("/vol/agcpgl/agolicz/annotation", "data", scan("list.full.helixer.txt", what=""), "abundance.h5")
names(files) <- scan("list.full.txt", what="")
tidm<-read.csv("ids.map", sep="\t", header=T)
txi <- tximport(files, type = "kallisto", tx2gene = tidm)
sampleTable <- read.csv("list.full.samples.txt", sep="\t", header=F)
rownames(sampleTable) <- colnames(txi$counts)
colnames(sampleTable)<-c("condition")
deseq <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)
keep <- rowSums(counts(deseq)) >= 10
deseq <- deseq[keep,]
deseq <- DESeq(deseq)

#Variance transform deseq

vsdB <- varianceStabilizingTransformation(deseq, blind = TRUE)
vsdB_table <- as.data.frame(assay(vsdB))
write.csv(vsdB_table, "vst.csv")
nargesr commented 6 months ago

Hi,

Thank you for your interest :)

quote from our paper: We recommend prior preprocessing and normalization of input gene or transcript expression data, including any necessary batch correction.

The data in the tutorial is the TPM matrix and within the package, we usually would like to remove any low-expressed genes across all samples, and TPMcutoff is the threshold to define low-expressed genes however it's up to the user to decide which input would work best so, in summary, you can pass any form of gene expression data as an input (csv, tsv or AnnData) and if you wish to not remove any low expressed genes you can put TPMcutoff to zero

agolicz commented 6 months ago

Ok, thanks. So beyond TPMcutoff there is no assumptions of data being TPM and I can pass anything so long I set TPMcutoff=0. Did I understand correctly?

nargesr commented 6 months ago

The only assumption is that you pass an expression data.
TPMcutoff can also considered as an expression cutoff and you can assign any value that suits your data