lmika / goseq

A command line tool to generate sequence diagrams
https://goseq.lmika.dev
MIT License
204 stars 26 forks source link

Inconsistent nullp and goseq enrichment results across operating systems #23

Open daniellembecker opened 1 week ago

daniellembecker commented 1 week ago

Hello,

We are trying to find the root issue of why we are having differing final output results from using the same datasets, parameters, functions, R (4.4.1), RStudio ("2024.04.2+764"), and goseq (1.56.0) package versions. Using the code below, I have tried to replicate the results with the exact same cloned repository on multiple computers and operating systems while checking that all steps are identical when running. However, with this code I am receiving 31 significantly enriched terms, while when someone else clones the repository or the code is used on different operating systems we see there are no significantly enriched terms. This is clearly a major issue and we have tried multiple steps to solve the issue as updating all versions to be the same, trying on multiple systems, cloning and recloning the repo, updating the repo and confirming it has been pushed to Github correctly but we have run out of troubleshooting options. Any ideas or help would be much appreciated!

Generate vector with names of all genes detected in our dataset

ALL.vector <- c(filtered_Pverr.annot$gene_id)

Generate length vector for all genes

LENGTH.vector <- as.integer(filtered_Pverr.annot$length)

Generate vector with names in just the contrast we are analyzing

ID.vector.down <- DEG.res %>% filter(direction=="Downregulated") %>% pull(gene_id)

Get a list of GO Terms for all genes detected in our dataset

GO.terms <- filtered_Pverr.annot%>%
  dplyr::select(gene_id, GO.ID)

Format to have one goterm per row with gene ID repeated

split <- strsplit(as.character(GO.terms$GO.ID), ";") 
split2 <- data.frame(v1 = rep.int(GO.terms$gene_id, sapply(split, length)), v2 = unlist(split)) 
colnames(split2) <- c("gene_id", "GO.ID")
GO.terms<-split2

Construct list of genes with 1 for genes in contrast and 0 for genes not in the contrast

gene.vector.down=as.integer(ALL.vector %in% ID.vector.down) 
names(gene.vector.down)<-ALL.vector#set names

weight gene vector by bias for length of gene: load in the gene.vector(a list of all genes with 1 indicating it is a DEG in the group of interest and 0 meaning it is not a DEG) and bias.data (list of lengths for all genes)

pwf.down<-nullp(gene.vector.down, bias.data=LENGTH.vector) 

data(genes)

pwf <- nullp(genes, 'hg19', 'ensGene')

run goseq using Wallenius method for all categories of GO terms: load in the pwf object (gene name, a 1 or 0 to indicate DEG, the length, and the pwf function; generated from pwf above), and the list of GO terms for all genes

GO.wall.down<-goseq(pwf.down, gene2cat=GO.terms, test.cats=c("GO:BP", "GO:MF", "GO:CC"), method="Wallenius", use_genes_without_cat=TRUE)

GO.down <- GO.wall.down[order(GO.wall.down$over_represented_pvalue),]
colnames(GO.down)[1] <- "GOterm"

#adjust p-values 
GO.down$bh_adjust <- p.adjust(GO.down$over_represented_pvalue, method="BH") #add adjusted p-values

#Filtering for p < 0.05
GO.down <- GO.down %>%
    dplyr::filter(bh_adjust<0.05) %>%
    dplyr::arrange(., ontology, bh_adjust)

Output of data from MacOS Monterey version 12.6.6:

Screen Shot 2024-06-21 at 2 44 44 PM Screen Shot 2024-06-21 at 2 51 22 PM

Output of data from macOS Sonoma 14.5:

Screenshot 2024-06-21 at 11 47 53 Screenshot 2024-06-21 at 11 42 10

We also both ran a test dataset: data(genes) pwf <- nullp(genes, ‘hg19’, ‘ensGene’)

Which gave us both this same output: Screenshot 2024-06-21 at 11 58 56

daniellembecker commented 4 days ago

@lmika

lmika commented 16 hours ago

Hi @daniellembecker , I'm not sure what this bug refers to. This doesn't look like something that goseq generally supports (it's a tool for generating sequence diagrams).

daniellembecker commented 53 minutes ago

Hi @lmika thank you for the response. The issue is inconsistency with the nullp and goseq functions in the goseq package. When used on different computers the package is not computing the same values using the same data, code, and functions from the goseq package