Closed 1Moe closed 7 months ago
Thank you for using DEP2 and providing feedback on this issue. The issue you reported is a very important one that we have overlooked. As you suspected, this problem is indeed caused by missing values in the test_diff
input, which also exists in the DEP package. Specifically, missing values (if all quantity of a proteingroup in an group/condition are missing) result in NAs in the test results, causing the rowData in the output to be disordered. Additionally, plot_volcano
uses rowname instead of the 'name' col in dep data as the feature identifier, which may also be causing the issue you raised. It also may effect other plot function because the identifiers disorder, like in the heatmap.
I have already resolved the bug causing the disordered rowData due to NA. You can check the latest commit (https://github.com/mildpiggy/DEP2/commit/63d2335) for details. You can try installing the new version of DEP2 and confirm it. Based on my testing, the volcano plot output from the following code is now consistent.
top15 <- plot_data[order(plot_data$W10_vs_PBS_p.val),][1:15,]
prots <- top15 %>% dplyr::select(name) %>% pull()
DEP2::plot_volcano(dep_pg, contrast = "W10_vs_PBS", chooseTolabel = prots,adjusted = F, label_number = 100,
add_threshold_line = "intersect", fcCutoff = 1, pCutoff = 1e-05)
EnhancedVolcano(plot_data, x = "W10_vs_PBS_diff",y = "W10_vs_PBS_p.val", lab = plot_data$name,
ylim = max(plot_data$W10_vs_PBS_p.val),
selectLab = prots, drawConnectors = T)
However, we do not recommend retaining missing values during the processing, and our design did not consider preserving NAs in performing test. Test-induced NAs may still cause other unknown issues. As a personal suggestion, I recommend using impute(fun="zero")
to replace NAs missing values (make_se
would transfrom 0 to NAs in PG.txt) or filtering out the proteingroups that are not quantified in all sample of any condition. If you still want to retain NAs and encounter any new problems, please feel free to provide feedback.
thanks for the response. I like to have the flexibility of not-imputing, I am not concerned about NA if the protein was not measured in one group.
How confident are you, that all the row-ordering and merging/joining has now been corrected? I am concerned that DEP, DEP2, LFQ Analyst and FragPipe-Analyst suffer from row-ordering issues.
https://github.com/MonashProteomics/FragPipe-Analyst/issues/94
We had replaced merge
function by left_jion
in test_diff
in my commit. merge
also was already changed in add_rejections
in the early DEP2. The rowdata order was correct in my test. Here, I paste a small function to confirmed the row order. It re-calculate the L2FC and match the existed result from test_diff
. If you still worry about the row order, use test_function
to check row order in "diff" or "dep" objects at any time.
# The function to check the row order by feature names and re-calculated L2FC.
test_function <- function(dep,contrasts = get_contrast(dep)){
rd = rowData(dep)
## Check names order
if(all(rd$name == rownames(dep))){
cat("Name order right. \n")
}else{
stop("Name order mistake!")
}
## Re-calculate L2FC
cd = colData(dep) %>% as.tibble()
mat = assay(dep)
exp_df = mat %>% as.data.frame() %>%
tibble::rownames_to_column(., var = "name") %>%
tidyr::gather(label,value,-name,na.rm = FALSE) %>% left_join(cd,by = "label")
gene_means <- exp_df %>%
# filter(condition %in% c("PBS", "W10")) %>%
group_by(name, condition) %>%
summarize(mean_value = mean(value,na.rm = T), .groups = "drop") %>%
tidyr::pivot_wider(
names_from = condition,
values_from = mean_value
) %>% as.data.frame()
## Compare L2FC
diff_match = contrasts %>% sapply(.,function(x){
conditions = strsplit(x,"_vs_")[[1]]
diff_equal = all.equal(rd[,paste0(x,"_diff")],
(gene_means[,conditions[1]] - gene_means[,conditions[2]]),
tolerance = 0.01)
if(!isTRUE(diff_equal)) stop("L2FC in ",x," may be disorder!")
return(diff_equal)
})
if(all(diff_match)){
cat("All L2FC is right.\n")
}
}
# Test workflow
data(Silicosis_pg)
data_unique <- make_unique(Silicosis_pg, "Gene.names", "Protein.IDs", delim = ";")
ecols <- grep("LFQ.", colnames(data_unique))
data(Silicosis_ExpDesign)
se <- make_se(data_unique, ecols, Silicosis_ExpDesign)
filt <- filter_se(se, thr = 0, fraction = 0.4, filter_formula = ~ Reverse != "+" & Potential.contaminant!="+")
norm <- normalize_vsn(filt)
diff_pg <- test_diff(norm, type = "control", control = "PBS", fdr.type = "BH")
dep_pg <- add_rejections(diff_pg, alpha = 0.01, lfc = 2)
# Test test_diff output
test_function(diff_pg)
# Test add_rejections output
test_function(dep_pg)
# Manually modified L2FC in test_diff.
rowData(diff_pg)$W4_vs_PBS_diff[1:2] = NA
test_function(diff_pg) # report error
Hi there,
I am using
DEP2 version 0.5.2.28
. It seems to me, that the volcano plots generated withDEP2::plot_volcano()
are plotting the correct points but labelling with incorrect labels. To my surprise, some points are labelled correctly while others are labelled incorrectly. I compared it with theEnhancedVolcano()
function. The two give different results in my case.My guess is, that this happens due to missing values (i.e. when you don't impute).
You can reproduce this error by running your example from the vignette and skipping point5 (imputation). Instead running the
diff_test()
on the normalized datanorm_pg
and then plotting the volcanos.Function
add_rejections
can classify significant hits according L2FC (lfc) and adjusted p value (alpha) threshold.Use the
plot_volcano
function to quickly visualize the results.