mildpiggy / DEP2

An r package for proteomics data Analysis, developed from DEP.
Other
18 stars 4 forks source link

Missing values lead to mis-labelling of points in DEP2::plot_volcano() #3

Closed 1Moe closed 7 months ago

1Moe commented 8 months ago

Hi there,

I am using DEP2 version 0.5.2.28. It seems to me, that the volcano plots generated with DEP2::plot_volcano() are plotting the correct points but labelling with incorrect labels. To my surprise, some points are labelled correctly while others are labelled incorrectly. I compared it with the EnhancedVolcano() function. The two give different results in my case.

My guess is, that this happens due to missing values (i.e. when you don't impute).

You can reproduce this error by running your example from the vignette and skipping point5 (imputation). Instead running the diff_test() on the normalized data norm_pg and then plotting the volcanos.

6. **Hypothesis testing**

The `test_diff` function performs a moderated t-test using limma. In DEP, p-values are corrected, using the fdrtool to classify significantly regulated/enriched candidates and stable proteins. DEP2 additionally provides two alternative frequently-used FDR control methods: "Benjamini-Hochberg fdr" and "Storey's qvalue".

```{r testdiff}
## Test every sample versus PBS control
diff_pg <- test_diff(norm_pg, type = "control", control = "PBS", fdr.type = "BH")
## Test on manual contrasts
diff_pg2 <- test_diff(impnorm_pg_pg, type = "manual", test  = c("W4_vs_PBS"), fdr.type = "Storey's qvalue")

Function add_rejections can classify significant hits according L2FC (lfc) and adjusted p value (alpha) threshold.

## Add significant rejections for features, based on 
dep_pg <- add_rejections(diff_pg, alpha = 0.01, lfc = 2)

## get the significant subset
dep_pg_sig <- get_signicant(dep_pg)
nrow(dep_pg_sig)

Use the plot_volcano function to quickly visualize the results.

### volcano plot on contrast "W10_vs_PBS"
plot1 <- DEP2::plot_volcano(dep_pg, contrast = "W10_vs_PBS", adjusted = F, label_number = 100)
print(plot1)

############# now compare to the EnhancedVolcano function

library(EnhancedVolcano)
plot_data <- rowData(dep_pg) %>% as.data.frame()
top100 <- plot_data[order(plot_data$W10_vs_PBS_p.val),][1:100,]
prots<- top100 %>% dplyr::select(name) %>% pull() #some proteins for labelling

plot2 <- EnhancedVolcano(plot_data, x = "W10_vs_PBS_diff",y = "W10_vs_PBS_p.val", lab = plot_data$name, selectLab = prots, drawConnectors = T)
print(plot2)
mildpiggy commented 8 months ago

Thank you for using DEP2 and providing feedback on this issue. The issue you reported is a very important one that we have overlooked. As you suspected, this problem is indeed caused by missing values in the test_diff input, which also exists in the DEP package. Specifically, missing values (if all quantity of a proteingroup in an group/condition are missing) result in NAs in the test results, causing the rowData in the output to be disordered. Additionally, plot_volcano uses rowname instead of the 'name' col in dep data as the feature identifier, which may also be causing the issue you raised. It also may effect other plot function because the identifiers disorder, like in the heatmap.

I have already resolved the bug causing the disordered rowData due to NA. You can check the latest commit (https://github.com/mildpiggy/DEP2/commit/63d2335) for details. You can try installing the new version of DEP2 and confirm it. Based on my testing, the volcano plot output from the following code is now consistent.

top15 <- plot_data[order(plot_data$W10_vs_PBS_p.val),][1:15,]
prots <- top15 %>% dplyr::select(name) %>% pull() 

DEP2::plot_volcano(dep_pg, contrast = "W10_vs_PBS", chooseTolabel = prots,adjusted = F, label_number = 100, 
                                add_threshold_line = "intersect", fcCutoff = 1, pCutoff = 1e-05)
EnhancedVolcano(plot_data, x = "W10_vs_PBS_diff",y = "W10_vs_PBS_p.val", lab = plot_data$name, 
                              ylim = max(plot_data$W10_vs_PBS_p.val),
                              selectLab = prots, drawConnectors = T)

However, we do not recommend retaining missing values during the processing, and our design did not consider preserving NAs in performing test. Test-induced NAs may still cause other unknown issues. As a personal suggestion, I recommend using impute(fun="zero") to replace NAs missing values (make_se would transfrom 0 to NAs in PG.txt) or filtering out the proteingroups that are not quantified in all sample of any condition. If you still want to retain NAs and encounter any new problems, please feel free to provide feedback.

1Moe commented 7 months ago

thanks for the response. I like to have the flexibility of not-imputing, I am not concerned about NA if the protein was not measured in one group.

How confident are you, that all the row-ordering and merging/joining has now been corrected? I am concerned that DEP, DEP2, LFQ Analyst and FragPipe-Analyst suffer from row-ordering issues.

https://github.com/MonashProteomics/FragPipe-Analyst/issues/94

mildpiggy commented 7 months ago

We had replaced merge function by left_jion in test_diff in my commit. merge also was already changed in add_rejections in the early DEP2. The rowdata order was correct in my test. Here, I paste a small function to confirmed the row order. It re-calculate the L2FC and match the existed result from test_diff. If you still worry about the row order, use test_function to check row order in "diff" or "dep" objects at any time.

# The function to check the row order by feature names and re-calculated L2FC.
test_function <- function(dep,contrasts = get_contrast(dep)){
  rd = rowData(dep)

  ## Check names order
  if(all(rd$name == rownames(dep))){
    cat("Name order right. \n")
  }else{
    stop("Name order mistake!")
  }

  ## Re-calculate L2FC
  cd = colData(dep) %>% as.tibble()

  mat = assay(dep)
  exp_df = mat %>% as.data.frame() %>% 
    tibble::rownames_to_column(., var = "name") %>% 
    tidyr::gather(label,value,-name,na.rm = FALSE) %>% left_join(cd,by = "label")

  gene_means <- exp_df %>%
    # filter(condition %in% c("PBS", "W10")) %>%
    group_by(name, condition) %>%
    summarize(mean_value = mean(value,na.rm = T), .groups = "drop") %>%
    tidyr::pivot_wider(
      names_from = condition,
      values_from = mean_value
    ) %>% as.data.frame()

  ## Compare L2FC
  diff_match = contrasts %>% sapply(.,function(x){
    conditions = strsplit(x,"_vs_")[[1]]
    diff_equal = all.equal(rd[,paste0(x,"_diff")], 
                           (gene_means[,conditions[1]] - gene_means[,conditions[2]]),
                           tolerance = 0.01)

    if(!isTRUE(diff_equal)) stop("L2FC in ",x," may be disorder!")
    return(diff_equal)
  })

  if(all(diff_match)){
    cat("All L2FC is right.\n")
  }
}

# Test workflow
data(Silicosis_pg)
data_unique <- make_unique(Silicosis_pg, "Gene.names", "Protein.IDs", delim = ";")
ecols <- grep("LFQ.", colnames(data_unique))

data(Silicosis_ExpDesign)
se <- make_se(data_unique, ecols, Silicosis_ExpDesign)

filt <- filter_se(se, thr = 0, fraction = 0.4, filter_formula = ~ Reverse != "+" & Potential.contaminant!="+")
norm <- normalize_vsn(filt)
diff_pg <- test_diff(norm, type = "control", control = "PBS", fdr.type = "BH")
dep_pg <- add_rejections(diff_pg, alpha = 0.01, lfc = 2)

# Test test_diff output
test_function(diff_pg)
# Test add_rejections output
test_function(dep_pg)

# Manually modified L2FC in test_diff.
rowData(diff_pg)$W4_vs_PBS_diff[1:2] = NA
test_function(diff_pg)  # report error