mildpiggy / DEP2

An R package for proteomics data analysis, developed from DEP.

Using DEP2 without meta-data #4

Closed by cathalgking 2 months ago

cathalgking commented 5 months ago

I am working with data that does not have condition or replicate information for each sample, so I do not have the experimental design file. The dataset contains 6 plates with 78 patient samples per plate (blinded to disease state), and on each plate there are 2 pooled biological QC samples and 3 samples to control for sample variability. Can batch effects be detected or controlled for here with DEP2, either across the 6 plates or within plates? Thanks

mildpiggy commented 5 months ago

Hi. DEP2 cannot handle large-scale samples from different arrays, such as the different plates in your case. You can use QC functions like plot_umap/Tsne/dist/normalization to check the deviation within a matrix, but DEP2 does not provide functions to correct batch effects or to directly merge quantities across arrays.
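As an illustration of what "checking the deviation within a matrix" can look like outside DEP2, here is a minimal base-R sketch that runs a PCA on simulated log2 intensities and checks whether the first component separates samples by plate. The plate names, matrix size and effect sizes are all made-up assumptions, and this is not DEP2's own QC code.

```r
# Simulated log2 intensities: 200 proteins x 12 samples on two hypothetical plates.
set.seed(1)
n_prot <- 200
plate <- rep(c("P1", "P2"), each = 6)
base_level <- rnorm(n_prot, mean = 15, sd = 2)   # per-protein abundance
plate_shift <- rnorm(n_prot, mean = 0, sd = 2)   # protein-specific plate effect (assumed)

# One column per sample; samples on plate P2 receive the plate shift.
mat <- sapply(plate, function(p) {
  base_level + (p == "P2") * plate_shift + rnorm(n_prot, sd = 0.3)
})
colnames(mat) <- paste0(plate, "_", rep(1:6, times = 2))

# PCA on samples: transpose so that rows are samples, centre each protein.
pca <- prcomp(t(mat), center = TRUE, scale. = FALSE)
pc1 <- pca$x[, "PC1"]

# If PC1 scores split cleanly by plate, a plate-level batch effect is likely.
print(split(round(pc1, 1), plate))
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained[1:3], 2))
```

On real data the same check would use the measured intensity matrix and the known plate assignment in place of the simulated ones.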

cathalgking commented 5 months ago

OK, thanks @mildpiggy. We might actually be getting more patient meta-data soon that we can make an experimental design table with. So for each of the 78 samples we would then know the condition and replicate. Would this then allow us to get more use from DEP2? If so, what would be the best functions to apply to this data?

mildpiggy commented 4 months ago

If you are concerned with the group differences within the 78 samples, you can directly run make_se with your exp design. I lack further information about your experiment, but you can follow the DEA workflow in the vignettes and perform some QC. In large-scale proteomics, correcting batch effects and merging quantities among plates should be the key step (just in my opinion). You may also consider using other tools like statTarget (https://stattarget.github.io/). Afterward, you can perform a test using DEP2 on the integrated matrix. BTW, although the advanced_contrast in test_diff can provide a slightly more flexible contrast design, DEP2 still lacks the ability to handle experiments with multi-factor designs.

cathalgking commented 4 months ago

Thanks for your reply @mildpiggy. I have been trying to make a SummarizedExperiment with make_se() but am running into an error.

Error in make_se(data_unique, sample_columns, expdesign = experimental_design_p1) : 
  Labels of the experimental design do not match with column names in 'proteins_unique'
Run make_se() with the correct labels in the experimental designand/or correct columns specification

My experiment design and data frame columns are shown below. Can you advise on what I need to do to construct this SE?

[Screenshot 2024-04-18 at 9 16 34 PM]

I only want to use the "QC" or "RC" columns in my data frame for this initial analysis:

[Screenshot 2024-04-18 at 9 20 41 PM]

mildpiggy commented 4 months ago

It appears that there are errors in your label column. The labels should correspond one-to-one with the expression columns in your unique_data. In your first screenshot, I noticed that 'RCV01' is duplicated; should the labels instead range from 'RCV01' to 'RCV08' (based on the expression colnames in your unique table)?
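A quick way to find such mismatches before calling make_se() is to compare the label column against the expression column names directly. The sketch below uses base R only; the column names and the design table are hypothetical stand-ins for the ones in the screenshots, with 'RCV01' duplicated on purpose.

```r
# Hypothetical expression column names (assumed, not taken from the real data).
expr_cols <- c(paste0("QC0", 1:2), paste0("RC0", 1:3), paste0("RCV0", 1:8))

# Hypothetical design with 'RCV01' entered twice and 'RCV02' missing.
design <- data.frame(
  label = c("QC01", "QC02", "RC01", "RC02", "RC03",
            "RCV01", "RCV01", "RCV03", "RCV04", "RCV05",
            "RCV06", "RCV07", "RCV08"),
  stringsAsFactors = FALSE
)

# Labels that occur more than once in the design.
dup_labels <- unique(design$label[duplicated(design$label)])

# Expression columns with no matching label, and labels with no matching column.
missing_in_design <- setdiff(expr_cols, design$label)
missing_in_data <- setdiff(design$label, expr_cols)

print(dup_labels)
print(missing_in_design)
print(missing_in_data)
```

All three results should be empty before the design is passed on; here the first two flag the duplicated and the missing label.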

cathalgking commented 4 months ago

Hi @mildpiggy That worked for me once I corrected the label column and the relevant columns in the data frame. I now am trying to run test_diff and am running into an error, maybe similar to the above. When I try:

diff_pg <- test_diff(imp_pg, type = "control", control = "QC1", fdr.type = "BH")

I get the error:

Tested contrasts: QC2_vs_QC1, RC1_vs_QC1, RC2_vs_QC1, RC3_vs_QC1, RCV1_vs_QC1, RCV2_vs_QC1, RCV3_vs_QC1, RCV4_vs_QC1, RCV5_vs_QC1, RCV6_vs_QC1, RCV7_vs_QC1, RCV8_vs_QC1
Error in .ebayes(fit = fit, proportion = proportion, stdev.coef.lim = stdev.coef.lim,  : 
  No finite residual standard deviations

When I try:

diff_pg2 <- test_diff(imp_pg, type = "manual", test  = c("RCV2_vs_QC1"), fdr.type = "BH")

I get the error:

Tested contrasts: RCV2_vs_QC1
Error in .ebayes(fit = fit, proportion = proportion, stdev.coef.lim = stdev.coef.lim,  : 
  No finite residual standard deviations

I have tried multiple variations for test. Can you advise on how to set the parameters here?

mildpiggy commented 4 months ago

@cathalgking This is because there is no replication: each condition in the contrast contains only one sample. I would like to clarify whether all of your conditions are distinct. Is QC1 different from QC2? Are RC1, RC2, and RC3 distinct from one another? If they are repeated measurements within the same group, it would be more appropriate to label them as QC, RC, and RCV, instead of QC1, QC2, RC1, and RC2. The repeated samples should have the same designated "condition".
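The effect of the condition labels can be seen by counting residual degrees of freedom in the corresponding linear model. This base-R sketch is only an illustration of the general principle (it does not call DEP2 or limma): with one sample per condition nothing is left over to estimate a variance from, which is consistent with the "No finite residual standard deviations" error.

```r
# Thirteen samples, each treated as its own condition (as in the failing call).
distinct <- c("QC1", "QC2", "RC1", "RC2", "RC3", paste0("RCV", 1:8))

# The same samples grouped into three conditions with replicates.
grouped <- c(rep("QC", 2), rep("RC", 3), rep("RCV", 8))

# Residual degrees of freedom = number of samples - rank of the design matrix.
residual_df <- function(condition) {
  design <- model.matrix(~ 0 + factor(condition))
  nrow(design) - qr(design)$rank
}

print(residual_df(distinct))  # 0: no replication, variance cannot be estimated
print(residual_df(grouped))   # 10: 13 samples minus 3 group means
```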

cathalgking commented 4 months ago

@mildpiggy QC1 and QC2 are the same in the same way that RC1 is the same as RC2 etc. I tried changing the names of the columns to be QC and QC for example but now when I try to run make_unique, I get an error saying that there cannot be duplicate names.

Error in `dplyr::mutate()`:
! Can't transform a data frame with duplicate names.
Run `rlang::last_trace()` to see where the error occurred.

cathalgking commented 4 months ago

So the colnames in my data frame are: [Screenshot 2024-04-21 at 9 22 59 PM] and my experimental design file is: [Screenshot 2024-04-21 at 9 22 46 PM]

mildpiggy commented 4 months ago

@cathalgking Here the question you initially raised comes up once again: the labels and colnames need to correspond to each other one-to-one and must not be duplicated. The following is an example dataset and code for you to study the requirements of sample naming and experimental design.

library(DEP2)
library(fdrtool)
## Expression colnames must be unique
ecols = c(paste0("QC_",1:2), paste0("RC_",1:3), paste0("RCV_",1:8))

## The standard form of the experimental design. get_exdesign_parse can build an exp design table directly.
exp_design = DEP2::get_exdesign_parse(ecols, mode = "delim", sep = "_")

## Random example data: 24 proteins x 13 samples
mat = as.data.frame(matrix(sample(10:20, 13*24, replace = TRUE), ncol = 13))
colnames(mat) = ecols
unique_data = cbind(mat, name = letters[1:24], ID = LETTERS[1:24])
se = make_se(unique_data, columns = ecols, expdesign = exp_design)

## Just as an example, filtering, imputation and normalization are skipped here.
diff = test_diff(se, type = "control", control = "QC", fdr.type = "BH")
dep = add_rejections(diff = diff)

## Notice that labels and colnames must be the same and unique.
colnames(unique_data)
SummarizedExperiment::colData(dep)
exp_design

## If there is no delimiter ("_") in ecols and the replicate number occupies only one digit.
exp_design = DEP2::get_exdesign_parse(gsub("_","",ecols), mode = "char", chars = 1)


The expdesign should be: [image] or (depending on which pattern the expression colnames follow) [image]
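A design table of this kind can also be written out by hand. The following base-R sketch assumes the table needs the three columns label, condition and replicate, and that the expression columns follow the "condition_replicate" pattern; it is an illustration rather than the output of get_exdesign_parse.

```r
# Expression column names following the "condition_replicate" pattern.
ecols <- c(paste0("QC_", 1:2), paste0("RC_", 1:3), paste0("RCV_", 1:8))

# label: unique, identical to the expression colnames.
# condition: shared by all replicates of a group.
# replicate: running number within each condition.
exp_design_manual <- data.frame(
  label = ecols,
  condition = sub("_[0-9]+$", "", ecols),
  replicate = as.integer(sub("^.*_", "", ecols)),
  stringsAsFactors = FALSE
)

print(exp_design_manual)
```

The point to take from the table is that label stays unique per sample while condition repeats across replicates.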

cathalgking commented 4 months ago

Hi @mildpiggy That example worked. Thanks for that. Our data includes QC samples, as you saw above, and also patient samples, all in the one table. I have run through some very useful analyses of the QC samples after successfully reading them into an SE.

Now I would like to construct another SE for the patient samples, this time with a different exp design file. The label column matches up with the expression columns. However, when I try to make the SE I get an error:

data_se_box5 <- make_se(proteins_unique = unique_pg,
                   sample_columns,
                   expdesign = patient_identifier)

Error in `.rowNamesDF<-`(x, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘CFC_0’, ‘MGUS_1’, ‘MM_2’ 

I have tried various things with the data but cannot seem to get past this. Can you advise what to try here? Thanks again.

mildpiggy commented 4 months ago

@cathalgking According to the error, it may be due to duplicated expression colnames. Can you check the expression columns and the exp design table? If that is not the cause, could you give more information about your sample design?
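One thing worth checking in the exp design table: the values in the warning ('CFC_0', 'MGUS_1', 'MM_2') look like condition and replicate joined by an underscore. Assuming that make_se builds its sample IDs this way (as the original DEP does; this is an assumption about DEP2, not something confirmed in this thread), any repeated condition/replicate pair would produce duplicate row names. A base-R sketch with a made-up patient design:

```r
# Hypothetical patient design: replicate holds a group code, so pairs repeat.
patient_design <- data.frame(
  label = c("S01", "S02", "S03", "S04", "S05", "S06"),
  condition = c("CFC", "CFC", "MGUS", "MGUS", "MM", "MM"),
  replicate = c(0, 0, 1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# IDs as they would look if built from condition and replicate (assumed scheme).
sample_id <- paste(patient_design$condition, patient_design$replicate, sep = "_")
dup_ids <- unique(sample_id[duplicated(sample_id)])
print(dup_ids)

# One possible fix: number the replicates 1..n within each condition.
patient_design$replicate <- ave(
  seq_along(patient_design$condition),
  patient_design$condition,
  FUN = seq_along
)
fixed_id <- paste(patient_design$condition, patient_design$replicate, sep = "_")
print(fixed_id)
```

If the real design shows the same pattern, renumbering the replicates within each condition should make the IDs unique.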