jeffsocal / tidyproteomics

An S3 data object and framework for common quantitative proteomic analyses
https://jeffsocal.github.io/tidyproteomics/
MIT License

Handling blank/Missing Values in Protein Abundance using Tidyproteomics #22

Open priyatamapandey opened 3 days ago

priyatamapandey commented 3 days ago

Hi, I have been using Tidyproteomics for processing one of my proteomics datasets and encountered a few questions that I couldn’t find clear answers to in the documentation. Specifically, in the protein abundance file (Excel), I noticed that for some proteins, there are no values in the cells for any of the samples, resulting in blank cells.

When I perform normalization using limma and convert the data to a wider format (where rows represent proteins and columns represent samples), I see that some proteins have entire rows filled with NA values (indicating no abundance for that protein across all samples). In other cases, some of the samples for certain proteins also have NA values.

My question is whether I should replace all the NA values with zeros, and remove proteins where no abundance is detected in any of the samples. Is this the approach that Tidyproteomics uses for handling such datasets? Also, I noticed that the plot_venn function in Tidyproteomics seems to include proteins with no abundance for any of the samples. Could you clarify how to best handle these cases?

I would appreciate any guidance or suggestions!

Best, Priya

jeffsocal commented 3 days ago

Regarding "..(Excel), I noticed that for some proteins, there are no values in the cells for any of the samples, resulting in blank cells..", this typically indicates that one or more MS2 spectra identified the presence of a protein, yet the analysis platform was unable to quantify it. This can happen if an MS1 feature was not found within the LC and MZ tolerances, or if the TMT tags were below detection thresholds. Tidyproteomics preserves the NAs in order to correctly account for all identified proteins, so when exporting you get a row of NAs. Tidyproteomics can handle values that are missing for some samples but not all (see Missing Data), and it can impute values (see Imputation) for downstream analyses. I do not recommend altering the original analysis file by hand, as there is no record of what happened. Instead, you can exclude missing values (by keeping only values that were not imputed) in Tidyproteomics with subset(imputed == 0), and there will be a record of it in the data object and in your code.

I hope that helps. I appreciate your questions.

priyatamapandey commented 3 days ago

Thank you for your quick response and guidance regarding the issue I am facing with some proteins having no values across all samples. I have a few additional questions about handling missing values in my dataset:

  1. After using subset(imputed == 0), I see that the reported missing values increased from 28.2% to 29.5%. Does this count values missing for some of the samples, proteins missing in all samples, or both?

After running subset(imputed == 0), I used the prot_norm1$quantitative table to keep the normalized data. I plan to use this normalized data for integration and other downstream analyses, so I am wondering how subset(imputed == 0) affects it. When I format the normalized data as protein by sample (up to pivot_wider), the total number of observations remains the same, 11522. When I then remove rows where all columns are NA, 16.73% of proteins are removed because they have missing values for all samples (the final filter command in the code below).

inner_join(prot_norm1$quantitative) %>%
  mutate(abundance_limma = log10(abundance_limma)) %>%
  dplyr::select(sampleID, protein, abundance_limma) %>%
  pivot_wider(names_from = sampleID, values_from = abundance_limma) %>%
  dplyr::filter(!apply(.[,-1], 1, function(x) all(is.na(x))))

Would it be fair to replace NAs with 0 when values are missing for only some samples in my final normalized data frame (protein by sample)?

  2. To clarify, what does subset(imputed == 0) do? Does it filter out an entire protein when values are missing for all samples (for example, replacing with 0 where all samples have a missing value), and do nothing for proteins with partially missing samples?

  3. After differential expression, the output contained only 71% of the proteins. It removed 28.75% of proteins, which includes the 16.73% that are missing in all samples plus ~12% (where I am guessing one group has missing values). The differential expression t-test is probably excluding these.

I would appreciate it if you could comment and provide your insights on these.

Thank you so much for all your help, Priya

jeffsocal commented 2 days ago

RE: 1) and 2) ... subset(imputed == 0) is a filter-in process, similar to dplyr::filter(). It keeps only protein values that are complete (not imputed). This accounts for all imputation methods, whether mathematical or match-between-runs, and removes imputed values only on a per protein-sample basis. When importing at the protein level, most analytical platforms indicate 0 or 1 for imputation; however, if protein abundances were generated from the peptide level in Tidyproteomics, a fractional imputation value would be shown. I would not suggest replacing NAs with zeros, as that implies a measured absence: quantitative analyses would assume the value is 0 and use it in the analysis, drastically skewing the average. The code snippet you provided in 1) could be simplified to prot_norm %>% as.data.frame(shape = "long"). As long as limma is the selected normalization, those abundance values will be exported.
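
To illustrate the point about zeros skewing the average, here is a minimal sketch in base R. The abundance values are made up for illustration and are not taken from the dataset in this issue.

```r
# Hypothetical log10 abundances for one protein across six samples;
# two samples were identified but not quantified (NA).
abundance <- c(6.1, 6.3, NA, 6.2, NA, 6.0)

# Keep the NAs and skip them: mean of the four measured values.
mean_measured <- mean(abundance, na.rm = TRUE)

# Replace NA with 0: "not quantified" is now treated as "measured at zero".
abundance_zero <- ifelse(is.na(abundance), 0, abundance)
mean_zero_filled <- mean(abundance_zero)

print(c(measured = mean_measured, zero_filled = mean_zero_filled))
```

With these numbers the mean drops from 6.15 to 4.10 once the two NAs are replaced with zeros, even though nothing was actually measured at zero.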

3) It sounds like your assumptions are correct.
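
One way to check a breakdown like the one described in 3) is to tally missingness patterns on the wide protein-by-sample table. The sketch below uses base R with a made-up table; the sample names, the two-group layout, and the pattern labels are assumptions for illustration, and it does not reproduce what the Tidyproteomics differential expression step does internally.

```r
# Hypothetical wide table: rows are proteins, columns are samples.
# Samples a1, a2 are assumed to belong to group A; b1, b2 to group B.
wide <- data.frame(
  protein = c("P1", "P2", "P3", "P4"),
  a1 = c(6.1, NA, NA, 5.9),
  a2 = c(6.0, NA, NA, NA),
  b1 = c(6.3, NA, 5.5, 6.1),
  b2 = c(6.2, NA, 5.7, 6.0)
)
group_a <- c("a1", "a2")
group_b <- c("b1", "b2")

# Count measured (non-NA) values per protein within each group.
n_a <- rowSums(!is.na(wide[, group_a]))
n_b <- rowSums(!is.na(wide[, group_b]))

# Classify each protein by its missingness pattern.
pattern <- ifelse(n_a == 0 & n_b == 0, "missing_in_all_samples",
           ifelse(n_a == 0 | n_b == 0, "missing_in_one_group",
                  "measured_in_both_groups"))

# Percentage of proteins in each pattern.
print(round(100 * prop.table(table(pattern)), 2))
```

Comparing these percentages with the share of proteins dropped from the differential expression output should show whether the extra ~12% corresponds to proteins with no values in one group.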