import2Dataset() crashes when providing replicated measurements

tobiasko commented 1 year ago

Dear @nkurzaw,

when I use the package function import2Dataset() in combination with replicated measurements (each temperature * drug combination was measured twice) it crashes. The function call is

import_df <- import2dDataset(configTable = tt,
                             data = ll,
                             idVar = "ProteinID",
                             intensityStr = "INT_",
                             fcStr = "rel_fc_",
                             nonZeroCols = "Total Spectral Count",
                             geneNameVar = "Entry Name",
                             addCol = "Gene",
                             qualColName = "Total Peptides",
                             naStrs = c("NA")
)

The config table looks like this:

> tt
# A tibble: 20 × 14
   Compound Experiment  Temperature `126` `127N` `127C` `128N` `128C` `129N` `129C` `130N` `130C` `131N` RefCol
   <chr>    <chr>             <dbl> <chr> <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
 1 drug     rep1_red           41.9 5     1      0.143  0.02   0      -      -      -      -      -      128C  
 2 drug     rep1_red           43.6 -     -      -      -      -      5      1      0.143  0.02   0      131N  
 3 drug     rep1_orange        45.7 5     1      0.143  0.02   0      -      -      -      -      -      128C  
 4 drug     rep1_orange        48.7 -     -      -      -      -      5      1      0.143  0.02   0      131N  
 5 drug     rep1_yellow        51.5 5     1      0.143  0.02   0      -      -      -      -      -      128C  
 6 drug     rep1_yellow        54.2 -     -      -      -      -      5      1      0.143  0.02   0      131N  
 7 drug     rep1_green         57   5     1      0.143  0.02   0      -      -      -      -      -      128C  
 8 drug     rep1_green         59.9 -     -      -      -      -      5      1      0.143  0.02   0      131N  
 9 drug     rep1_blue          62.1 5     1      0.143  0.02   0      -      -      -      -      -      128C  
10 drug     rep1_blue          64   -     -      -      -      -      5      1      0.143  0.02   0      131N  
11 drug     rep2_red           41.9 5     1      0.143  0.02   0      -      -      -      -      -      128C  
12 drug     rep2_red           43.6 -     -      -      -      -      5      1      0.143  0.02   0      131N  
13 drug     rep2_orange        45.7 5     1      0.143  0.02   0      -      -      -      -      -      128C  
14 drug     rep2_orange        48.7 -     -      -      -      -      5      1      0.143  0.02   0      131N  
15 drug     rep2_yellow        51.5 5     1      0.143  0.02   0      -      -      -      -      -      128C  
16 drug     rep2_yellow        54.2 -     -      -      -      -      5      1      0.143  0.02   0      131N  
17 drug     rep2_green         57   5     1      0.143  0.02   0      -      -      -      -      -      128C  
18 drug     rep2_green         59.9 -     -      -      -      -      5      1      0.143  0.02   0      131N  
19 drug     rep2_blue          62.1 5     1      0.143  0.02   0      -      -      -      -      -      128C  
20 drug     rep2_blue          64   -     -      -      -      -      5      1      0.143  0.02   0      131N

The data looks like:

> ll
$rep1_red
# A tibble: 6,618 × 43
   Protein                   ProteinID  `Entry Name` Gene   Length Organism `Protein Description` `Protein Existence` Coverage
   <chr>                     <chr>      <chr>        <chr>   <dbl> <chr>    <chr>                 <chr>                  <dbl>
 1 sp|A0A0B4J2D5|GAL3B_HUMAN A0A0B4J2D5 GAL3B_HUMAN  GATD3B    268 Homo sa… Putative glutamine a… 5:Protein uncertain    35.4 
 2 sp|A0A0U1RRE5|NBDY_HUMAN  A0A0U1RRE5 NBDY_HUMAN   NBDY       68 Homo sa… Negative regulator o… 1:Experimental evi…    77.9 
 3 sp|A0AVT1|UBA6_HUMAN      A0AVT1     UBA6_HUMAN   UBA6     1052 Homo sa… Ubiquitin-like modif… 1:Experimental evi…    25.9 
 4 sp|A0JNW5|BLT3B_HUMAN     A0JNW5     BLT3B_HUMAN  BLTP3B   1464 Homo sa… Bridge-like lipid tr… 1:Experimental evi…    11.5 
 5 sp|A0MZ66|SHOT1_HUMAN     A0MZ66     SHOT1_HUMAN  SHTN1     631 Homo sa… Shootin-1             1:Experimental evi…    55.5 
 6 sp|A0PJW6|TM223_HUMAN     A0PJW6     TM223_HUMAN  TMEM2…    202 Homo sa… Transmembrane protei… 1:Experimental evi…    13.4 
 7 sp|A1A5C7|S22AN_HUMAN     A1A5C7     S22AN_HUMAN  SLC22…    686 Homo sa… Solute carrier famil… 1:Experimental evi…     2.48
 8 sp|A1L0T0|HACL2_HUMAN     A1L0T0     HACL2_HUMAN  ILVBL     632 Homo sa… 2-hydroxyacyl-CoA ly… 1:Experimental evi…    12.8 
 9 sp|A1X283|SPD2B_HUMAN     A1X283     SPD2B_HUMAN  SH3PX…    911 Homo sa… SH3 and PX domain-co… 1:Experimental evi…    48.1 
10 sp|A2RRP1|NBAS_HUMAN      A2RRP1     NBAS_HUMAN   NBAS     2371 Homo sa… NBAS subunit of NRZ … 1:Experimental evi…     3.58
# ℹ 6,608 more rows
# ℹ 34 more variables: `Protein Probability` <dbl>, `Top Peptide Probability` <dbl>, `Total Peptides` <dbl>,
#   `Unique Peptides` <dbl>, `Razor Peptides` <dbl>, `Total Spectral Count` <dbl>, `Unique Spectral Count` <dbl>,
#   `Razor Spectral Count` <dbl>, `Total Intensity` <dbl>, `Unique Intensity` <dbl>, `Razor Intensity` <dbl>,
#   `Razor Assigned Modifications` <chr>, `Razor Observed Modifications` <lgl>, `Indistinguishable Proteins` <chr>,
#   INT_126 <dbl>, INT_127N <dbl>, INT_127C <dbl>, INT_128N <dbl>, INT_128C <dbl>, INT_129N <dbl>, INT_129C <dbl>,
#   INT_130N <dbl>, INT_130C <dbl>, INT_131N <dbl>, rel_fc_126 <dbl>, rel_fc_127N <dbl>, rel_fc_127C <dbl>, …
# ℹ Use `print(n = ...)` to see more rows

$rep1_orange
# A tibble: 6,545 × 43
   Protein                   ProteinID  `Entry Name` Gene   Length Organism `Protein Description` `Protein Existence` Coverage
   <chr>                     <chr>      <chr>        <chr>   <dbl> <chr>    <chr>                 <chr>                  <dbl>
 1 sp|A0A0B4J2D5|GAL3B_HUMAN A0A0B4J2D5 GAL3B_HUMAN  GATD3B    268 Homo sa… Putative glutamine a… 5:Protein uncertain    28.7 
 2 sp|A0A0U1RRE5|NBDY_HUMAN  A0A0U1RRE5 NBDY_HUMAN   NBDY       68 Homo sa… Negative regulator o… 1:Experimental evi…    77.9 
 3 sp|A0AVT1|UBA6_HUMAN      A0AVT1     UBA6_HUMAN   UBA6     1052 Homo sa… Ubiquitin-like modif… 1:Experimental evi…    22.9 
 4 sp|A0JNW5|BLT3B_HUMAN     A0JNW5     BLT3B_HUMAN  BLTP3B   1464 Homo sa… Bridge-like lipid tr… 1:Experimental evi…    12.4 
 5 sp|A0MZ66|SHOT1_HUMAN     A0MZ66     SHOT1_HUMAN  SHTN1     631 Homo sa… Shootin-1             1:Experimental evi…    50.4 
 6 sp|A0PJW6|TM223_HUMAN     A0PJW6     TM223_HUMAN  TMEM2…    202 Homo sa… Transmembrane protei… 1:Experimental evi…    22.3 
 7 sp|A1A5C7|S22AN_HUMAN     A1A5C7     S22AN_HUMAN  SLC22…    686 Homo sa… Solute carrier famil… 1:Experimental evi…     2.19
 8 sp|A1L0T0|HACL2_HUMAN     A1L0T0     HACL2_HUMAN  ILVBL     632 Homo sa… 2-hydroxyacyl-CoA ly… 1:Experimental evi…    14.1 
 9 sp|A1X283|SPD2B_HUMAN     A1X283     SPD2B_HUMAN  SH3PX…    911 Homo sa… SH3 and PX domain-co… 1:Experimental evi…    44.5 
10 sp|A2RRP1|NBAS_HUMAN      A2RRP1     NBAS_HUMAN   NBAS     2371 Homo sa… NBAS subunit of NRZ … 1:Experimental evi…     1.14
# ℹ 6,535 more rows
# ℹ 34 more variables: `Protein Probability` <dbl>, `Top Peptide Probability` <dbl>, `Total Peptides` <dbl>,
#   `Unique Peptides` <dbl>, `Razor Peptides` <dbl>, `Total Spectral Count` <dbl>, `Unique Spectral Count` <dbl>,
#   `Razor Spectral Count` <dbl>, `Total Intensity` <dbl>, `Unique Intensity` <dbl>, `Razor Intensity` <dbl>,
#   `Razor Assigned Modifications` <chr>, `Razor Observed Modifications` <lgl>, `Indistinguishable Proteins` <chr>,
#   INT_126 <dbl>, INT_127N <dbl>, INT_127C <dbl>, INT_128N <dbl>, INT_128C <dbl>, INT_129N <dbl>, INT_129C <dbl>,
#   INT_130N <dbl>, INT_130C <dbl>, INT_131N <dbl>, rel_fc_126 <dbl>, rel_fc_127N <dbl>, rel_fc_127C <dbl>, …
# ℹ Use `print(n = ...)` to see more rows
...

Importing only rep1 or rep2 works without any problem! Am I doing something wrong?

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] TPP2D_1.10.0 dplyr_1.1.2  dbplyr_2.3.2 tibble_3.2.1 readr_2.1.4 

loaded via a namespace (and not attached):
 [1] zip_2.3.0           Rcpp_1.0.10         pillar_1.9.0        compiler_4.1.2      iterators_1.0.14    bitops_1.0-7       
 [7] tools_4.1.2         bit_4.0.5           lifecycle_1.0.3     gtable_0.3.3        lattice_0.21-8      pkgconfig_2.0.3    
[13] rlang_1.1.1         openxlsx_4.2.5.2    foreach_1.5.2       DBI_1.1.3           cli_3.6.1           rstudioapi_0.14    
[19] parallel_4.1.2      stringr_1.5.0       withr_2.5.0         generics_0.1.3      vctrs_0.6.2         hms_1.1.3          
[25] bit64_4.0.5         grid_4.1.2          tidyselect_1.2.0    glue_1.6.2          R6_2.5.1            fansi_1.0.4        
[31] BiocParallel_1.28.3 vroom_1.6.3         limma_3.50.3        purrr_1.0.1         tidyr_1.3.0         tzdb_0.4.0         
[37] ggplot2_3.4.2       magrittr_2.0.3      codetools_0.2-19    scales_1.2.1        MASS_7.3-60         colorspace_2.1-0   
[43] utf8_1.2.3          stringi_1.7.12      doParallel_1.0.17   RCurl_1.98-1.12     munsell_0.5.0       crayon_1.5.2       
>

nkurzaw commented 1 year ago

Dear @tobiasko, as mentioned on the bioconductor support channel, the current implementation of the import function cannot handle replicates. Since from you email I understood that you can analyse each replicate separately without problems, there is a way to integrate TPP2D results obtained from two different replicates. It is a stringent approach that considers the lower F-statistic for each protein in each of the replicates under the constraint that the direction of the effect stabilisation/destabilisation is the same. Here is example code for integrating two replicates:

vem_r1_params_df <- getModelParamsDf(vem_r1_df, maxit = 500)
vem_r1_null_df <- bootstrapNullAlternativeModel(
   df = vem_r1_df, params_df = vem_r1_params_df, 
   maxit = 500, B = 100,
   BPPARAM = BiocParallel::MulticoreParam(workers = 10, progressbar = TRUE),
   verbose = FALSE)

vem_r1_fstat_df <- computeFStatFromParams(vem_r1_params_df)
vem_r1_fdr_df <- getFDR(vem_r1_fstat_df,
                        vem_r1_null_df)

vem_r2_params_df <- getModelParamsDf(vem_r2_df, maxit = 500)
vem_r2_null_df <- bootstrapNullAlternativeModel(
   df = vem_r2_df, params_df = vem_r2_params_df, 
   maxit = 500, B = 100,
   BPPARAM = BiocParallel::MulticoreParam(workers = 10, progressbar = TRUE),
   verbose = FALSE)

vem_r2_fstat_df <- computeFStatFromParams(vem_r2_params_df)
vem_r2_fdr_df <- getFDR(vem_r2_fstat_df,
                        vem_r2_null_df)

vem_fdr_combo_df <- 
  bind_rows(vem_r1_fdr_df,
            vem_r2_fdr_df) %>% 
  group_by(clustername, dataset) %>% 
  filter(n() > 1,  length(unique(sign(slopeH1))) == 1,
         F_statistic == min(F_statistic)) %>%
  ungroup %>% 
  group_by(nObsRound) %>%
  arrange(desc(F_statistic)) %>%
  mutate(max_rank = n(),
         rank = dense_rank(dplyr::desc(F_statistic)),
         is_decoy = ifelse(dataset != "true", 1, 0)) %>%
  mutate(all_null = sum(is_decoy),
         all_true = sum(!is_decoy),
         true_cumsum = cumsum(!is_decoy),
         null_cumsum = cumsum(is_decoy)) %>% 
  mutate(pi = (all_true-true_cumsum)/((all_null-null_cumsum)/B)) %>% 
  mutate(FDR = pi * (null_cumsum/B)/true_cumsum) %>% 
  ungroup() %>% 
  within(FDR[is.na(F_statistic)] <- NA)

vem_hits_combo_df <- findHits(vem_fdr_combo_df, 0.1)

Happy to hear if this works for you / the results make sense

tobiasko commented 1 year ago

Your code seems to expect an additional variable/column named B in the bootstrapped data:

> ## combine FDR stat
> ## https://github.com/nkurzaw/TPP2D/issues/8
> fdr_combo_df <- 
+   bind_rows(fdr_tab_df1_B20_mc, fdr_tab_df2_B20_mc) %>% 
+   group_by(clustername, dataset) %>% 
+   filter(n() > 1,  length(unique(sign(slopeH1))) == 1, F_statistic == min(F_statistic)) %>%
+   ungroup %>% 
+   group_by(nObsRound) %>%
+   arrange(desc(F_statistic)) %>%
+   mutate(max_rank = n(), rank = dense_rank(dplyr::desc(F_statistic)), is_decoy = ifelse(dataset != "true", 1, 0)) %>%
+   mutate(all_null = sum(is_decoy), all_true = sum(!is_decoy), true_cumsum = cumsum(!is_decoy), null_cumsum = cumsum(is_decoy)) %>% 
+   mutate(pi = (all_true-true_cumsum)/((all_null-null_cumsum)/B)) %>% 
+   mutate(FDR = pi * (null_cumsum/B)/true_cumsum) %>% 
+   ungroup() %>% 
+   within(FDR[is.na(F_statistic)] <- NA)
Error in `mutate()`:
ℹ In argument: `pi = (all_true - true_cumsum)/((all_null - null_cumsum)/B)`.
ℹ In group 1: `nObsRound = 20`.
Caused by error:
! object 'B' not found
Run `rlang::last_trace()` to see where the error occurred.
>

Is guess it is a counter for the bootstrap rounds performed by bootstrapNullAlternativeModel()? My tables were bootstrapped using the default B = 20 and but lack this column.

tobiasko commented 1 year ago

Def. a variable B in the global environment solves the code issue.

nkurzaw / TPP2D

import2Dataset() crashes when providing replicated measurements #8