stemangiola / tidybulk

Brings bulk and pseudobulk transcriptomics to the tidyverse
https://stemangiola.github.io/tidybulk/
165 stars 25 forks source link

Normalization method (scaling) #225

Closed ConYel closed 2 years ago

ConYel commented 3 years ago

Dear Stefano, thank you much for this very useful package! I would like to use it in my workflow but I fail to understand if it produces identical results regarding edgeR or limma packages. for example : (I follow one of the examples you suggest)

load libraries

library(dplyr)
library(tidybulk)
library(stringr)
library(edgeR)

load dataset

library(airway)

convert to tidybulk tibble

counts_tt <-
    airway %>%
    tidybulk() %>%
    mutate(sample=str_remove(sample, "SRR1039")) %>%
    ensembl_to_symbol(feature)

counts_scaled <- counts_tt %>% scale_abundance(factor_of_interest = dex)
counts_scaled %>% dplyr::count(TMM, multiplier ) 
# A tibble: 8 x 3
    TMM multiplier     n
  <dbl>      <dbl> <int>
1 0.944      1.44  64102
2 0.955      1.02  64102
3 0.977      0.687 64102
4 0.988      0.825 64102
5 1.02       1.07  64102
6 1.03       1.05  64102
7 1.03       0.820 64102
8 1.06       0.947 64102

edgeR

# Prepare data set
dgList <- SE2DGEList(airway)
group <- factor(dgList$samples$dex)
keep.exprs <- filterByExpr(dgList, group=group)
dgList <- dgList[keep.exprs,, keep.lib.sizes=FALSE]

## scaled counts
dgList_norm <- calcNormFactors(dgList)

dgList_norm$samples %>% arrange(norm.factors)
           group lib.size norm.factors SampleName    cell   dex albut        Run avgLength
SRR1039513     1 15144524    0.9484696 GSM1275867 N052611   trt untrt SRR1039513        87
SRR1039521     1 21135511    0.9538599 GSM1275875 N061011   trt untrt SRR1039521        98
SRR1039517     1 30776089    0.9779832 GSM1275871 N080611   trt untrt SRR1039517       126
SRR1039512     1 25311320    0.9904567 GSM1275866 N052611 untrt untrt SRR1039512       126
SRR1039509     1 18783120    1.0212137 GSM1275863  N61311   trt untrt SRR1039509       126
SRR1039520     1 19094104    1.0269341 GSM1275874 N061011 untrt untrt SRR1039520       101
SRR1039516     1 24411867    1.0309324 GSM1275870 N080611 untrt untrt SRR1039516       120
SRR1039508     1 20608402    1.0554452 GSM1275862  N61311 untrt untrt SRR1039508       126
           Experiment    Sample    BioSample
SRR1039513  SRX384350 SRS508572 SAMN02422670
SRR1039521  SRX384358 SRS508580 SAMN02422677
SRR1039517  SRX384354 SRS508576 SAMN02422673
SRR1039512  SRX384349 SRS508571 SAMN02422678
SRR1039509  SRX384346 SRS508567 SAMN02422675
SRR1039520  SRX384357 SRS508579 SAMN02422683
SRR1039516  SRX384353 SRS508575 SAMN02422682
SRR1039508  SRX384345 SRS508568 SAMN02422669

check logcpm

counts_scaled %>% 
select(feature, sample, counts, counts_scaled) %>% 
mutate( log_scl = log10(counts_scaled + 4)) %>% 
filter( feature   %in% c("ENSG00000000003", "ENSG00000000419", "ENSG00000000457", "ENSG00000000460", "ENSG00000000971", "ENSG00000001036")) %>%  
select(-c(counts, counts_scaled)) %>% 
pivot_wider(names_from = sample, values_from = log_scl)
# A tibble: 6 x 9
  feature         SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
  <chr>                <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
1 ENSG00000000003       2.81       2.69       2.86       2.77       2.97       2.86       2.91       2.77
2 ENSG00000000419       2.65       2.75       2.71       2.72       2.69       2.74       2.65       2.72
3 ENSG00000000457       2.40       2.36       2.34       2.38       2.31       2.36       2.40       2.38
4 ENSG00000000460       1.78       1.80       1.57       1.74       1.83       1.67       1.92       1.82
5 ENSG00000000971       3.49       3.60       3.71       3.79       3.74       3.88       3.74       3.91
6 ENSG00000001036       3.13       3.06       3.16       3.11       3.07       3.00       3.16       3.06

##edgeR
lcpm_n <- cpm(dgList_norm, log=TRUE)

lcpm_n %>% head
                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
ENSG00000000003   4.968465   4.551329  5.1257352   4.832653   5.501732   5.124204   5.298663   4.831056
ENSG00000000419   4.430383   4.751672  4.6358756   4.672524   4.549387   4.735381   4.416679   4.660451
ENSG00000000457   3.590323   3.471364  3.4035325   3.524635   3.296633   3.470985   3.581820   3.517297
ENSG00000000460   1.510814   1.564892  0.7542605   1.337909   1.673791   1.127202   1.988089   1.616999
ENSG00000000971   7.224535   7.584134  7.9453477   8.209972   8.061492   8.517492   8.044721   8.631768
ENSG00000001036   6.043806   5.793293  6.1130816   5.940744   5.824597   5.581981   6.116828   5.783987

I probably did some calculation wrong, but neither $TMM nor $multiplier have any values identical to dgList_norm$samples$norm.factors

Is it possible to provide some extra information in this matter?

stemangiola commented 3 years ago

Thanks! Which workshop are you following?

Some comments

counts_tt <- airway %>% tidybulk() %>%

You don't need to transform to tidybulk from SummarizedExperiment necessarily.

counts_scaled <- counts_tt %>% scale_abundance(factor_of_interest = dex)

You need to identify (or keep) abundant gene transcripts, before scaling

counts_tt %>% identify_abundant(factor_of_interest = ...) %>% scale_abundance(...)

This will execute

filterByExpr groups

Let us know how you go. Do the TMM match now?

ConYel commented 3 years ago

Hello Stefano! I started from this one: https://stemangiola.github.io/bioc_2020_tidytranscriptomics/articles/tidytranscriptomics.html My goal is to see if I can use all the functions of edgeR and limma through tidybulk in order to reform a workflow for smallRNAs.

I didn't use the identify_abundant because I get:

counts_tt %>% identify_abundant(factor_of_interest = "dex")
Error in identify_abundant(., factor_of_interest = "dex") : 
  could not find function "identify_abundant"

My session info:

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dplyr_1.0.3                 airway_1.8.0                SummarizedExperiment_1.18.2
 [4] DelayedArray_0.14.1         matrixStats_0.58.0          Biobase_2.48.0             
 [7] GenomicRanges_1.40.0        GenomeInfoDb_1.24.2         IRanges_2.22.2             
[10] S4Vectors_0.26.1            BiocGenerics_0.34.0         tidybulk_1.0.2             

loaded via a namespace (and not attached):
 [1] pillar_1.4.7           compiler_4.0.3         RColorBrewer_1.1-2    
 [4] BiocManager_1.30.10    XVector_0.28.0         zlibbioc_1.34.0       
 [7] bitops_1.0-6           tools_4.0.3            digest_0.6.27         
[10] lattice_0.20-41        evaluate_0.14          lifecycle_0.2.0       
[13] tibble_3.0.5           preprocessCore_1.50.0  pkgconfig_2.0.3       
[16] rlang_0.4.10           Matrix_1.2-18          rstudioapi_0.13       
[19] cli_2.3.0              DBI_1.1.1              yaml_2.2.1            
[22] xfun_0.20              GenomeInfoDbData_1.2.3 knitr_1.31            
[25] generics_0.1.0         vctrs_0.3.6            hms_1.0.0             
[28] grid_4.0.3             tidyselect_1.1.0       glue_1.4.2            
[31] R6_2.5.0               fansi_0.4.2            rmarkdown_2.6         
[34] purrr_0.3.4            readr_1.4.0            tidyr_1.1.2           
[37] magrittr_2.0.1         ellipsis_0.3.1         htmltools_0.5.1.1     
[40] assertthat_0.2.1       utf8_1.1.4             RCurl_1.98-1.2        
[43] crayon_1.4.0 
stemangiola commented 3 years ago

Hello @ConYel,

I think you are using an old version.

I suggest to

Let us know how you go.

ConYel commented 2 years ago

Hello @stemangiola I updated everything as you suggested but still it doesn't give the same results.

my new session:

sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] tibble_3.1.5                   tidyr_1.1.4                    airway_1.14.0                 
 [4] tidySummarizedExperiment_1.4.1 SummarizedExperiment_1.24.0    Biobase_2.54.0                
 [7] GenomicRanges_1.46.0           GenomeInfoDb_1.30.0            IRanges_2.28.0                
[10] S4Vectors_0.32.0               BiocGenerics_0.40.0            MatrixGenerics_1.6.0          
[13] matrixStats_0.61.0             edgeR_3.36.0                   limma_3.50.0                  
[16] stringr_1.4.0                  tidybulk_1.6.1                 dplyr_1.0.7                   

loaded via a namespace (and not attached):
 [1] httr_1.4.2             viridisLite_0.4.0      jsonlite_1.7.2         bit64_4.0.5           
 [5] assertthat_0.2.1       BiocManager_1.30.16    blob_1.2.2             GenomeInfoDbData_1.2.7
 [9] yaml_2.2.1             pillar_1.6.4           RSQLite_2.2.8          lattice_0.20-45       
[13] glue_1.4.2             digest_0.6.28          RColorBrewer_1.1-2     XVector_0.34.0        
[17] colorspace_2.0-2       htmltools_0.5.2        preprocessCore_1.56.0  Matrix_1.3-4          
[21] pkgconfig_2.0.3        zlibbioc_1.40.0        purrr_0.3.4            scales_1.1.1          
[25] tzdb_0.2.0             KEGGREST_1.34.0        generics_0.1.1         ggplot2_3.3.5         
[29] ellipsis_0.3.2         cachem_1.0.6           lazyeval_0.2.2         cli_3.1.0             
[33] magrittr_2.0.1         crayon_1.4.2           memoise_2.0.0          evaluate_0.14         
[37] fansi_0.5.0            data.table_1.14.2      tools_4.1.2            org.Hs.eg.db_3.14.0   
[41] hms_1.1.1              lifecycle_1.0.1        plotly_4.10.0          munsell_0.5.0         
[45] locfit_1.5-9.4         DelayedArray_0.20.0    AnnotationDbi_1.56.1   Biostrings_2.62.0     
[49] compiler_4.1.2         rlang_0.4.12           grid_4.1.2             RCurl_1.98-1.5        
[53] rstudioapi_0.13        htmlwidgets_1.5.4      bitops_1.0-7           rmarkdown_2.11        
[57] gtable_0.3.0           DBI_1.1.1              R6_2.5.1               knitr_1.36            
[61] fastmap_1.1.0          bit_4.0.4              utf8_1.2.2             readr_2.0.2           
[65] stringi_1.7.5          parallel_4.1.2         Rcpp_1.0.7             vctrs_0.3.8           
[69] png_0.1-7              tidyselect_1.1.1       xfun_0.27             

Results

counts_scaled %>%  group_by(sample, TMM, multiplier ) %>% summarise()
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
`summarise()` has grouped output by 'sample', 'TMM'. You can override using the `.groups` argument.
# A tibble: 8 x 3
# Groups:   sample, TMM [8]
  sample       TMM multiplier
  <chr>      <dbl>      <dbl>
1 SRR1039508 1.06        1.42
2 SRR1039509 1.02        1.60
3 SRR1039512 0.987       1.23
4 SRR1039513 0.951       2.14
5 SRR1039516 1.03        1.22
6 SRR1039517 0.975       1.03
7 SRR1039520 1.02        1.58
8 SRR1039521 0.959       1.52
Warning message:
In is_sample_feature_deprecated_used(.data, (enquos(..., .ignore_empty = "all") %>%  :
  tidySummarizedExperiment says: from version 1.3.1, the special columns including sample/feature id (colnames(se), rownames(se)) has changed to ".sample" and ".feature". This dataset is returned with the old-style vocabulary (feature and sample), however we suggest to update your workflow to reflect the new vocabulary (.feature, .sample)

dgList_norm$samples %>% arrange(norm.factors) %>% select(norm.factors)
           norm.factors
SRR1039513    0.9484696
SRR1039521    0.9538599
SRR1039517    0.9779832
SRR1039512    0.9904567
SRR1039509    1.0212137
SRR1039520    1.0269341
SRR1039516    1.0309324
SRR1039508    1.0554452

counts_scaled %>% 
+ select(feature, sample, counts, counts_scaled) %>% 
+ mutate( log_scl = log10(counts_scaled + 4)) %>% 
+ filter( feature   %in% c("ENSG00000000003", "ENSG00000000419", "ENSG00000000457", "ENSG00000000460", "ENSG00000000971", "ENSG00000001036")) %>%  
+ select(-c(counts, counts_scaled)) %>% 
+ pivot_wider(names_from = sample, values_from = log_scl)
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
# A tibble: 6 x 9
  feature         SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
  <chr>                <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
1 ENSG00000000003       2.98       2.86       3.03       2.94       3.14       3.03       3.09       2.94
2 ENSG00000000419       2.82       2.92       2.89       2.89       2.86       2.92       2.82       2.89
3 ENSG00000000457       2.57       2.53       2.52       2.55       2.48       2.54       2.57       2.55
4 ENSG00000000460       1.95       1.96       1.73       1.90       2.00       1.84       2.09       1.98
5 ENSG00000000971       3.66       3.77       3.88       3.96       3.92       4.05       3.91       4.08
6 ENSG00000001036       3.31       3.23       3.33       3.28       3.24       3.17       3.33       3.23

lcpm_n %>% head
                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
ENSG00000000003   4.968465   4.551329  5.1257352   4.832653   5.501732   5.124204   5.298663   4.831056
ENSG00000000419   4.430383   4.751672  4.6358756   4.672524   4.549387   4.735381   4.416679   4.660451
ENSG00000000457   3.590323   3.471364  3.4035325   3.524635   3.296633   3.470985   3.581820   3.517297
ENSG00000000460   1.510814   1.564892  0.7542605   1.337909   1.673791   1.127202   1.988089   1.616999
ENSG00000000971   7.224535   7.584134  7.9453477   8.209972   8.061492   8.517492   8.044721   8.631768
ENSG00000001036   6.043806   5.793293  6.1130816   5.940744   5.824597   5.581981   6.116828   5.783987
stemangiola commented 2 years ago

Hello @ConYel

I think it depends on the reference sample that tidybulk and your other code are choosing.

reference_sample | A character string. The name of the reference sample. If NULL the sample with highest total read count will be selected as reference.
library(tidybulk)
data("se_mini")
# Just getting a tibble out of a SE for testing purposes
input_df = se_mini %>% tidybulk() %>% as_tibble() %>% setNames(c("b","a",  "c", "Cell type", "time" , "condition"))

input_df %>% 
    identify_abundant(a, b, c) %>%
    scale_abundance(
        .sample = a,
        .transcript = b,
        .abundance = c, 
        reference_sample = "SRR1740035",
        action = "only"
    )
# A tibble: 5 x 3
  a            TMM multiplier
  <fct>      <dbl>      <dbl>
1 SRR1740034 0.871      1.30 
2 SRR1740035 0.849      1.18 
3 SRR1740043 0.843      2.70 
4 SRR1740058 1.13       0.970
5 SRR1740067 1.42       1.83 

Now, from input_df try to use your edgeR code to get the norm.factors choosing the same reference sample.

input_df %>% 
rename(counts = c) %>% 
    as_SummarizedExperiment(a,b,counts) %>%
     keep_abundant() %>%
     edgeR::calcNormFactors(refColumn=2)
$counts
      SRR1740034 SRR1740035 SRR1740043 SRR1740058 SRR1740067
ACAP1       7151       8028       1470      12264        616
ACP5        2278       2666        350        489        264
ADRB2        298        368        116        153        722
AIM2        3050       3004        304         56        357
ALOX5      10064      11521       9457        118       7852
177 more rows ...

$samples
           group lib.size norm.factors         Cell.type time condition
SRR1740034     1   520515    0.8709091            b_cell  0 d      TRUE
SRR1740035     1   589046    0.8488030            b_cell  1 d      TRUE
SRR1740043     1   258771    0.8431936          monocyte  1 d     FALSE
SRR1740058     1   536818    1.1309222            t_cell  0 d      TRUE
SRR1740067     1   227024    1.4186009 dendritic_myeloid  1 d     FALSE

The TMM factor match, we then calculate scaled counts. The final counts may vary up to a scalar depending on what reference you decide, and small details. Nonetheless, if you plot tidybulk scaled counts, with your own scaled counts (provided the same TMM) should sit at a 45' line. The rest of the analyses will provide the same outcome.

ConYel commented 2 years ago

Hello @stemangiola Thank you for the clarification! I followed your advice and I managed to get the same norm.factors and TMM values using the same sample as reference.

I saw that there was reference_selection_function which is Deprecated.

In the calcNormFactors it is used: If refColumn is unspecified, then the column whose count-per-million upper quartile is closest to the mean upper quartile is set as the reference library. Did reference_selection_function had the same option it has edgeR when refColumn is NULL?

Is it possible to include it as a default (for NULL) or as an option, or as a warning in which it will inform which sample the scale_abundance() is using as reference?

Thank you for your time!

stemangiola commented 2 years ago

If refColumn is unspecified, then the column whose count-per-million upper quartile is closest to the mean upper quartile is set as the reference library.

This would make sense. However, while for modelling does not really matter which reference you pick, for calculating scaled counts for visualisation, and other statistics, we would lose some resolution for lowly transcribed transcripts in case a sample was deeply sequenced. If we pick the most deeply sequenced as reference we would not lose any resolution, at the cost to insert some quantization random error for the shallow sequenced.

I would need a more compelling argument, but it is a good point of discussion.

or as a warning in which it will inform which sample the scale_abundance() is using as reference

This is less impactful. I will implement this as a message.

stemangiola commented 2 years ago

Thanks @ConYel for your issue. I implemented your suggestion. Feel free to reopen if necessarily.

ConYel commented 2 years ago

Thank you @stemangiola for the clarification!

I totally understand your point!

My thinking was more about "backward compatibility" with the original function. In order to also keep reproducibility between workflows, in case someone wants to move from the original to the tidy way.