Closed ConYel closed 2 years ago
Thanks! Which workshop are you following?
Some comments
counts_tt <- airway %>% tidybulk() %>%
You don't need to transform to tidybulk from SummarizedExperiment necessarily.
counts_scaled <- counts_tt %>% scale_abundance(factor_of_interest = dex)
You need to identify (or keep) abundant gene transcripts, before scaling
counts_tt %>% identify_abundant(factor_of_interest = ...) %>% scale_abundance(...)
This will execute
filterByExpr groups
Let us know how you go. Do the TMM match now?
Hello Stefano! I started from this one: https://stemangiola.github.io/bioc_2020_tidytranscriptomics/articles/tidytranscriptomics.html My goal is to see if I can use all the functions of edgeR and limma through tidybulk in order to reform a workflow for smallRNAs.
I didn't use the identify_abundant
because I get:
counts_tt %>% identify_abundant(factor_of_interest = "dex")
Error in identify_abundant(., factor_of_interest = "dex") :
could not find function "identify_abundant"
My session info:
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_1.0.3 airway_1.8.0 SummarizedExperiment_1.18.2
[4] DelayedArray_0.14.1 matrixStats_0.58.0 Biobase_2.48.0
[7] GenomicRanges_1.40.0 GenomeInfoDb_1.24.2 IRanges_2.22.2
[10] S4Vectors_0.26.1 BiocGenerics_0.34.0 tidybulk_1.0.2
loaded via a namespace (and not attached):
[1] pillar_1.4.7 compiler_4.0.3 RColorBrewer_1.1-2
[4] BiocManager_1.30.10 XVector_0.28.0 zlibbioc_1.34.0
[7] bitops_1.0-6 tools_4.0.3 digest_0.6.27
[10] lattice_0.20-41 evaluate_0.14 lifecycle_0.2.0
[13] tibble_3.0.5 preprocessCore_1.50.0 pkgconfig_2.0.3
[16] rlang_0.4.10 Matrix_1.2-18 rstudioapi_0.13
[19] cli_2.3.0 DBI_1.1.1 yaml_2.2.1
[22] xfun_0.20 GenomeInfoDbData_1.2.3 knitr_1.31
[25] generics_0.1.0 vctrs_0.3.6 hms_1.0.0
[28] grid_4.0.3 tidyselect_1.1.0 glue_1.4.2
[31] R6_2.5.0 fansi_0.4.2 rmarkdown_2.6
[34] purrr_0.3.4 readr_1.4.0 tidyr_1.1.2
[37] magrittr_2.0.1 ellipsis_0.3.1 htmltools_0.5.1.1
[40] assertthat_0.2.1 utf8_1.1.4 RCurl_1.98-1.2
[43] crayon_1.4.0
Hello @ConYel,
I think you are using an old version.
I suggest to
Let us know how you go.
Hello @stemangiola I updated everything as you suggested but still it doesn't give the same results.
my new session:
sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] tibble_3.1.5 tidyr_1.1.4 airway_1.14.0
[4] tidySummarizedExperiment_1.4.1 SummarizedExperiment_1.24.0 Biobase_2.54.0
[7] GenomicRanges_1.46.0 GenomeInfoDb_1.30.0 IRanges_2.28.0
[10] S4Vectors_0.32.0 BiocGenerics_0.40.0 MatrixGenerics_1.6.0
[13] matrixStats_0.61.0 edgeR_3.36.0 limma_3.50.0
[16] stringr_1.4.0 tidybulk_1.6.1 dplyr_1.0.7
loaded via a namespace (and not attached):
[1] httr_1.4.2 viridisLite_0.4.0 jsonlite_1.7.2 bit64_4.0.5
[5] assertthat_0.2.1 BiocManager_1.30.16 blob_1.2.2 GenomeInfoDbData_1.2.7
[9] yaml_2.2.1 pillar_1.6.4 RSQLite_2.2.8 lattice_0.20-45
[13] glue_1.4.2 digest_0.6.28 RColorBrewer_1.1-2 XVector_0.34.0
[17] colorspace_2.0-2 htmltools_0.5.2 preprocessCore_1.56.0 Matrix_1.3-4
[21] pkgconfig_2.0.3 zlibbioc_1.40.0 purrr_0.3.4 scales_1.1.1
[25] tzdb_0.2.0 KEGGREST_1.34.0 generics_0.1.1 ggplot2_3.3.5
[29] ellipsis_0.3.2 cachem_1.0.6 lazyeval_0.2.2 cli_3.1.0
[33] magrittr_2.0.1 crayon_1.4.2 memoise_2.0.0 evaluate_0.14
[37] fansi_0.5.0 data.table_1.14.2 tools_4.1.2 org.Hs.eg.db_3.14.0
[41] hms_1.1.1 lifecycle_1.0.1 plotly_4.10.0 munsell_0.5.0
[45] locfit_1.5-9.4 DelayedArray_0.20.0 AnnotationDbi_1.56.1 Biostrings_2.62.0
[49] compiler_4.1.2 rlang_0.4.12 grid_4.1.2 RCurl_1.98-1.5
[53] rstudioapi_0.13 htmlwidgets_1.5.4 bitops_1.0-7 rmarkdown_2.11
[57] gtable_0.3.0 DBI_1.1.1 R6_2.5.1 knitr_1.36
[61] fastmap_1.1.0 bit_4.0.4 utf8_1.2.2 readr_2.0.2
[65] stringi_1.7.5 parallel_4.1.2 Rcpp_1.0.7 vctrs_0.3.8
[69] png_0.1-7 tidyselect_1.1.1 xfun_0.27
counts_scaled %>% group_by(sample, TMM, multiplier ) %>% summarise()
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
`summarise()` has grouped output by 'sample', 'TMM'. You can override using the `.groups` argument.
# A tibble: 8 x 3
# Groups: sample, TMM [8]
sample TMM multiplier
<chr> <dbl> <dbl>
1 SRR1039508 1.06 1.42
2 SRR1039509 1.02 1.60
3 SRR1039512 0.987 1.23
4 SRR1039513 0.951 2.14
5 SRR1039516 1.03 1.22
6 SRR1039517 0.975 1.03
7 SRR1039520 1.02 1.58
8 SRR1039521 0.959 1.52
Warning message:
In is_sample_feature_deprecated_used(.data, (enquos(..., .ignore_empty = "all") %>% :
tidySummarizedExperiment says: from version 1.3.1, the special columns including sample/feature id (colnames(se), rownames(se)) has changed to ".sample" and ".feature". This dataset is returned with the old-style vocabulary (feature and sample), however we suggest to update your workflow to reflect the new vocabulary (.feature, .sample)
dgList_norm$samples %>% arrange(norm.factors) %>% select(norm.factors)
norm.factors
SRR1039513 0.9484696
SRR1039521 0.9538599
SRR1039517 0.9779832
SRR1039512 0.9904567
SRR1039509 1.0212137
SRR1039520 1.0269341
SRR1039516 1.0309324
SRR1039508 1.0554452
counts_scaled %>%
+ select(feature, sample, counts, counts_scaled) %>%
+ mutate( log_scl = log10(counts_scaled + 4)) %>%
+ filter( feature %in% c("ENSG00000000003", "ENSG00000000419", "ENSG00000000457", "ENSG00000000460", "ENSG00000000971", "ENSG00000001036")) %>%
+ select(-c(counts, counts_scaled)) %>%
+ pivot_wider(names_from = sample, values_from = log_scl)
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
# A tibble: 6 x 9
feature SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000000003 2.98 2.86 3.03 2.94 3.14 3.03 3.09 2.94
2 ENSG00000000419 2.82 2.92 2.89 2.89 2.86 2.92 2.82 2.89
3 ENSG00000000457 2.57 2.53 2.52 2.55 2.48 2.54 2.57 2.55
4 ENSG00000000460 1.95 1.96 1.73 1.90 2.00 1.84 2.09 1.98
5 ENSG00000000971 3.66 3.77 3.88 3.96 3.92 4.05 3.91 4.08
6 ENSG00000001036 3.31 3.23 3.33 3.28 3.24 3.17 3.33 3.23
lcpm_n %>% head
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
ENSG00000000003 4.968465 4.551329 5.1257352 4.832653 5.501732 5.124204 5.298663 4.831056
ENSG00000000419 4.430383 4.751672 4.6358756 4.672524 4.549387 4.735381 4.416679 4.660451
ENSG00000000457 3.590323 3.471364 3.4035325 3.524635 3.296633 3.470985 3.581820 3.517297
ENSG00000000460 1.510814 1.564892 0.7542605 1.337909 1.673791 1.127202 1.988089 1.616999
ENSG00000000971 7.224535 7.584134 7.9453477 8.209972 8.061492 8.517492 8.044721 8.631768
ENSG00000001036 6.043806 5.793293 6.1130816 5.940744 5.824597 5.581981 6.116828 5.783987
Hello @ConYel
I think it depends on the reference sample that tidybulk and your other code are choosing.
reference_sample | A character string. The name of the reference sample. If NULL the sample with highest total read count will be selected as reference.
library(tidybulk)
data("se_mini")
# Just getting a tibble out of a SE for testing purposes
input_df = se_mini %>% tidybulk() %>% as_tibble() %>% setNames(c("b","a", "c", "Cell type", "time" , "condition"))
input_df %>%
identify_abundant(a, b, c) %>%
scale_abundance(
.sample = a,
.transcript = b,
.abundance = c,
reference_sample = "SRR1740035",
action = "only"
)
# A tibble: 5 x 3
a TMM multiplier
<fct> <dbl> <dbl>
1 SRR1740034 0.871 1.30
2 SRR1740035 0.849 1.18
3 SRR1740043 0.843 2.70
4 SRR1740058 1.13 0.970
5 SRR1740067 1.42 1.83
Now, from input_df
try to use your edgeR code to get the norm.factors
choosing the same reference sample.
input_df %>%
rename(counts = c) %>%
as_SummarizedExperiment(a,b,counts) %>%
keep_abundant() %>%
edgeR::calcNormFactors(refColumn=2)
$counts
SRR1740034 SRR1740035 SRR1740043 SRR1740058 SRR1740067
ACAP1 7151 8028 1470 12264 616
ACP5 2278 2666 350 489 264
ADRB2 298 368 116 153 722
AIM2 3050 3004 304 56 357
ALOX5 10064 11521 9457 118 7852
177 more rows ...
$samples
group lib.size norm.factors Cell.type time condition
SRR1740034 1 520515 0.8709091 b_cell 0 d TRUE
SRR1740035 1 589046 0.8488030 b_cell 1 d TRUE
SRR1740043 1 258771 0.8431936 monocyte 1 d FALSE
SRR1740058 1 536818 1.1309222 t_cell 0 d TRUE
SRR1740067 1 227024 1.4186009 dendritic_myeloid 1 d FALSE
The TMM factor match, we then calculate scaled counts. The final counts may vary up to a scalar depending on what reference you decide, and small details. Nonetheless, if you plot tidybulk scaled counts, with your own scaled counts (provided the same TMM) should sit at a 45' line. The rest of the analyses will provide the same outcome.
Hello @stemangiola
Thank you for the clarification!
I followed your advice and I managed to get the same norm.factors
and TMM
values using the same sample as reference.
I saw that there was reference_selection_function
which is Deprecated.
In the calcNormFactors
it is used:
If refColumn is unspecified, then the column whose count-per-million upper quartile is closest to the mean upper quartile is set as the reference library.
Did reference_selection_function
had the same option it has edgeR when refColumn is NULL
?
Is it possible to include it as a default (for NULL) or as an option, or as a warning in which it will inform which sample the scale_abundance()
is using as reference?
Thank you for your time!
If refColumn is unspecified, then the column whose count-per-million upper quartile is closest to the mean upper quartile is set as the reference library.
This would make sense. However, while for modelling does not really matter which reference you pick, for calculating scaled counts for visualisation, and other statistics, we would lose some resolution for lowly transcribed transcripts in case a sample was deeply sequenced. If we pick the most deeply sequenced as reference we would not lose any resolution, at the cost to insert some quantization random error for the shallow sequenced.
I would need a more compelling argument, but it is a good point of discussion.
or as a warning in which it will inform which sample the
scale_abundance()
is using as reference
This is less impactful. I will implement this as a message.
Thanks @ConYel for your issue. I implemented your suggestion. Feel free to reopen if necessarily.
Thank you @stemangiola for the clarification!
I totally understand your point!
My thinking was more about "backward compatibility" with the original function. In order to also keep reproducibility between workflows, in case someone wants to move from the original to the tidy way.
Dear Stefano, thank you much for this very useful package! I would like to use it in my workflow but I fail to understand if it produces identical results regarding edgeR or limma packages. for example : (I follow one of the examples you suggest)
load libraries
load dataset
convert to tidybulk tibble
edgeR
check logcpm
I probably did some calculation wrong, but neither $TMM nor $multiplier have any values identical to
dgList_norm$samples$norm.factors
Is it possible to provide some extra information in this matter?