perishky / meffil

Efficient algorithms for analyzing DNA methylation data.
Artistic License 2.0
53 stars 28 forks source link

Negative estimated cell counts? #11

Closed jnguofa closed 5 years ago

jnguofa commented 5 years ago

Hello, Thanks for the great package. I only have pre-processed betas (from cord - see distribution below) so I must use meffil.estimate.cell.counts.from.betas. There are a lot of negative values, especially for NK. Of note, I did remove SNP-related probes, XY chromosome probes and low variability (<5%) probes and batch corrected for plate.

cellcountref = "andrews and bakulski cord blood"

cellcounts3<-meffil.estimate.cell.counts.from.betas(ARIEScordbetapoint0stICA,cell.type.reference=cellcountref,verbose=T) head(cellcounts3) Bcell CD4T CD8T Gran Mono NK nRBC 35 0.000000e+00 0.19415426 1.436839e-01 0.4361359 2.176463e-18 2.699185e-19 -6.938894e-18 42 -5.335846e-18 0.19536009 0.000000e+00 0.5927273 0.000000e+00 -8.517630e-18 2.775558e-17 45 -4.336809e-19 0.19231770 5.951174e-20 0.5538809 6.938894e-18 -9.660659e-18 0.000000e+00 50 -5.403754e-18 0.08274238 9.387805e-02 0.5950767 0.000000e+00 -9.282177e-18 0.000000e+00 73 -2.185562e-18 0.00000000 4.426202e-03 0.8130283 -1.199445e-18 0.000000e+00 0.000000e+00 118 1.168658e-01 0.03662122 6.565257e-02 0.5457070 1.602137e-18 -1.042258e-18 0.000000e+00

cellcountref2 ="cord blood gse68456" cellcounts4<-meffil.estimate.cell.counts.from.betas(as.matrix(ARIEScordbetapoint0stICA),cell.type.reference=cellcountref2,verbose=T) head(cellcounts4) Bcell CD4T CD8T Gran Mono NK RBC 35 0.000000e+00 1.093173e-01 1.074017e-01 0.5587779 -3.088042e-19 3.544531e-03 0.03901592 42 6.938894e-18 8.018818e-02 6.655908e-02 0.6905427 0.000000e+00 0.000000e+00 0.00000000 45 0.000000e+00 6.050381e-02 9.786610e-02 0.6523306 0.000000e+00 0.000000e+00 0.00000000 50 0.000000e+00 6.646457e-02 5.677843e-02 0.6550116 2.477728e-02 0.000000e+00 0.00000000 73 -1.526199e-18 1.565778e-02 9.471646e-19 0.8668614 0.000000e+00 0.000000e+00 0.00000000 118 2.728014e-02 9.222220e-19 1.876269e-19 0.6439960 2.631825e-02 8.673617e-19 0.12577491

There are fewer negative values with the second reference but the distribution of estimated values between the two are very similar (see below)

Thanks for any thoughts, J

Below is my session info:

sessionInfo() R version 3.4.3 (2017-11-30) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale: [1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252 LC_MONETARY=English_Canada.1252 LC_NUMERIC=C LC_TIME=English_Canada.1252

attached base packages: [1] stats4 parallel stats graphics grDevices utils datasets methods base

other attached packages: [1] FlowSorted.CordBloodNorway.450k_1.4.0 minfi_1.24.0 bumphunter_1.20.0 locfit_1.5-9.1
[5] iterators_1.0.10 foreach_1.4.4 Biostrings_2.46.0 XVector_0.18.0
[9] SummarizedExperiment_1.8.1 DelayedArray_0.4.1 Biobase_2.38.0 GenomicRanges_1.30.3
[13] GenomeInfoDb_1.14.0 IRanges_2.12.0 S4Vectors_0.16.0 BiocGenerics_0.24.0
[17] meffil_1.0.0 statmod_1.4.30 quadprog_1.5-5 DNAcopy_1.52.0
[21] fastICA_1.2-1 lme4_1.1-19 Matrix_1.2-15 multcomp_1.4-8
[25] TH.data_1.0-9 survival_2.43-3 mvtnorm_1.0-8 matrixStats_0.54.0
[29] markdown_0.9 gridExtra_2.3 Cairo_1.5-9 knitr_1.21
[33] reshape2_1.4.3 plyr_1.8.4 ggplot2_3.1.0 sva_3.26.0
[37] BiocParallel_1.12.0 genefilter_1.60.0 mgcv_1.8-26 nlme_3.1-137
[41] limma_3.34.9 MASS_7.3-51.1 illuminaio_0.20.0

loaded via a namespace (and not attached): [1] minqa_1.2.4 colorspace_1.4-0 siggenes_1.52.0 mclust_5.4.2 base64_2.0 rstudioapi_0.9.0 bit64_0.9-7
[8] AnnotationDbi_1.40.0 xml2_1.2.0 codetools_0.2-16 splines_3.4.3 nloptr_1.2.1 Rsamtools_1.30.0 annotate_1.56.2
[15] readr_1.3.1 compiler_3.4.3 httr_1.4.0 assertthat_0.2.0 lazyeval_0.2.1 prettyunits_1.0.2 tools_3.4.3
[22] bindrcpp_0.2.2 gtable_0.2.0 glue_1.3.0 GenomeInfoDbData_1.0.0 dplyr_0.7.8 doRNG_1.7.1 Rcpp_1.0.0
[29] multtest_2.34.0 preprocessCore_1.40.0 rtracklayer_1.38.3 xfun_0.4 stringr_1.3.1 rngtools_1.3.1 XML_3.98-1.16
[36] beanplot_1.2 zlibbioc_1.24.0 zoo_1.8-4 scales_1.0.0 hms_0.4.2 sandwich_2.5-0 GEOquery_2.46.15
[43] RColorBrewer_1.1-2 yaml_2.2.0 memoise_1.1.0 pkgmaker_0.27 biomaRt_2.34.2 reshape_0.8.8 stringi_1.2.4
[50] RSQLite_2.1.1 RMySQL_0.10.16 GenomicFeatures_1.30.3 bibtex_0.4.2 rlang_0.3.1 pkgconfig_2.0.2 bitops_1.0-6
[57] nor1mix_1.2-3 lattice_0.20-38 purrr_0.2.5 bindr_0.1.1 GenomicAlignments_1.14.2 bit_1.1-14 tidyselect_0.2.5
[64] magrittr_1.5 R6_2.3.0 DBI_1.0.0 pillar_1.3.1 withr_2.1.2 RCurl_1.95-4.11 tibble_2.0.1
[71] crayon_1.3.4 progress_1.2.0 grid_3.4.3 data.table_1.12.0 blob_1.1.1 digest_0.6.18 xtable_1.8-3
[78] tidyr_0.8.2 openssl_1.1 munsell_0.5.0 registry_0.5

Below is the histogram of my betas (to show it's bound by zero and 1.) image

Histogram of estimated cell counts4 image

Histogram of estimated cell counts3 image

perishky commented 5 years ago

If you look closely, you'll notice that the negative values are all extremely close to 0, e.g. the value -8.517630e-18 = -8.5 x 10^-18. They won't show up as negative in the histograms because the histogram doesn't provide that level of precision (the range size of the bars is 0.05). For analysis, these negative values will be no different from 0's.

jnguofa commented 5 years ago

Thanks!