missing data is still present after imputation in RF

Astahlke commented 5 years ago

Hi Thierry,

We have a new RF issue with v0.0.20, where a warning indicates that there's still missing data after imputation, even though I don't see any NA in the imputed genlight object.

Thanks for any help!

Amanda

>   gc <- radiator::genomic_converter(data = miss.genlight,
+                                     output = "genlight",
+                                     imputation.method = "rf",
+                                     monomorphic.out = FALSE,
+                                     hierarchical.levels = "global",
+                                     verbose = TRUE)

#######################################################################
##################### radiator::genomic_converter #####################
#######################################################################
Function arguments and values:
Working directory: /mnt/ceph/stah3621/imputation
Input file: from global environment
Strata: no
Population levels: no
Population labels: no
Output format(s): tidy, genlight
Filename prefix: no
Filters:
Blacklist of individuals: no
Blacklist of genotypes: no
Whitelist of markers: no
monomorphic.out: FALSE
snp.ld: no
common.markers: TRUE
max.marker: no
pop.select: no
maf.thresholds: no

Imputations options:
imputation.method: rf
hierarchical.levels: global

parallel.core: 47

#######################################################################

Importing data

Number of markers missing in all individuals and removed: 1

Tidy genomic data:
    Number of markers: 500
    Number of chromosome/contig/scaffold: 1
    Number of individuals: 94

Preparing data for output

Data is bi-allelic

#######################################################################
####################### grur::grur_imputations ########################
#######################################################################
Imputation method: rf
Hierarchical levels: global
On-the-fly-imputations options:
    number of trees to grow: 50
    minimum terminal node size: 1
    non-negative integer value used to specify random splitting: 10
    number of iterations: 10
Number of CPUs: 47
Note: If you have speed issues: follow radiator's vignette on parallel computing

Number of populations: 1
Number of individuals: 94
Number of markers: 500

Proportion of missing genotypes before imputations: 0.298319
On-the-fly-imputations using Random Forests algorithm
Imputations computed globally, take a break...
Adjusting REF/ALT alleles to account for imputations...
    generating REF/ALT dictionary
    integrating new genotype codings...

Proportion of missing genotypes after imputations: 0

Computation time: 8 sec
################## grur::grur_imputations completed ###################
Generating adegenet genlight object without imputation
Generating adegenet genlight object WITH imputations

Writing tidy data set:
radiator_data_20190117@1007.rad

Writing tidy data set:
radiator_data_20190117@1007.rad
############################### RESULTS ###############################
Data format of input: genlight
Biallelic data
Number of common markers: 500
Number of chromosome/contig/scaffold: 1
Number of individuals 94

Computation time: 11 sec
################ radiator::genomic_converter completed ################
Warning messages:
1: In cleanup(mc.cleanup) : unable to terminate child: No such process
2: In radiator::radiator_imputations_module(data = input, imputation.method = imputation.method,  :
  Missing data is still present in the dataset
    2 options:
    run the function again with hierarchical.levels = 'global'
    use common.markers = TRUE when using hierarchical.levels = 'strata'

> which(is.na(as.matrix(gc$genlight.imputed)))
integer(0)

> which(is.na(as.matrix(gc$genlight.no.imputation)))[1:10]
 [1]  1  2  4  5  8  9 11 12 16 17
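For anyone repeating the two checks above, here is a minimal base-R sketch of the same idea on a toy matrix. The matrix `geno` and its values are made up for illustration; it only stands in for the individuals-by-markers matrix that `as.matrix()` returns for a genlight object, and nothing here calls radiator or grur.

```r
# Toy genotype matrix (rows = individuals, columns = markers).
# Values are invented for illustration, not taken from the real data.
geno <- matrix(c( 0,  1, NA,  2,
                 NA,  1,  0,  0,
                  2, NA,  1,  1),
               nrow = 3, byrow = TRUE)

n_missing    <- sum(is.na(geno))      # number of missing genotypes
prop_missing <- mean(is.na(geno))     # proportion of missing genotypes
na_index     <- which(is.na(geno))    # column-major positions, as which() reports them
per_marker   <- colSums(is.na(geno))  # missing count per marker (column)
per_ind      <- rowSums(is.na(geno))  # missing count per individual (row)

print(prop_missing)
print(na_index)
```

The per-marker and per-individual counts are sometimes more informative than a single `which()` call, because they show whether the missing values cluster in a few markers or samples.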

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /opt/modules/devel/R/3.5.0/lib64/R/lib/libRblas.so
LAPACK: /opt/modules/devel/R/3.5.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:
 [1] bindrcpp_0.2.2        randomForestSRC_2.8.0 psych_1.8.10
 [4] vegan_2.5-3           lattice_0.20-38       permute_0.9-4
 [7] tidyr_0.8.2           adegenet_2.1.1        ade4_1.7-13
[10] radiator_0.0.18

loaded via a namespace (and not attached):
 [1] nlme_3.1-137      fs_1.2.6          usethis_1.4.0
 [4] devtools_2.0.1    gmodels_2.18.1    rprojroot_1.3-2
 [7] tools_3.5.0       backports_1.1.3   R6_2.3.0
[10] spData_0.3.0      lazyeval_0.2.1    mgcv_1.8-26
[13] colorspace_1.4-0  withr_2.1.2       sp_1.3-1
[16] tidyselect_0.2.5  prettyunits_1.0.2 mnormt_1.5-5
[19] processx_3.2.1    curl_3.3          compiler_3.5.0
[22] cli_1.0.1         expm_0.999-3      desc_1.2.0
[25] scales_1.0.0      readr_1.3.1       callr_3.1.1
[28] stringr_1.3.1     digest_0.6.18     foreign_0.8-71
[31] pkgconfig_2.0.2   htmltools_0.3.6   fst_0.8.10
[34] sessioninfo_1.1.1 rlang_0.3.1       shiny_1.2.0
[37] bindr_0.1.1       gtools_3.8.1      spdep_0.8-1
[40] dplyr_0.7.8       magrittr_1.5      Matrix_1.2-15
[43] Rcpp_1.0.0        munsell_0.5.0     ape_5.2
[46] stringi_1.2.4     MASS_7.3-51.1     pkgbuild_1.0.2
[49] plyr_1.8.4        grid_3.5.0        parallel_3.5.0
[52] gdata_2.18.0      listenv_0.7.0     promises_1.0.1
[55] crayon_1.3.4      deldir_0.1-15     splines_3.5.0
[58] hms_0.4.2         ps_1.3.0          pillar_1.3.1
[61] igraph_1.2.2      boot_1.3-20       seqinr_3.4-5
[64] reshape2_1.4.3    codetools_0.2-16  pkgload_1.0.2
[67] LearnBayes_2.15.1 glue_1.3.0        data.table_1.12.0
[70] remotes_2.0.2     httpuv_1.4.5.1    testthat_2.0.1
[73] gtable_0.2.0      purrr_0.2.5       future_1.10.0
[76] amap_0.8-16       assertthat_0.2.0  ggplot2_3.1.0
[79] mime_0.6          xtable_1.8-3      coda_0.19-2
[82] later_0.7.5       tibble_2.0.1      pbmcapply_1.3.1
[85] memoise_1.1.0     cluster_2.0.7-1   globals_0.12.4

Astahlke commented 5 years ago

Hi Thierry,

I've done a little more troubleshooting here. I can't replicate this issue on my local Mac, but it is persistent on the HPC. Why is the imputation module throwing a missing data warning when there aren't any NAs in the data set?

Any ideas?

Thank you!

>   gc <- radiator::genomic_converter(data = miss.genlight,
+                                     output = "genlight",
+                                     imputation.method = "rf",
+                                     monomorphic.out = FALSE,
+                                     hierarchical.levels = "global",
+                                     verbose = TRUE)
#######################################################################
##################### radiator::genomic_converter #####################
#######################################################################
Function arguments and values:
Working directory: /mnt/ceph/stah3621/imputation
Input file: from global environment
Strata: no
Population levels: no
Population labels: no
Output format(s): tidy, genlight
Filename prefix: no
Filters:
Blacklist of individuals: no
Blacklist of genotypes: no
Whitelist of markers: no
monomorphic.out: FALSE
snp.ld: no
common.markers: TRUE
max.marker: no
pop.select: no
maf.thresholds: no

Imputations options:
imputation.method: rf
hierarchical.levels: global

parallel.core: 47

#######################################################################

Importing data

Number of markers missing in all individuals and removed: 1

Tidy genomic data:
    Number of markers: 500
    Number of chromosome/contig/scaffold: 1
    Number of individuals: 94

Preparing data for output

    Data is bi-allelic

#######################################################################
####################### grur::grur_imputations ########################
#######################################################################
Imputation method: rf
Hierarchical levels: global
On-the-fly-imputations options:
    number of trees to grow: 50
    minimum terminal node size: 1
    non-negative integer value used to specify random splitting: 10
    number of iterations: 10
Number of CPUs: 47
Note: If you have speed issues: follow radiator's vignette on parallel computing

Number of populations: 1
Number of individuals: 94
Number of markers: 500

Proportion of missing genotypes before imputations: 0.298319
On-the-fly-imputations using Random Forests algorithm
Imputations computed globally, take a break...
Adjusting REF/ALT alleles to account for imputations...
    generating REF/ALT dictionary
    integrating new genotype codings...

Proportion of missing genotypes after imputations: 0

Computation time: 8 sec
################## grur::grur_imputations completed ###################
Generating adegenet genlight object without imputation
Generating adegenet genlight object WITH imputations

Writing tidy data set:
radiator_data_20190125@1528.rad

Writing tidy data set:
radiator_data_20190125@1528.rad
############################### RESULTS ###############################
Data format of input: genlight
Biallelic data
Number of common markers: 500
Number of chromosome/contig/scaffold: 1
Number of individuals 94

Computation time: 12 sec
################ radiator::genomic_converter completed ################
Warning message:
In radiator::radiator_imputations_module(data = input, imputation.method = imputation.method,  :
  Missing data is still present in the dataset
    2 options:
    run the function again with hierarchical.levels = 'global'
    use common.markers = TRUE when using hierarchical.levels = 'strata'

> anyNA(as.matrix(gc$genlight.imputed))
[1] FALSE
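One way to narrow down where a warning like this comes from is to capture it with its originating call rather than letting R print it at the end. The sketch below uses only base R condition handling. `noisy_step()` is a hypothetical stand-in written for this example, not radiator or grur code, and the same wrapper pattern would need to be put around the real `genomic_converter()` call.

```r
# Hypothetical stand-in for a step that warns when it sees NA.
# This is NOT radiator code; it only mimics the message for illustration.
noisy_step <- function(x) {
  if (anyNA(x)) warning("Missing data is still present in the dataset")
  x
}

caught_messages <- character(0)
caught_calls    <- character(0)

result <- withCallingHandlers(
  noisy_step(c(1, NA, 3)),
  warning = function(w) {
    # Record the message and the call that signalled it
    caught_messages <<- c(caught_messages, conditionMessage(w))
    caught_calls    <<- c(caught_calls, paste(deparse(conditionCall(w)), collapse = " "))
    invokeRestart("muffleWarning")  # continue without printing the warning
  }
)

print(caught_messages)
print(caught_calls)
```

Setting `options(warn = 2)` before the call is a blunter alternative: it turns warnings into errors, so `traceback()` then shows the call stack at the point where the warning was raised.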

From what I can tell, the R and package versions are the same in the important ways:

On my local mac:

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.2

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] bindrcpp_0.2.2        psych_1.8.12          vegan_2.5-3          
 [4] lattice_0.20-38       permute_0.9-4         LEA_2.4.0            
 [7] tidyr_0.8.2           adegenet_2.1.1        ade4_1.7-13          
[10] randomForestSRC_2.8.0 radiator_0.0.21

And on the HPC. Could the locale variables have an impact?

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /opt/modules/devel/R/3.5.1/lib64/R/lib/libRblas.so
LAPACK: /opt/modules/devel/R/3.5.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods
[7] base

other attached packages:
 [1] dplyr_0.7.8           LEA_2.4.0             randomForestSRC_2.8.0
 [4] bindrcpp_0.2.2        psych_1.8.12          vegan_2.5-3
 [7] lattice_0.20-38       permute_0.9-4         tidyr_0.8.2
[10] radiator_0.0.21       adegenet_2.1.1        ade4_1.7-13
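The two attached-package listings can also be compared programmatically instead of by eye. In this minimal sketch the version strings are copied from the two `sessionInfo()` outputs above, and the vector names `local_mac` and `hpc` are just labels chosen for this example. It covers attached packages only, since the namespace-only packages are not listed for either machine.

```r
# Attached package versions copied from the two sessionInfo() listings above.
local_mac <- c(bindrcpp = "0.2.2", psych = "1.8.12", vegan = "2.5-3",
               lattice = "0.20-38", permute = "0.9-4", LEA = "2.4.0",
               tidyr = "0.8.2", adegenet = "2.1.1", ade4 = "1.7-13",
               randomForestSRC = "2.8.0", radiator = "0.0.21")

hpc <- c(dplyr = "0.7.8", LEA = "2.4.0", randomForestSRC = "2.8.0",
         bindrcpp = "0.2.2", psych = "1.8.12", vegan = "2.5-3",
         lattice = "0.20-38", permute = "0.9-4", tidyr = "0.8.2",
         radiator = "0.0.21", adegenet = "2.1.1", ade4 = "1.7-13")

only_local <- setdiff(names(local_mac), names(hpc))  # attached on the Mac only
only_hpc   <- setdiff(names(hpc), names(local_mac))  # attached on the HPC only
shared     <- intersect(names(local_mac), names(hpc))

# Shared packages whose version strings differ between the two machines
mismatched <- shared[local_mac[shared] != hpc[shared]]

print(only_local)
print(only_hpc)
print(mismatched)
```

For a fuller comparison, the same approach could be applied to the `Package` and `Version` columns of `installed.packages()` saved on each machine, which would also cover packages that are loaded via a namespace but not attached.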

thierrygosselin commented 5 years ago

I can't reproduce the error. The imputation module was moved out of radiator. It now resides in grur only, because of a cross-dependency issue when submitting to CRAN. genomic_converter will be added to the grur imputations module in the next release of grur, next week.

thierrygosselin / radiator

missing data is still present after imputation in RF #38