meyer-lab-cshl / plinkQC

R package for quality control of plink genetic datasets

evaluate_check_ancestry fails with numeric IDs #14

Closed Rodin67 closed 5 years ago

Rodin67 commented 5 years ago

Describe the bug: Calling evaluate_check_ancestry or perIndividualQC on a dataset whose sample IDs are all numeric leads to an error.

To Reproduce: Run either function with a .fam file that contains only numeric IDs.

Expected behavior: The functions return the IDs of the samples that fail the ancestry check.

Error messages: "There are samples in the prefixMergedDataset that cannot be found in refSamples or XXX.fam"

Please complete the following information:

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] LC_CTYPE=fr_CA.UTF-8 LC_NUMERIC=C LC_TIME=fr_CA.UTF-8
[4] LC_COLLATE=fr_CA.UTF-8 LC_MONETARY=fr_CA.UTF-8 LC_MESSAGES=fr_CA.UTF-8
[7] LC_PAPER=fr_CA.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=fr_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] plinkQC_0.2.2 forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3 purrr_0.3.2
[6] readr_1.3.1 tidyr_0.8.3 tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.2.1

loaded via a namespace (and not attached):
[1] tidyselect_0.2.5 haven_2.1.1 lattice_0.20-38 colorspace_1.4-1
[5] vctrs_0.2.0 generics_0.0.2 getopt_1.20.3 utf8_1.1.4
[9] rlang_0.4.0 pillar_1.4.2 glue_1.3.1 optparse_1.6.2
[13] withr_2.1.2 tweenr_1.0.1 bit64_0.9-7 modelr_0.1.5
[17] readxl_1.3.1 plyr_1.8.4 munsell_0.5.0 gtable_0.3.0
[21] cellranger_1.1.0 rvest_0.3.4 labeling_0.3 UpSetR_1.4.0
[25] fansi_0.4.0 broom_0.5.2 Rcpp_1.0.2 scales_1.0.0
[29] backports_1.1.4 jsonlite_1.6 farver_1.1.0 bit_1.1-14
[33] gridExtra_2.3 digest_0.6.20 ggforce_0.3.1 hms_0.5.1
[37] stringi_1.4.3 polyclip_1.10-0 grid_3.6.1 cowplot_1.0.0
[41] cli_1.1.0 tools_3.6.1 magrittr_1.5 lazyeval_0.2.2
[45] crayon_1.3.4 pkgconfig_2.0.2 zeallot_0.1.0 MASS_7.3-51.4
[49] data.table_1.12.2 xml2_1.2.2 lubridate_1.7.4 assertthat_0.2.1
[53] httr_1.4.1 rstudioapi_0.10 R6_2.4.0 nlme_3.1-141
[57] compiler_3.6.1

Additional context: Adding something like "mutate_all(samples, .funs = function(x) as.character(x))" to the function, right after the samples data frame is created, works around the problem.
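
The suggested conversion can be sketched in base R as follows. This is only an illustration on a made-up `samples` data frame (the values are hypothetical), not the code inside evaluate_check_ancestry:

```r
# Hypothetical stand-in for the samples data frame built from the .fam file;
# when every ID is numeric, the columns typically end up as integers.
samples <- data.frame(FID = c(26L, 125L, 162L),
                      IID = c(26L, 125L, 162L))

# Base-R equivalent of mutate_all(samples, .funs = as.character):
# convert every column to character while keeping the data.frame structure.
samples[] <- lapply(samples, as.character)

str(samples)
```

Assigning into `samples[]` rather than `samples` keeps the object a data frame, so code that later accesses `samples$IID` continues to work.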

HannahVMeyer commented 5 years ago

Hi, thanks for reporting this.

I have created a minimal example where all IDs in data.fam are numeric, but I cannot reproduce the issue. The following shows the str() output for the data frames within evaluate_check_ancestry:

str(refSamples)
'data.frame':   1184 obs. of  2 variables:
 $ IID: chr  "NA19919" "NA19916" "NA19835" "NA20282" ...
 $ Pop: chr  "ASW" "ASW" "ASW" "ASW" ...

str(samples)
'data.frame':   200 obs. of  2 variables:
 $ FID: int  26 125 162 169 147 152 187 17 153 5 ...
 $ IID: int  26 125 162 169 147 152 187 17 153 5 ...

str(pca_data)
'data.frame':   1384 obs. of  4 variables:
 $ FID: chr  "181" "182" "183" "184" ...
 $ IID: chr  "181" "182" "183" "184" ...
 $ PC1: num  0.006036 0.006859 -0.001683 0.000583 0.006763 ...
 $ PC2: num  0.01439 0.00602 -0.00773 -0.00948 0.00841 ...

If I understand correctly, you suspect that samples$IID being numeric causes your error message? I don't see that happening here. Do you have additional constraints? Could you provide an example dataset where this fails?
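
A quick way to check whether ID types matter for a particular dataset is to compare the two sets of IDs directly. The sketch below uses made-up vectors, not the package internals, and the leading-zero ID is purely an assumed example of how parsing an ID as a number could change it; nothing in this thread confirms that this is what happened:

```r
# Made-up IDs as they might appear in the merged PCA data (character) ...
pca_ids <- c("26", "125", "007")
# ... and the same IDs after a .fam column was parsed as integer.
fam_ids <- c(26L, 125L, 7L)

# %in% coerces both sides to a common type (character here), so plain
# integer IDs still match their string form, but an ID whose text changed
# during parsing ("007" became 7) does not.
matched <- pca_ids %in% fam_ids
print(matched)

# IDs in the PCA data that have no counterpart in the .fam IDs
unmatched <- setdiff(pca_ids, as.character(fam_ids))
print(unmatched)
```

An empty `unmatched` would suggest the IDs agree regardless of type, which is consistent with the minimal example above.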

Thank you!

HannahVMeyer commented 5 years ago

I am closing this now as I cannot reproduce this issue. Feel free to re-open with example data that shows the issue.

Thanks