uqrmaie1 / admixtools

https://uqrmaie1.github.io/admixtools
Behavior affected by other R packages? #62

Closed diegovelizo closed 8 months ago

diegovelizo commented 8 months ago

Hi,

I have two conda environments with admixtools2 installed. F2 statistics calculated by one installation cannot be loaded by the other. Specifically, the problem seems to be in how the order of the two populations in each pairwise comparison is determined.

For instance, calculating the F2 statistics with one conda installation and trying to read the results from a different one results in this error:

  >f2_from_precomp(f2_dir, pops = kept_pops, fst = FALSE) 
  Error in read_f2(dir, pops, pops2, type = type, remove_na = remove_na,  : 
  File /path/to/data/f2/Mbuti/MXL_f2.rds not found! You may have to recompute the f-statistics!

The F2 statistics between those two populations have actually been calculated, but the file path is .../MXL/Mbuti_f2.rds (i.e. the order of the first and second population in the comparison is swapped).
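
One way to confirm which ordering is actually on disk is to check both candidate paths. This is only a minimal sketch: `find_f2_file` is a hypothetical helper, not part of admixtools, and it assumes the `<f2_dir>/<pop1>/<pop2>_f2.rds` layout seen in the error message above. The demo writes a placeholder file to a temporary directory rather than touching real data.

```r
# Hypothetical helper (not an admixtools function): return whichever ordering
# of a population pair exists on disk, assuming the layout
# <f2_dir>/<pop1>/<pop2>_f2.rds taken from the error message.
find_f2_file <- function(f2_dir, pop_a, pop_b) {
  candidates <- c(
    file.path(f2_dir, pop_a, paste0(pop_b, "_f2.rds")),
    file.path(f2_dir, pop_b, paste0(pop_a, "_f2.rds"))
  )
  candidates[file.exists(candidates)]
}

# Demo on a throwaway directory with a placeholder file
demo_dir <- file.path(tempdir(), "f2_demo")
dir.create(file.path(demo_dir, "MXL"), recursive = TRUE, showWarnings = FALSE)
saveRDS(matrix(0), file.path(demo_dir, "MXL", "Mbuti_f2.rds"))

found <- find_f2_file(demo_dir, "Mbuti", "MXL")
print(found)  # the MXL/Mbuti_f2.rds path, whichever order the populations are given in
```

If the only file found has the populations in the opposite order from the one in the error message, the two environments disagree on how the names sort.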

I'm guessing it is related to how strings are sorted, which I think might be affected by functions that are imported from other packages.

Thanks, Diego

R sessions info:

  1. Conda environment that successfully reads the F2 statistics (this is the one used to calculate them):
> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /.../conda/envs/ggtree/lib/libopenblasp-r0.3.25.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: ...
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] igraph_1.5.1     phangorn_2.11.1  ape_5.7-1        lubridate_1.9.3 
 [5] forcats_1.0.0    stringr_1.5.1    dplyr_1.1.4      purrr_1.0.2     
 [9] readr_2.1.4      tidyr_1.3.0      tibble_3.2.1     ggplot2_3.4.4   
[13] tidyverse_2.0.0  admixtools_2.0.0

loaded via a namespace (and not attached):
 [1] utf8_1.2.4       generics_0.1.3   stringi_1.8.2    lattice_0.22-5  
 [5] hms_1.1.3        digest_0.6.33    magrittr_2.0.3   grid_4.3.2      
 [9] timechange_0.2.0 iterators_1.0.14 foreach_1.5.2    Matrix_1.6-4    
[13] fansi_1.0.5      scales_1.3.0     codetools_0.2-19 abind_1.4-5     
[17] cli_3.6.1        rlang_1.1.2      crayon_1.5.2     munsell_0.5.0   
[21] withr_2.5.2      tools_4.3.2      parallel_4.3.2   tzdb_0.4.0      
[25] colorspace_2.1-0 fastmatch_1.1-4  vctrs_0.6.5      R6_2.5.1        
[29] lifecycle_1.0.4  pkgconfig_2.0.3  pillar_1.9.0     gtable_0.3.4    
[33] glue_1.6.2       Rcpp_1.0.11      tidyselect_1.2.0 nlme_3.1-164    
[37] compiler_4.3.2   quadprog_1.5-8  
  2. Conda environment that fails to read the F2 statistics:
> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux

Matrix products: default
BLAS/LAPACK: /.../conda/envs/admixtools/lib/libopenblasp-r0.3.24.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: ...
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] igraph_1.5.1     phangorn_2.11.1  ape_5.7-1        lubridate_1.9.3
 [5] forcats_1.0.0    stringr_1.5.0    dplyr_1.1.3      purrr_1.0.2
 [9] readr_2.1.4      tidyr_1.3.0      tibble_3.2.1     ggplot2_3.4.4
[13] tidyverse_2.0.0  admixtools_2.0.0

loaded via a namespace (and not attached):
 [1] utf8_1.2.3       generics_0.1.3   stringi_1.7.12   lattice_0.21-9
 [5] hms_1.1.3        digest_0.6.33    magrittr_2.0.3   grid_4.3.1
 [9] timechange_0.2.0 iterators_1.0.14 foreach_1.5.2    Matrix_1.6-1.1
[13] fansi_1.0.5      scales_1.2.1     codetools_0.2-19 abind_1.4-5
[17] cli_3.6.1        rlang_1.1.1      crayon_1.5.2     munsell_0.5.0
[21] withr_2.5.1      tools_4.3.1      parallel_4.3.1   tzdb_0.4.0
[25] colorspace_2.1-0 fastmatch_1.1-4  vctrs_0.6.3      R6_2.5.1
[29] lifecycle_1.0.3  pkgconfig_2.0.3  pillar_1.9.0     gtable_0.3.4
[33] glue_1.6.2       Rcpp_1.0.11      tidyselect_1.2.0 nlme_3.1-163
[37] compiler_4.3.1   quadprog_1.5-8

uqrmaie1 commented 8 months ago

Yes, the lexicographic ordering of the population names can depend on system settings, and this can cause problems. I thought I had fixed this a while ago, but maybe that fix doesn't always work. I made another change now which hopefully fixes the problem.

Another thing you could try without updating Admixtools 2 is to switch between different string orderings in R with Sys.setlocale('LC_COLLATE', 'C') and Sys.setlocale('LC_COLLATE', 'en_US.UTF-8').

That should change what you get when you run sort(c('MXL', 'Mbuti')), but I would only expect that to affect how the files are read in fairly old versions of Admixtools 2.
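
The effect of the collation setting can be checked directly in base R. The sketch below is illustrative only and does not show how Admixtools 2 orders populations internally. It assumes the named locales may not be installed on a given system, so it reports that case instead of failing. `sort(..., method = "radix")` is shown for comparison because base R documents it as using C-locale order for character vectors regardless of `LC_COLLATE`.

```r
pops <- c("MXL", "Mbuti")

# Radix sort ignores LC_COLLATE and compares in C-locale (byte) order,
# so uppercase "X" sorts before lowercase "b".
c_order <- sort(pops, method = "radix")
print(c_order)  # "MXL" "Mbuti"

# The default sort for character vectors follows the current LC_COLLATE setting.
old_collate <- Sys.getlocale("LC_COLLATE")
for (loc in c("C", "en_US.UTF-8")) {
  # Sys.setlocale() returns "" (with a warning) if the locale is unavailable
  ok <- suppressWarnings(Sys.setlocale("LC_COLLATE", loc))
  if (nzchar(ok)) {
    cat(loc, ":", sort(pops), "\n")
  } else {
    cat(loc, ": locale not available on this system\n")
  }
}

# Restore the original collation setting
invisible(Sys.setlocale("LC_COLLATE", old_collate))
```

If the two locales print the names in different orders, a session started under one collation would look for the pair under a different directory name than a session started under the other.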

diegovelizo commented 8 months ago

Thanks a lot for your reply!

I will close the issue since this is an already known behavior.