kausmees / GenoCAE

Convolutional autoencoder for genotype data
BSD 3-Clause "New" or "Revised" License

Convert EIGENSTRAT files to PLINK format #15

Closed richelbilderbeek closed 3 years ago

richelbilderbeek commented 3 years ago

Here I have converted the EIGENSTRAT example files to PLINK format to fix #11 and #13.

The R script I used is below and also included within the commit history.

devtools::install_github("uqrmaie1/admixtools")

# Copy files to .geno and .ind and .snp
file.copy("HumanOrigins249_tiny.snp", "copy.snp")
file.copy("HumanOrigins249_tiny.fam", "copy.ind")
file.copy("HumanOrigins249_tiny.eigenstratgeno", "copy.geno")

admixtools::eigenstrat_to_plink(
  inpref = "copy",
  outpref = "plink",
  verbose = TRUE
)

file.remove("copy.snp")
file.remove("copy.ind")
file.remove("copy.geno")

kausmees commented 3 years ago

Great! I'm just thinking it may be better to name the files HumanOrigins249_tiny{.bed/.bim/.fam} so it's clear what data set they actually contain (and that it's the same data set that is in EIGENSTRAT format).

richelbilderbeek commented 3 years ago

I'm just thinking it may be better to name the files HumanOrigins249_tiny{.bed/.bim/.fam}

@kausmees I understand that idea!

However, I am worried about naming conflicts between the two .fam files and just checked:

richel@N141CU:~/GitHubs/GenoCAE/example_tiny$ head HumanOrigins249_tiny.fam plink.fam 
==> HumanOrigins249_tiny.fam <==
BantuKenya HGDP01405 0 0 0 1
BantuKenya HGDP01408 0 0 0 1
BantuKenya HGDP01414 0 0 0 1
BantuKenya HGDP01417 0 0 0 1
BantuKenya HGDP01418 0 0 0 1
Biaka HGDP00454 0 0 0 1
Biaka HGDP00455 0 0 0 1
Biaka HGDP00457 0 0 0 1
Biaka HGDP00458 0 0 0 1
Biaka HGDP00459 0 0 0 1

==> plink.fam <==
BantuKenya  BantuKenya  0   0   HGDP01405   -9
BantuKenya  BantuKenya  0   0   HGDP01408   -9
BantuKenya  BantuKenya  0   0   HGDP01414   -9
BantuKenya  BantuKenya  0   0   HGDP01417   -9
BantuKenya  BantuKenya  0   0   HGDP01418   -9
Biaka   Biaka   0   0   HGDP00454   -9
Biaka   Biaka   0   0   HGDP00455   -9
Biaka   Biaka   0   0   HGDP00457   -9
Biaka   Biaka   0   0   HGDP00458   -9
Biaka   Biaka   0   0   HGDP00459   -9

Apparently, the .fam files are different.

If you still think using the same HumanOrigins249_tiny prefix is better, I'll change it.
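The difference between the two listings can be made explicit by checking which column holds the sample ID. In the PLINK .fam layout the six columns are family ID, individual ID, paternal ID, maternal ID, sex and phenotype, so the `HGDP…` identifier belongs in column 2. The sketch below reuses two rows from each listing above. The temporary file names and the `fam_layout` helper are illustrative, not part of the repository.

```shell
#!/bin/sh
# Sketch: report the field count and the column holding the sample ID
# for the first row of each .fam file. File names are illustrative.
tmpdir=$(mktemp -d)

# Rows copied from HumanOrigins249_tiny.fam as quoted in this thread.
cat > "$tmpdir/original.fam" <<'EOF'
BantuKenya HGDP01405 0 0 0 1
Biaka HGDP00454 0 0 0 1
EOF

# Rows copied from the generated plink.fam as quoted in this thread.
cat > "$tmpdir/converted.fam" <<'EOF'
BantuKenya  BantuKenya  0  0  HGDP01405  -9
Biaka  Biaka  0  0  HGDP00454  -9
EOF

# Print "<number of fields> <column of the first HGDP token>" for line 1.
fam_layout() {
  awk 'NR == 1 {
    col = 0
    for (i = 1; i <= NF; i++) if ($i ~ /^HGDP/ && col == 0) col = i
    print NF, col
  }' "$1"
}

original_layout=$(fam_layout "$tmpdir/original.fam")
converted_layout=$(fam_layout "$tmpdir/converted.fam")

echo "original:  $original_layout"
echo "converted: $converted_layout"

rm -r "$tmpdir"
```

For these rows the sample ID is in column 2 of the original file but in column 5 (the sex column) of the converted file, which suggests the converted file is not a valid replacement for the original.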

richelbilderbeek commented 3 years ago

Uh, maybe the PLINK file conversion simply went sideways :confused:. There was a warning when I ran the R script. I'll investigate :monocle_face:
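One possible cause, offered as an assumption rather than a confirmed diagnosis, is that the six-column .fam file was copied unchanged to `copy.ind`, while an EIGENSTRAT .ind file normally has three columns: sample ID, sex (M/F/U) and population label. A minimal sketch of deriving such a file with `awk` follows. It assumes that the first .fam column carries the population label, as in the rows quoted earlier in the thread. The file names are hypothetical.

```shell
#!/bin/sh
# Sketch: derive a three-column EIGENSTRAT-style .ind file from a six-column .fam file.
# File names are hypothetical; the rows are taken from this thread.
tmpdir=$(mktemp -d)

cat > "$tmpdir/sample.fam" <<'EOF'
BantuKenya HGDP01405 0 0 0 1
Biaka HGDP00454 0 0 0 1
EOF

# .fam columns: family ID, individual ID, father, mother, sex (1/2/other), phenotype.
# .ind columns: individual ID, sex (M/F/U), population label.
# Assumption: the family ID column holds the population label.
awk '{
  sex = "U"
  if ($5 == 1) sex = "M"
  if ($5 == 2) sex = "F"
  print $2, sex, $1
}' "$tmpdir/sample.fam" > "$tmpdir/sample.ind"

ind_first_line=$(head -n 1 "$tmpdir/sample.ind")
cat "$tmpdir/sample.ind"

rm -r "$tmpdir"
```

Whether admixtools accepts the result without further changes has not been checked here.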

kausmees commented 3 years ago

I was able to find some old conversion scripts and did the conversion :) If you're interested, I used the convertf program from the makers of EIGENSTRAT (https://reich.hms.harvard.edu/software/InputFileFormats). I added the files in commit c41065b.
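For readers unfamiliar with convertf: it is driven by a parameter file passed with `-p`, and `outputformat: PACKEDPED` requests PLINK binary output (.bed/.bim/.fam). The sketch below is not the script used for the commit above. All file names in it are assumptions, in particular the `.ind` input and the `plink_out/` output directory, which is used here only to keep the generated .fam from overwriting the existing one.

```shell
#!/bin/sh
# Sketch: write a convertf parameter file for EIGENSTRAT -> PLINK binary output.
# All file names are assumptions; adjust them to the actual data set.
parfile=$(mktemp)

cat > "$parfile" <<'EOF'
genotypename:    HumanOrigins249_tiny.eigenstratgeno
snpname:         HumanOrigins249_tiny.snp
indivname:       HumanOrigins249_tiny.ind
outputformat:    PACKEDPED
genotypeoutname: plink_out/HumanOrigins249_tiny.bed
snpoutname:      plink_out/HumanOrigins249_tiny.bim
indivoutname:    plink_out/HumanOrigins249_tiny.fam
EOF

# Run the conversion only if convertf is installed and the input exists.
if command -v convertf >/dev/null 2>&1 && [ -f HumanOrigins249_tiny.eigenstratgeno ]; then
  mkdir -p plink_out
  convertf -p "$parfile"
else
  echo "convertf or input files not found; parameter file written to $parfile"
fi
```

Writing the output to a separate directory is one way to keep the shared HumanOrigins249_tiny prefix without clobbering the original .fam file.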

richelbilderbeek commented 3 years ago

@kausmees thanks so much! Boy, I will enjoy your work right away; a great start to the day :-)