xiangzhu opened 5 years ago
I've found that the gap between genotype data as it's stored and exchanged (e.g. `plink`, `vcf`/`bcf`) and as it's structured for use by `ldshrink` (the classic n x p "data matrix") can be a deceptively treacherous one to cross. `data.table`'s `fread` is a great tool for reading csv-like data, and `vcf` is certainly csv-like, but there are a couple of nasty edge cases that can make parsing vcfs "by hand" a real pain (handling phasing, handling non-bi-allelic sites, reading subsets of the data, etc.). I think including examples where the data is read from `vcf`/`plink`/`gds` etc. would be super valuable. By adding documentation/examples of converting files to "data matrices" we can help out folks who aren't familiar with the ins and outs of stat-gen file formats. I'm somewhat less comfortable providing this functionality as part of the package, as there are already packages like `VcfArray`, `vcfR` and `snpStats` that offer it (and do a better job than I could).
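For instance, here is a minimal base-R illustration of why the naive split-and-count approach needs care (the genotype strings are made up but follow the VCF spec):

```r
## Toy calls covering the edge cases above: phased, unphased, FORMAT
## extras, missing, and multi-allelic.
gt <- c("0|1", "0/1", "1|1:40:99", "./.", "1/2")

gt_clean <- sub(":.*", "", gt)          # drop ":"-separated FORMAT extras
alleles  <- strsplit(gt_clean, "[|/]")  # both "|" (phased) and "/" (unphased) occur
dosage   <- vapply(alleles, function(a) sum(as.integer(a)), numeric(1))
dosage   # 1 1 2 NA 3  (coercing "." warns and yields NA; the 3 comes from the
         # multi-allelic "1/2" call, which a bi-allelic dosage can't represent)
```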
@CreRecombinase yes, I totally agree it is better to just have "documentation/examples of converting files to data matrices". However, these examples should show people how to convert file formats within R, rather than the way I did it before with `vcftools`. I think `data.table` allows us to do this, at least for vcf data.
I think having a file-format-conversion wrapper is conceptually ideal, but practically challenging: vcf is just one commonly used file format for genotype/haplotype data, and if we write a wrapper for vcf, we will probably be asked to provide wrappers for other formats like bgen (the UK Biobank format, https://www.well.ox.ac.uk/~gav/bgen_format/index.html).
@CreRecombinase one thing related to this thread: what other genotype file formats have you worked with so far? I wonder if you could list them here as follows:
I think "Phase" in "Phase 1" and "Phase 3" refer to phases of the 1000 genomes project. I would consider both of those "VCF" (although the version of the VCF standard you might come across for Phase 1 data might be different than that for Phase 3 data). There are two other formats that I come across with any real frequency:
plink
's binary file format is maybe slightly less common than vcf
, but still quite common.bcftools
, the successor to vcftools
)After those two I'd put the formats that are what you might call "intermediate maturity":
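As an aside, getting from the plink binary fileset to a numeric matrix is already a one-liner via `snpStats`; a minimal sketch (the `mydata` fileset prefix is hypothetical):

```r
## A minimal sketch, assuming mydata.bed/.bim/.fam exist on disk and
## snpStats (Bioconductor) is installed.
library(snpStats)

p <- read.plink("mydata")        # locates mydata.bed, mydata.bim, mydata.fam
X <- as(p$genotypes, "numeric")  # n x p matrix of 0/1/2 allele counts; NA = missing
```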
After those two I'd put the formats that are what you might call "intermediate maturity":

- `gds` is an excellent file format used by the equally excellent, well-documented `SeqArray`/`SNPRelate` R packages. It doesn't seem to have wide adoption outside the University of Washington.
- `scikit-allel` has an `HDF5`-based format that's also great. I don't think anyone has written an `R` wrapper for the library, although it should be pretty easy given how cross-platform `HDF5` is; in theory `HDF5` data can be read in pretty much any language (see the `rhdf5` sketch after the next list).
- `bgen` is the format used for the UK Biobank genotypes. It looks like a pretty cool format, but seems to be the effort of one person. There is an R package for working with `bgen` data, but it's pretty primitive. I believe `plink` also supports `bgen`.

After that it's a pretty long tail. To name a few:
- `.gen`
- `.gvcf`
- `SeqSupport`/`EigenH5`
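On the `scikit-allel` point above, a minimal sketch of reading its HDF5 output from R with Bioconductor's `rhdf5` (the file name is hypothetical, and the `calldata/GT` layout is an assumption based on what `allel.vcf_to_hdf5()` writes; check with `h5ls()` first):

```r
## A minimal sketch, assuming genos.h5 stores diploid bi-allelic calls in
## a "calldata/GT" dataset (variants x samples x ploidy on the HDF5 side).
library(rhdf5)

h5ls("genos.h5")                         # inspect the actual layout first
gt <- h5read("genos.h5", "calldata/GT")  # rhdf5 reverses HDF5 dimension order,
                                         # so this is ploidy x samples x variants
X  <- gt[1, , ] + gt[2, , ]              # ALT-allele dosage, samples x variants (n x p)
X[X < 0] <- NA                           # scikit-allel codes missing alleles as -1
```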
@CreRecombinase Thanks for sharing this! Now I agree more that we'd better stick to an "n x p" genotype data matrix as the `ldshrink` input, and then provide a list of vignettes showing how to prepare this data matrix from different file formats...
Currently `ldshrink` assumes the input genotype/haplotype data are stored in an n-by-p numerical matrix, which is convenient from a statistician's perspective. However, public genotype/haplotype data from the 1000 Genomes project are stored in vcf format.

In the past I first used `vcftools` to convert vcf data to `IMPUTE2` format (which is indeed a p-by-n matrix), and then transposed the `IMPUTE2`-formatted data in R; see https://github.com/stephenslab/rss/blob/master/misc/import_1000g_vcf.sh. This two-step workflow is not so convenient (at least for statisticians): they have to learn a new program like `vcftools` before doing any LD-related operations in `ldshrink`.

It seems that now we can use `data.table` (https://cran.r-project.org/web/packages/data.table) to directly convert vcf data to the n-by-p matrix in R. Here is an example: https://gist.github.com/cfljam/bc762f1d7b412df594ebc4219bac2d2b. Here is my own example.
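A minimal sketch along these lines (assuming an uncompressed VCF restricted to bi-allelic SNPs; the file name and the `dosage` helper are illustrative, not `ldshrink` code):

```r
## A minimal one-step vcf -> n-by-p conversion with data.table.
library(data.table)

vcf <- fread("1000g_chr22.vcf", skip = "#CHROM")  # jump past the "##" meta lines
sample_cols <- names(vcf)[-(1:9)]                 # columns 1..9 are the fixed VCF fields
vcf <- vcf[!grepl(",", ALT)]                      # drop multi-allelic records

## "0|1"-style calls -> ALT-allele dosage in {0, 1, 2}; NA for missing calls
dosage <- function(gt) {
  gt <- sub(":.*", "", gt)       # drop ":"-separated FORMAT extras
  a  <- tstrsplit(gt, "[|/]")    # phased and unphased separators
  as.integer(a[[1]]) + as.integer(a[[2]])
}
G <- vcf[, lapply(.SD, dosage), .SDcols = sample_cols]

X <- t(as.matrix(G))             # n-by-p: one row per sample, one column per variant
```

The `skip = "#CHROM"` argument is what lets `fread` treat the VCF like a csv: it jumps past the `##` meta-header and uses the `#CHROM` line as column names.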
The benefit of using `data.table` here is two-fold: i) users don't have to leave R and use `vcftools` to get an n-by-p genotype matrix from vcf data; ii) `data.table` is a well-maintained and constantly upgraded package that can handle large datasets efficiently (at least based on my past experience).

Hence, we can either add a wrapper that uses `data.table` to parse vcf for `ldshrink` users, or, at a minimum, simply provide a vignette showing how to use `data.table` to parse vcf.

Finally, there exists a package `vcfR` (https://cran.r-project.org/web/packages/vcfR) that might be relevant (but I have not used it much).
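For completeness, a minimal sketch of what the `vcfR` route might look like (untested on my end; the file name is hypothetical, and only plain bi-allelic diploid calls are mapped, with everything else becoming NA):

```r
## A minimal sketch using vcfR's read.vcfR() and extract.gt().
library(vcfR)

v  <- read.vcfR("chr22.vcf.gz")      # vcfR reads gzipped VCFs directly
gt <- extract.gt(v, element = "GT")  # p x n character matrix of GT strings

map <- c("0/0" = 0, "0|0" = 0,       # unphased and phased spellings
         "0/1" = 1, "1/0" = 1, "0|1" = 1, "1|0" = 1,
         "1/1" = 2, "1|1" = 2)
X <- t(matrix(map[gt], nrow = nrow(gt), dimnames = dimnames(gt)))  # n x p dosages
```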