Use `data.table` package to parse vcf

xiangzhu commented 5 years ago

Currently ldshrink assumes the input genotype/haplotype data are stored in an n-by-p numerical matrix, which is convenient from statisticians' perspective. However, public genotype/haplotype data from 1000 Genomes are stored in vcf format.

In the past I first used vcftools to convert vcf data to IMPUTE2 format (which is indeed a p-by-n matrix), and then transpose IMPUTE2-formatted data in R. See https://github.com/stephenslab/rss/blob/master/misc/import_1000g_vcf.sh.

This two-step workflow is not so convenient (at least for statisticians): they have to learn a new program like vcftools before any LD-related operations in ldshrink.

It seems that now we can use data.table (https://cran.r-project.org/web/packages/data.table) to directly convert vcf data to the n-by-p matrix in R. Here is an example: https://gist.github.com/cfljam/bc762f1d7b412df594ebc4219bac2d2b.

Here is my own example.

> suppressPackageStartupMessages(library(data.table))
> my_dt <- data.table::fread(cmd="zcat B_CELL_NAIVE.vcf.gz")

|--------------------------------------------------|
|==================================================|
>
> dim(my_dt)
[1] 215708754         8
> head(my_dt)
   #CHROM       POS          ID REF ALT QUAL FILTER
1:  chr15 101094127  rs12904576   T   C    .   PASS
2:  chr15 101114640  rs36053285   T   C    .   PASS
3:  chr15  45685616 rs796721871  TA   T    .   PASS
4:  chr15  45670316   rs2114501   G   A    .   PASS
5:  chr15  45618730  rs59889118   G   A    .   PASS
6:  chr15  45620612   rs1980288   T   C    .   PASS
                                                                                                               INFO
1:   Gene=ENSG00000270127.1;GeneSymbol=ENSG00000270127.1;Pvalue=3.714e-27;Beta=-1.10;Statistic=-15.72;FDR=2.681e-20
2:   Gene=ENSG00000270127.1;GeneSymbol=ENSG00000270127.1;Pvalue=2.649e-24;Beta=-1.09;Statistic=-14.16;FDR=3.412e-18
3: Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=2.614e-23;Beta=-1.10;Statistic=-13.63;FDR=3.412e-18
4:  Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.007e-23;Beta=-1.09;Statistic=-13.6;FDR=3.412e-18
5:  Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.64e-23;Beta=-1.09;Statistic=-13.56;FDR=3.412e-18
6:  Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.64e-23;Beta=-1.09;Statistic=-13.56;FDR=3.412e-18

The benefit of using data.table here is two-fold: i) users don't have to leave R and use vcftools to get n-by-p genotype matrix from vcf data; ii) data.table is a well-maintained and constantly-upgraded package that can handle large datasets efficiently (at least based on my past experiences).

Hence, we can either add a wrapper that uses data.table to parse vcf for ldshrink users, or at minimum, we can simply provide a vignette showing how to use data.table to parse vcf.

Finally there exists a package vcfR (https://cran.r-project.org/web/packages/vcfR) that might be relevant (but I have not used it much).

CreRecombinase commented 5 years ago

I've found that the gap between genotype data as it's stored and exchanged (e.g plink,vcf/bcf) and as it's structured for use by ldshrink (the classic nxp "data matrix") can be a deceptively treacherous gap to cross. data.table's fread is a great tool for reading csv-like data, and the vcf is certainly csv-like, but there are a couple nasty edge cases that can make parsing vcfs "by hand" a real pain (handling phasing, handling non bi-allelic sites, reading subsets of the data, etc). I think including examples where the data is read from vcf/plink/gds etc. would be super valuable. I think by adding documentation/examples of converting files to "data matrices" we can help out folks not familiar with the ins and outs of stat-gen file formats. I'm somewhat less comfortable providing this functionality as a part of the package as there are already packages like VcfArray, vcfR and snpStats that offer this functionality (and do a better job than I could).

xiangzhu commented 5 years ago

@CreRecombinase yes, I totally agree it is better to just have "documentation/examples of converting files to data matrices". However, these examples should show people how to convert file formats within R, not like what I did before by using vcftools. I think data.table allows us to do this, at least for vcf data.

I think having file format converting wrapper is conceptually ideal, but practically challenging -- vcf is just one commonly used file format for genotype/haplotype data -- if we write a wrapper for vcf, probably we will be asked to provide wrappers for the other formats like bgen (UKBB format, https://www.well.ox.ac.uk/~gav/bgen_format/index.html).

xiangzhu commented 5 years ago

@CreRecombinase one thing related to this thread, what other genotype file formats you have worked on so far? I wonder if you could list them here as follows:

1000 Genomes Phase 1: vcf
1000 Genomes Phase 3: vcf

CreRecombinase commented 5 years ago

I think "Phase" in "Phase 1" and "Phase 3" refer to phases of the 1000 genomes project. I would consider both of those "VCF" (although the version of the VCF standard you might come across for Phase 1 data might be different than that for Phase 3 data). There are two other formats that I come across with any real frequency:

Plink/PED, plink's binary file format is maybe slightly less common than vcf, but still quite common.
BCF or Binary VCF. Supposed to have better performance characteristics compared to VCF (also supported by bcftools, the successor to vcftools)

After those two I'd put the formats that are what you might call "intermediate maturity":

gds is a an excellent file format used by the equally excellent, well documented SeqArray/SNPRelate R packages. It doesn't seem to have wide adoption outside the University of Washington.
The python package scikit-allel has an HDF5-based format that's also great. I don't think anyone has written an R wrapper to the library (although it should be pretty easy given how cross-platform HDF5is; in theory it should be easy to read HDF5 data in pretty much any language)
bgen is the format used for the UK-biobank genotypes. It looks like a pretty cool format, but seems to be the effort of one person. There is an R package for working with bgen data but it's pretty primitive. I believe plink also supports bgen.

After that it's a pretty long tail. To name a few:

SNP2H5
hail
Tomahawk / Tachyon
.gen
.gvcf
My own SeqSupport/EigenH5

xiangzhu commented 5 years ago

@CreRecombinase Thanks for sharing this! Now I agree more that we better stick to an "n x p" genotype data matrix as ldshrink input, and then provide a list of vignettes showing how to prepare this data matrix from different file formats ....

stephenslab / ldshrink

Use `data.table` package to parse vcf #31