stephenslab / ldshrink

ldshrink: a one-stop R package for shrinkage estimation of linkage disequilibrium

Use `data.table` package to parse vcf #31

Open · xiangzhu opened this issue 5 years ago

xiangzhu commented 5 years ago

Currently ldshrink assumes the input genotype/haplotype data are stored in an n-by-p numerical matrix, which is convenient from a statistician's perspective. However, public genotype/haplotype data from the 1000 Genomes Project are distributed in vcf format.

In the past I first used vcftools to convert vcf data to IMPUTE2 format (which is in fact a p-by-n matrix), and then transposed the IMPUTE2-formatted data in R. See https://github.com/stephenslab/rss/blob/master/misc/import_1000g_vcf.sh.

This two-step workflow is not so convenient (at least for statisticians): users have to learn a new program like vcftools before they can perform any LD-related operations in ldshrink.

It seems that we can now use data.table (https://cran.r-project.org/web/packages/data.table) to read vcf data and build the n-by-p matrix directly in R. Here is an example: https://gist.github.com/cfljam/bc762f1d7b412df594ebc4219bac2d2b.

Here is my own example.

```r
> suppressPackageStartupMessages(library(data.table))
> my_dt <- data.table::fread(cmd="zcat B_CELL_NAIVE.vcf.gz")
> dim(my_dt)
[1] 215708754         8
> head(my_dt)
   #CHROM       POS          ID REF ALT QUAL FILTER
1:  chr15 101094127  rs12904576   T   C    .   PASS
2:  chr15 101114640  rs36053285   T   C    .   PASS
3:  chr15  45685616 rs796721871  TA   T    .   PASS
4:  chr15  45670316   rs2114501   G   A    .   PASS
5:  chr15  45618730  rs59889118   G   A    .   PASS
6:  chr15  45620612   rs1980288   T   C    .   PASS
                                                                                                               INFO
1:   Gene=ENSG00000270127.1;GeneSymbol=ENSG00000270127.1;Pvalue=3.714e-27;Beta=-1.10;Statistic=-15.72;FDR=2.681e-20
2:   Gene=ENSG00000270127.1;GeneSymbol=ENSG00000270127.1;Pvalue=2.649e-24;Beta=-1.09;Statistic=-14.16;FDR=3.412e-18
3: Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=2.614e-23;Beta=-1.10;Statistic=-13.63;FDR=3.412e-18
4:  Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.007e-23;Beta=-1.09;Statistic=-13.6;FDR=3.412e-18
5:  Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.64e-23;Beta=-1.09;Statistic=-13.56;FDR=3.412e-18
6:  Gene=ENSG00000171766.11;GeneSymbol=ENSG00000171766.11;Pvalue=3.64e-23;Beta=-1.09;Statistic=-13.56;FDR=3.412e-18
```
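To go from the `fread()` result above to the n-by-p matrix that ldshrink expects, a bit of reshaping is still needed (the file above happens to carry only the fixed vcf columns; a genotype vcf would have one extra column per sample). Here is a minimal sketch, assuming a hypothetical biallelic genotype file `example.vcf.gz` whose sample columns hold plain GT fields like `0|1`; real vcfs need more care, as discussed below.

```r
library(data.table)

# Read the vcf body, using the "#CHROM" header line for column names
# (fread's skip= accepts a search string).
vcf <- fread(cmd = "zcat example.vcf.gz", skip = "#CHROM")

# The sample (genotype) columns are everything after the fixed vcf columns.
fixed_cols <- c("#CHROM", "POS", "ID", "REF", "ALT",
                "QUAL", "FILTER", "INFO", "FORMAT")
gt_cols <- setdiff(names(vcf), fixed_cols)

# Count ALT alleles per call: "0|0" -> 0, "0/1" -> 1, "1|1" -> 2.
count_alt <- function(g) {
  vapply(strsplit(g, "[|/]"), function(a) sum(a == "1"), numeric(1))
}

# fread() gives variants in rows; applying count_alt() to each sample
# column yields a p-by-n numeric matrix, so transpose to get n-by-p.
dosage <- vapply(vcf[, ..gt_cols], count_alt, numeric(nrow(vcf)))
geno <- t(dosage)
dim(geno)  # n samples by p variants
```

Note this only covers the easy case (biallelic sites, GT-only sample fields, no missing calls).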

The benefit of using data.table here is two-fold: i) users don't have to leave R and run vcftools to get an n-by-p genotype matrix from vcf data; ii) data.table is a well-maintained, actively developed package that can handle large datasets efficiently (at least in my experience).

Hence, we can either add a wrapper that uses data.table to parse vcf for ldshrink users, or, at a minimum, provide a vignette showing how to use data.table to parse vcf.

Finally, there is a package vcfR (https://cran.r-project.org/web/packages/vcfR) that might be relevant (though I have not used it much).

CreRecombinase commented 5 years ago

I've found that the gap between genotype data as it's stored and exchanged (e.g. plink, vcf/bcf) and as it's structured for use by ldshrink (the classic n-by-p "data matrix") can be deceptively treacherous to cross. data.table's fread is a great tool for reading csv-like data, and vcf is certainly csv-like, but there are a couple of nasty edge cases that can make parsing vcfs "by hand" a real pain (handling phasing, handling non-biallelic sites, reading subsets of the data, etc.). I think including examples where the data is read from vcf/plink/gds etc. would be super valuable: by adding documentation/examples of converting files to "data matrices" we can help out folks not familiar with the ins and outs of stat-gen file formats. I'm somewhat less comfortable providing this functionality as part of the package, as there are already packages like VCFArray, vcfR and snpStats that offer this functionality (and do a better job than I could). A sketch of the vcfR route follows.
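For comparison with the hand-rolled data.table approach above, here is a minimal sketch of the vcfR route, again assuming a hypothetical `example.vcf.gz`; vcfR takes care of the header and FORMAT parsing that is painful to do by hand:

```r
library(vcfR)

v <- read.vcfR("example.vcf.gz")

# Drop non-biallelic sites, the case an n-by-p 0/1/2 matrix assumes.
v <- v[is.biallelic(v), ]

# extract.gt() returns a p-by-n character matrix of GT strings ("0|1", ...),
# already stripped of the other FORMAT fields.
gt <- extract.gt(v, element = "GT")

# Count ALT alleles per call (phased "|" and unphased "/" separators are
# both handled), then transpose to n-by-p.
dosage <- apply(gt, c(1, 2), function(g) {
  if (is.na(g)) NA_real_ else sum(strsplit(g, "[|/]")[[1]] == "1")
})
geno <- t(dosage)
```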

xiangzhu commented 5 years ago

@CreRecombinase yes, I totally agree it is better to just have "documentation/examples of converting files to data matrices". However, these examples should show people how to convert file formats within R, not the way I did it before with vcftools. I think data.table lets us do this, at least for vcf data.

I think having a file-format conversion wrapper is conceptually ideal but practically challenging: vcf is just one commonly used format for genotype/haplotype data, and if we write a wrapper for vcf, we will probably be asked to provide wrappers for other formats like bgen (the UK Biobank format, https://www.well.ox.ac.uk/~gav/bgen_format/index.html).

xiangzhu commented 5 years ago

@CreRecombinase one thing related to this thread: what other genotype file formats have you worked with so far? I wonder if you could list them here.

CreRecombinase commented 5 years ago

I think "Phase" in "Phase 1" and "Phase 3" refer to phases of the 1000 genomes project. I would consider both of those "VCF" (although the version of the VCF standard you might come across for Phase 1 data might be different than that for Phase 3 data). There are two other formats that I come across with any real frequency:

After those two I'd put the formats that are what you might call "intermediate maturity":

After that it's a pretty long tail. To name a few:

xiangzhu commented 5 years ago

@CreRecombinase Thanks for sharing this! Now I agree even more that we had better stick to an n-by-p genotype data matrix as the ldshrink input, and then provide a set of vignettes showing how to prepare this data matrix from different file formats.
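For instance, a vignette for plink bed/bim/fam data could lean on snpStats (mentioned above). A minimal sketch, assuming hypothetical files `mydata.bed` / `mydata.bim` / `mydata.fam`:

```r
library(snpStats)  # Bioconductor

# read.plink() infers the .bim and .fam names from the .bed path.
pl <- read.plink("mydata.bed")

# pl$genotypes is already samples-in-rows: an n-by-p SnpMatrix.
# Coerce to a plain numeric dosage matrix (0/1/2, with NA for missing).
geno <- as(pl$genotypes, "numeric")
dim(geno)  # n samples by p variants
```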