Read specific region of VCF

kdm9 commented 6 years ago

Hi

Thanks for an awesome package. We work on a series massive low-coverage WGS dataset (>100 million sites, ~1000 samples), which (from what I can tell) prevents in-memory computation on the whole dataset. We need to read both GT, PL, and AD fields for each genotype, and unfortunately other R packages with support for reading region subsets of a VCF (e.g. PopGenome, bigsnpr) don't support this. It would be great if vcfR supported this. Is there any chance that vcfR could provide this functionality?

Thanks heaps, Kevin

knausb commented 6 years ago

Hi Kevin,

I believe we already have this functionality. Please see the man page.

?vcfR::read.vcfR

The parameters nrows and skip should help you select rows (variants). The parameter cols should allow you to specify columns (samples). Please give it a try and let me know if it was what you were looking for.

Brian

kdm9 commented 6 years ago

As far as I can tell, one can't extract a specific region query using this approach. I can take a given number of variants from a certain position in the VCF file (e.g. the 100k-200k'th variants), but not a chromosome segment (Chrom. 2, from position 100kbp-200kbp). The windowed analyses I perform require the latter.

knausb commented 6 years ago

You are correct. You can read a portion of the file in to RAM but the coordinates are in rows and columns. If you are interested in a specific region you'll have to make a guess as to where it may be, read that portion in (plus a little extra to be sure you get what you want), and subset to your region of interest using getCHROM(), getPOS(), and the square brackets []. I advocate subsetting your data to one chromosome per file. If you're interested in chromosome 5 and your data are sorted in one file you'll have to go through chromosomes 1:4 to get to 5. If you have them in separate files you can go straight to the chromosome of interest. We have an example of making random subsets here. Your workflow should be similar but if you're targeting a specific region you'll have to use the tools mentioned above. Does that make sense? Does that get you where you want to go? Good luck!

zx8754 commented 6 years ago

We could certainly split the VCF into chromosomes using bcftools or tabix but it would be nice to be able to do this using vcfR.

Just wondering if it is it at all possible, as R will read the full VCF anyway?

knausb commented 6 years ago

As long as you can read the data into R (i.e., have enough RAM), you should be able to subset it in vcfR. In the below link I describe how to use the square brackets to subset your vcfR object.

https://knausb.github.io/vcfR_documentation/vcfR_object.html

To subset on chromosome you can use the following.

library(vcfR)
data("vcfR_test")
# Fabricate example.
chrom2 <- vcfR_test
chrom2@fix[,"CHROM"] <- "30"
chrom2 <- rbind2(vcfR_test, chrom2)
# Subset.
chrom2 <- chrom2[getCHROM(chrom2) == "30",]

Does that get you where you want to be?

zx8754 commented 6 years ago

At the moment I have my VCF split by chrom, so vcfR is fine to read it into R. It is an OK workaround. If possible I would prefer keep one VCF, and use vcfR to read chroms I need.

knausb commented 6 years ago

If we can read in the data we can subset it as described here. We can also use read.vcfR() using the parameters nrows and skip to read in portions of a file. I think @zx8754 is proposing that we create a new parameter for read.vcfR() that would be a vector of chromosome names so that we only read in certain chromosomes. Is that correct?

zx8754 commented 6 years ago

Yes, @knausb , ideally I would prefer input of either (or both) as below:

chrom = chromName: e.g.: "chr1", or c("chr1", "chr2")
region = chromName:start-end: e.g.: "chr1:1000-2000", or c("chr1:10000-20000", "chr2:100-2000")

under-score commented 5 years ago

I see an error in create.chromR() when subsetting to a chromosome

myChroms <- unique(getCHROM(vcf))
cnd <- getCHROM(vcf) == myChroms[1]
vcf[cnd, ] #error logical subscript too long

knausb commented 5 years ago

Hi @under-score, apologies for the slow response, but I've been on holiday. Hope you have too!

I do not see how your post is relevant to this thread on reading a specific region of a VCF file. If you feel it is could you please elaborate so that I see this? If it is not relevant I suggest you start a new issue.

You have not provided a minimal reproducible example which makes it very difficult to diagnose any issue. I've put some suggestions here. I would suggest something such as what follows.

library(vcfR)
# Fabricate data.
data("vcfR_test")
vcf1 <- vcfR_test
vcf2 <- vcfR_test
vcf2@fix[,"CHROM"] <- 22
vcf <- rbind2(vcf1, vcf2)
# Example.
myChroms <- unique(getCHROM(vcf))
cnd <- getCHROM(vcf) == myChroms[1]
vcf[cnd, ]

This seems to work fine. Can you modify it to reproduce the behavior you're experiencing? Thanks and happy new year!

under-score commented 5 years ago

Thank you - try my code with clinvar.vcf from ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37

knausb commented 5 years ago

Thank you for providing an example data set. I've used it to create a minimal reproducible example. This should work with your data (commented out) or the smaller example included with vcfR.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.8.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
# vcf <- read.vcfR("https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar.vcf.gz", verbose = FALSE)
data("vcfR_test")
vcf <- vcfR_test
vcf@gt <- matrix(nrow = 0, ncol = 0)
table(getCHROM(vcf), useNA = "always")
#> 
#>   20 <NA> 
#>    5    0
# Provided example.
myChroms <- unique(getCHROM(vcf))
cnd <- getCHROM(vcf) == myChroms[1]
vcf[cnd, ]
#> Error in x@gt[i, j, drop = FALSE]: (subscript) logical subscript too long
# Works for the "fix" slot.
vcf@fix[cnd,][1:4, 1:7]
#>      CHROM POS       ID          REF ALT   QUAL FILTER
#> [1,] "20"  "14370"   "rs6054257" "G" "A"   "29" "PASS"
#> [2,] "20"  "17330"   NA          "T" "A"   "3"  "q10" 
#> [3,] "20"  "1110696" "rs6040355" "A" "G,T" "67" "PASS"
#> [4,] "20"  "1230237" NA          "T" NA    "47" "PASS"
# Does not work for the @gt slot because it is empty.
vcf@gt[cnd,]
#> Error in vcf@gt[cnd, ]: (subscript) logical subscript too long

^{Created on 2019-01-04 by the reprex package (v0.2.1)}

The vcfR object includes a @fix slot, which is a matrix, and a @gt slot, which is also a matrix. The @gt slot is optional, but I'm used to dealing with data that includes it. Your data is valid VCF data, but its just not what I'm used to dealing with. So I appreciate your post! The square brackets when used on a vcfR object attempt to subset both of these matrices. The error you've reported is because we're trying to subset a @gt matrix that has zero rows.

knausb commented 5 years ago

I think we now have a solution for @under-score's post. Its on the master branch now and can be installed as follows.

devtools::install_github(repo="knausb/vcfR")

This does not address the original post in this thread which I am still working on.

under-score commented 5 years ago

thank so much @knausb +1 for the solution

knausb / vcfR

Read specific region of VCF #107