Open kdm9 opened 6 years ago
Hi Kevin,
I believe we already have this functionality. Please see the man page.
?vcfR::read.vcfR
The parameters nrows
and skip
should help you select rows (variants). The parameter cols
should allow you to specify columns (samples). Please give it a try and let me know if it was what you were looking for.
Brian
As far as I can tell, one can't extract a specific region query using this approach. I can take a given number of variants from a certain position in the VCF file (e.g. the 100k-200k'th variants), but not a chromosome segment (Chrom. 2, from position 100kbp-200kbp). The windowed analyses I perform require the latter.
You are correct. You can read a portion of the file in to RAM but the coordinates are in rows and columns. If you are interested in a specific region you'll have to make a guess as to where it may be, read that portion in (plus a little extra to be sure you get what you want), and subset to your region of interest using getCHROM()
, getPOS()
, and the square brackets []
. I advocate subsetting your data to one chromosome per file. If you're interested in chromosome 5 and your data are sorted in one file you'll have to go through chromosomes 1:4 to get to 5. If you have them in separate files you can go straight to the chromosome of interest. We have an example of making random subsets here. Your workflow should be similar but if you're targeting a specific region you'll have to use the tools mentioned above. Does that make sense? Does that get you where you want to go? Good luck!
We could certainly split the VCF into chromosomes using bcftools or tabix but it would be nice to be able to do this using vcfR.
Just wondering if it is it at all possible, as R will read the full VCF anyway?
As long as you can read the data into R (i.e., have enough RAM), you should be able to subset it in vcfR. In the below link I describe how to use the square brackets to subset your vcfR object.
https://knausb.github.io/vcfR_documentation/vcfR_object.html
To subset on chromosome you can use the following.
library(vcfR)
data("vcfR_test")
# Fabricate example.
chrom2 <- vcfR_test
chrom2@fix[,"CHROM"] <- "30"
chrom2 <- rbind2(vcfR_test, chrom2)
# Subset.
chrom2 <- chrom2[getCHROM(chrom2) == "30",]
Does that get you where you want to be?
At the moment I have my VCF split by chrom, so vcfR is fine to read it into R. It is an OK workaround. If possible I would prefer keep one VCF, and use vcfR to read chroms I need.
If we can read in the data we can subset it as described here. We can also use read.vcfR()
using the parameters nrows
and skip
to read in portions of a file. I think @zx8754 is proposing that we create a new parameter for read.vcfR()
that would be a vector of chromosome names so that we only read in certain chromosomes. Is that correct?
Yes, @knausb , ideally I would prefer input of either (or both) as below:
chrom = chromName
: e.g.: "chr1"
, or c("chr1", "chr2")
region = chromName:start-end
: e.g.: "chr1:1000-2000"
, or c("chr1:10000-20000", "chr2:100-2000")
I see an error in create.chromR() when subsetting to a chromosome
myChroms <- unique(getCHROM(vcf))
cnd <- getCHROM(vcf) == myChroms[1]
vcf[cnd, ] #error logical subscript too long
Hi @under-score, apologies for the slow response, but I've been on holiday. Hope you have too!
I do not see how your post is relevant to this thread on reading a specific region of a VCF file. If you feel it is could you please elaborate so that I see this? If it is not relevant I suggest you start a new issue.
You have not provided a minimal reproducible example which makes it very difficult to diagnose any issue. I've put some suggestions here. I would suggest something such as what follows.
library(vcfR)
# Fabricate data.
data("vcfR_test")
vcf1 <- vcfR_test
vcf2 <- vcfR_test
vcf2@fix[,"CHROM"] <- 22
vcf <- rbind2(vcf1, vcf2)
# Example.
myChroms <- unique(getCHROM(vcf))
cnd <- getCHROM(vcf) == myChroms[1]
vcf[cnd, ]
This seems to work fine. Can you modify it to reproduce the behavior you're experiencing? Thanks and happy new year!
Thank you - try my code with clinvar.vcf from ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37
Thank you for providing an example data set. I've used it to create a minimal reproducible example. This should work with your data (commented out) or the smaller example included with vcfR.
library(vcfR)
#>
#> ***** *** vcfR *** *****
#> This is vcfR 1.8.0
#> browseVignettes('vcfR') # Documentation
#> citation('vcfR') # Citation
#> ***** ***** ***** *****
# vcf <- read.vcfR("https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/clinvar.vcf.gz", verbose = FALSE)
data("vcfR_test")
vcf <- vcfR_test
vcf@gt <- matrix(nrow = 0, ncol = 0)
table(getCHROM(vcf), useNA = "always")
#>
#> 20 <NA>
#> 5 0
# Provided example.
myChroms <- unique(getCHROM(vcf))
cnd <- getCHROM(vcf) == myChroms[1]
vcf[cnd, ]
#> Error in x@gt[i, j, drop = FALSE]: (subscript) logical subscript too long
# Works for the "fix" slot.
vcf@fix[cnd,][1:4, 1:7]
#> CHROM POS ID REF ALT QUAL FILTER
#> [1,] "20" "14370" "rs6054257" "G" "A" "29" "PASS"
#> [2,] "20" "17330" NA "T" "A" "3" "q10"
#> [3,] "20" "1110696" "rs6040355" "A" "G,T" "67" "PASS"
#> [4,] "20" "1230237" NA "T" NA "47" "PASS"
# Does not work for the @gt slot because it is empty.
vcf@gt[cnd,]
#> Error in vcf@gt[cnd, ]: (subscript) logical subscript too long
Created on 2019-01-04 by the reprex package (v0.2.1)
The vcfR object includes a @fix
slot, which is a matrix, and a @gt
slot, which is also a matrix. The @gt
slot is optional, but I'm used to dealing with data that includes it. Your data is valid VCF data, but its just not what I'm used to dealing with. So I appreciate your post! The square brackets when used on a vcfR object attempt to subset both of these matrices. The error you've reported is because we're trying to subset a @gt matrix that has zero rows.
I think we now have a solution for @under-score's post. Its on the master branch now and can be installed as follows.
devtools::install_github(repo="knausb/vcfR")
This does not address the original post in this thread which I am still working on.
thank so much @knausb +1 for the solution
Hi
Thanks for an awesome package. We work on a series massive low-coverage WGS dataset (>100 million sites, ~1000 samples), which (from what I can tell) prevents in-memory computation on the whole dataset. We need to read both GT, PL, and AD fields for each genotype, and unfortunately other R packages with support for reading region subsets of a VCF (e.g. PopGenome, bigsnpr) don't support this. It would be great if vcfR supported this. Is there any chance that vcfR could provide this functionality?
Thanks heaps, Kevin