Open bathycy opened 2 years ago
Hi @bathycy, the VCF specification v4.3, section 1.6.1, subsection "3. ID", states that the ID column should contain 'unique identifiers' for each variant, when available. The likely cause of your error is that your data include non-unique values in the ID column. This can be addressed as follows.
library(vcfR)
#>
#> ***** *** vcfR *** *****
#> This is vcfR 1.12.0.9999
#> browseVignettes('vcfR') # Documentation
#> citation('vcfR') # Citation
#> ***** ***** ***** *****
#?vcfR
data("vcfR_test")
vcfR_test
#> ***** Object of Class vcfR *****
#> 3 samples
#> 1 CHROMs
#> 5 variants
#> Object size: 0 Mb
#> 0 percent missing data
#> ***** ***** *****
myID <- getID(vcfR_test)
length(unique(myID, incomparables = NA)) == length(myID)
#> [1] TRUE
vcf2 <- rbind2(vcfR_test, vcfR_test[1,])
vcf2
#> ***** Object of Class vcfR *****
#> 3 samples
#> 1 CHROMs
#> 6 variants
#> Object size: 0 Mb
#> 0 percent missing data
#> ***** ***** *****
myID <- getID(vcf2)
length(unique(myID, incomparables = NA)) == length(myID)
#> [1] FALSE
vcf3 <- vcf2[!duplicated(myID, incomparables = NA), ]
myID <- getID(vcf3)
length(unique(myID, incomparables = NA)) == length(myID)
#> [1] TRUE
Created on 2021-12-17 by the reprex package (v2.0.1)
Here I've loaded an example data set and validated that the ID column is unique. Note that missing values (NA in R) are valid, so they are passed as 'incomparables' and are never counted as duplicates. I've then used rbind2() to add a non-unique variant and repeated the test to show that the ID column is now non-unique. The simplest path may be to omit the non-unique variants using the duplicated() function, as I have demonstrated. If you feel these duplicated variants are valuable, you may instead want to develop a workflow that identifies them and makes their IDs unique, for example by adding a suffix (e.g., 1, 2, 3, or a, b, c, ...).
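As a minimal sketch of the suffix approach, base R's make.unique() can append a counter to repeated IDs. The helper name make_ids_unique, the "_" separator, and the example IDs below are illustrative assumptions and are not part of vcfR; missing IDs are left as NA.

```r
# Hypothetical helper (not part of vcfR): append a numeric suffix to
# repeated IDs while leaving missing IDs (NA) untouched.
make_ids_unique <- function(ids, sep = "_") {
  out <- as.character(ids)
  has_id <- !is.na(out)
  # make.unique() keeps the first occurrence as-is and suffixes later repeats.
  out[has_id] <- make.unique(out[has_id], sep = sep)
  out
}

# Made-up IDs for illustration only.
ids <- c("rs1", "rs2", "rs1", NA, NA, "rs1")
new_ids <- make_ids_unique(ids)
print(new_ids)
```

With a vcfR object, the input could come from getID(vcf2). Assuming the fixed data are held as a character matrix in the object's fix slot, the result could be written back with something like vcf2@fix[, "ID"] <- make_ids_unique(getID(vcf2)), but check this against your own object before relying on it.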
Please let me know if this resolves your issue. Thanks! Brian
When I run vcfR on a VCF file, I keep hitting the same error whenever I try to extract the GT element from the genotype section:
Error in extract.gt(x = vcf, element = format_fields[i], as.numeric = coerce_numeric[i]) : ID column contains non-unique names
When I run head() on the file it looks fine initially, but I can't seem to run any other commands on it. Can you help me with this?
[1] " Object of class 'vcfR' "
[1] " Meta section "
[1] "##fileformat=VCFv4.1"
[1] "##FILTER=<ID=PASS,Description=\"All filters passed\">"
[1] "##filedate=2019.12.2"
[1] "##source=Minimac3"
[1] "##contig="
[1] "##FILTER=<ID=GENOTYPED,Description=\"Marker was genotyped AND imputed\">"
[1] "First 6 rows."
[1]
[1] " Fixed section "
CHROM POS ID REF ALT QUAL FILTER
[1,] "8" "11740" "rs531589080" "G" "A" NA "PASS"
[2,] "8" "11774" "rs143233250" "A" "T" NA "PASS"
[3,] "8" "11788" "rs564896271" "C" "T" NA "PASS"
[4,] "8" "11789" "rs527808609" "G" "A" NA "PASS"
[5,] "8" "11816" "rs75979472" "T" "C" NA "PASS"
[6,] "8" "11879" "rs536257851" "A" "G" NA "PASS"
[1]
[1] " Genotype section "
FORMAT dnl407754_icv
[1,] "GT:DS" "0|0:0.002"
[2,] "GT:DS" "1|1:1.208"
[3,] "GT:DS" "0|0:0.019"
[4,] "GT:DS" "0|0:0.007"
[5,] "GT:DS" "1|1:1.232"
[6,] "GT:DS" "0|0:0.009"
[1]
[1] "Unique GT formats:"
[1] "GT:DS"
I would upload the file, but the file type isn't supported.
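One way to see which IDs trigger the "non-unique names" error is to tabulate the non-missing IDs and keep those that occur more than once. This is only a sketch in base R; the helper name find_duplicate_ids and the example IDs are assumptions for illustration. With a vcfR object, the input would be getID(vcf).

```r
# Hypothetical helper (not part of vcfR): count how often each
# non-missing ID occurs and return only the IDs seen more than once.
find_duplicate_ids <- function(ids) {
  counts <- table(ids[!is.na(ids)])
  counts[counts > 1]
}

# Made-up IDs for illustration only; NA stands for a missing ID.
ids <- c("rs1", "rs2", "rs1", NA, NA, "rs3", "rs2", "rs1")
dup_counts <- find_duplicate_ids(ids)
print(dup_counts)
```

An empty result means every non-missing ID is already unique, in which case the error would have to come from somewhere else.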