knausb / vcfR

Tools to work with variant call format files

Error: ID column contains non-unique names #193

Open bathycy opened 2 years ago

bathycy commented 2 years ago

When I try to run vcfR on a VCF file, I keep running into the same error when I try to extract GT from the genotype section:

Error in extract.gt(x = vcf, element = format_fields[i], as.numeric = coerce_numeric[i]) : ID column contains non-unique names

When I head() the file it looks fine initially, but I can't seem to run any other commands on it. Can you help me with this?

[1] " Object of class 'vcfR' "
[1] " Meta section "
[1] "##fileformat=VCFv4.1"
[1] "##FILTER=<ID=PASS,Description=\"All filters passed\">"
[1] "##filedate=2019.12.2"
[1] "##source=Minimac3"
[1] "##contig="
[1] "##FILTER=<ID=GENOTYPED,Description=\"Marker was genotyped AND imputed\">"
[1] "First 6 rows."
[1]
[1] " Fixed section "
     CHROM POS     ID            REF ALT QUAL FILTER
[1,] "8"   "11740" "rs531589080" "G" "A" NA   "PASS"
[2,] "8"   "11774" "rs143233250" "A" "T" NA   "PASS"
[3,] "8"   "11788" "rs564896271" "C" "T" NA   "PASS"
[4,] "8"   "11789" "rs527808609" "G" "A" NA   "PASS"
[5,] "8"   "11816" "rs75979472"  "T" "C" NA   "PASS"
[6,] "8"   "11879" "rs536257851" "A" "G" NA   "PASS"
[1]
[1] " Genotype section "
     FORMAT  dnl407754_icv
[1,] "GT:DS" "0|0:0.002"
[2,] "GT:DS" "1|1:1.208"
[3,] "GT:DS" "0|0:0.019"
[4,] "GT:DS" "0|0:0.007"
[5,] "GT:DS" "1|1:1.232"
[6,] "GT:DS" "0|0:0.009"
[1]
[1] "Unique GT formats:"
[1] "GT:DS"

I would upload the file, but the file type isn't supported.

knausb commented 2 years ago

Hi @bathycy, the VCF specification v4.3, section 1.6.1, subsection "3. ID", states that the ID column should contain 'unique identifiers' for each variant, when available. I suspect your error arises because your data include non-unique values in the ID column. This can be addressed as follows.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.12.0.9999 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
#?vcfR
data("vcfR_test")
vcfR_test
#> ***** Object of Class vcfR *****
#> 3 samples
#> 1 CHROMs
#> 5 variants
#> Object size: 0 Mb
#> 0 percent missing data
#> *****        *****         *****

myID <- getID(vcfR_test)
length(unique(myID, incomparables = NA)) == length(myID)
#> [1] TRUE

vcf2 <- rbind2(vcfR_test, vcfR_test[1,])
vcf2
#> ***** Object of Class vcfR *****
#> 3 samples
#> 1 CHROMs
#> 6 variants
#> Object size: 0 Mb
#> 0 percent missing data
#> *****        *****         *****
myID <- getID(vcf2)
length(unique(myID, incomparables = NA)) == length(myID)
#> [1] FALSE

vcf3 <- vcf2[!duplicated(myID, incomparables = NA), ]
myID <- getID(vcf3)
length(unique(myID, incomparables = NA)) == length(myID)
#> [1] TRUE

Created on 2021-12-17 by the reprex package (v2.0.1)

Here I've loaded an example data set and verified that its ID column is unique. Missing values (NA in R) are valid, so they are treated as 'incomparables' and never count as duplicates. I then used rbind2() to append a copy of the first variant and repeated the test to show that the ID column is now non-unique. The simplest path may be to omit the non-unique variants with the duplicated() function, as demonstrated; note that this keeps the first occurrence of each ID and drops the later ones. If you feel the duplicated variants are valuable, you may instead want to develop a workflow that identifies them and makes their IDs unique, for example by adding a suffix (1, 2, 3, ... or a, b, c, ...).
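
If you go the suffix route, a minimal sketch in base R might look like the following. The helper name make_ids_unique() and the "_" separator are my own assumptions for illustration, not part of vcfR. The sketch leaves missing IDs (NA) and already-unique IDs untouched and numbers every member of a duplicated group. It does not check whether a suffixed ID collides with an ID that already exists in the file, so you may want to re-run the uniqueness test afterwards.

```r
# Hypothetical helper (not part of vcfR): append an occurrence number to
# every ID that appears more than once; NA and unique IDs are left as-is.
make_ids_unique <- function(ids, sep = "_") {
  out <- ids
  ok <- !is.na(ids)          # missing IDs are valid and are not modified
  valid <- ids[ok]

  # Within each group of identical IDs, number the occurrences 1, 2, 3, ...
  occurrence <- ave(seq_along(valid), valid, FUN = seq_along)
  # Size of the group each ID belongs to
  group_size <- ave(seq_along(valid), valid, FUN = length)

  dup <- group_size > 1
  valid[dup] <- paste(valid[dup], occurrence[dup], sep = sep)

  out[ok] <- valid
  out
}

# Toy example with made-up IDs
ids <- c("rs1", "rs2", "rs1", NA, NA, "rs1")
new_ids <- make_ids_unique(ids)
print(new_ids)

# Sketch of use with a vcfR object named `vcf` (requires the vcfR package);
# the IDs live in the "ID" column of the fix slot:
# vcf@fix[, "ID"] <- make_ids_unique(getID(vcf))
```

After writing the new IDs back, the same test as above (length(unique(myID, incomparables = NA)) == length(myID)) should confirm whether the ID column is now unique.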

Please let me know if this resolves your issue. Thanks! Brian