knausb / vcfR

Tools to work with variant call format files
246 stars 55 forks source link

Rename Genotype information #199

Closed Flavjack closed 2 years ago

Flavjack commented 2 years ago

It is possible to rename headers in Genotype information (@gt)?

I tried importing the vcf file with vcfR2tidy() and after manipulating the Genotype information and get it. But I was not able to re-export it as a new vcf file. Do you have any hints? Thanks

knausb commented 2 years ago

Hi @Flavjack , I would not consider the VCF specification to be a form of "tidy data". The "tidy" functions in vcfR are there to help you work with VCF data if you would like to convert them to a tidy format. However, this conversion should be considered a departure from the VCF specification. If your goal is to produce a VCF compliant file I would suggest you avoid the tidy functions. If you're interested in producing a VCF compliant file I would suggest you make as few manipulations as possible. This is because every manipulation you make provides an opportunity for you to deviate from the VCF specification. I'd suggest something like the following.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.12.0.9999 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
data("vcfR_test")
vcfR_test
#> ***** Object of Class vcfR *****
#> 3 samples
#> 1 CHROMs
#> 5 variants
#> Object size: 0 Mb
#> 0 percent missing data
#> *****        *****         *****

vcfR_test@gt
#>      FORMAT        NA00001          NA00002          NA00003       
#> [1,] "GT:GQ:DP:HQ" "0|0:48:1:51,51" "1|0:48:8:51,51" "1/1:43:5:.,."
#> [2,] "GT:GQ:DP:HQ" "0|0:49:3:58,50" "0|1:3:5:65,3"   "0/0:41:3"    
#> [3,] "GT:GQ:DP:HQ" "1|2:21:6:23,27" "2|1:2:0:18,2"   "2/2:35:4"    
#> [4,] "GT:GQ:DP:HQ" "0|0:54:7:56,60" "0|0:48:4:51,51" "0/0:61:2"    
#> [5,] "GT:GQ:DP"    "0/1:35:4"       "0/2:17:2"       "1/1:40:3"
colnames(vcfR_test@gt)[-1] <- paste("newName", 1:3, sep = "")
vcfR_test@gt
#>      FORMAT        newName1         newName2         newName3      
#> [1,] "GT:GQ:DP:HQ" "0|0:48:1:51,51" "1|0:48:8:51,51" "1/1:43:5:.,."
#> [2,] "GT:GQ:DP:HQ" "0|0:49:3:58,50" "0|1:3:5:65,3"   "0/0:41:3"    
#> [3,] "GT:GQ:DP:HQ" "1|2:21:6:23,27" "2|1:2:0:18,2"   "2/2:35:4"    
#> [4,] "GT:GQ:DP:HQ" "0|0:54:7:56,60" "0|0:48:4:51,51" "0/0:61:2"    
#> [5,] "GT:GQ:DP"    "0/1:35:4"       "0/2:17:2"       "1/1:40:3"

Created on 2022-07-26 by the reprex package (v2.0.1)

Note that the "FORMAT" column must be named "FORMAT" if you want your data to be VCF compliant. Does this address your issue?

Flavjack commented 2 years ago

@knausb thanks for the explanation and the REPREX. The explanation solved my question. Many thanks :)