knausb / vcfR

Tools to work with variant call format files
249 stars 54 forks source link

Adding INFO tags on the basis information in the INFO tags #180

Closed misrak closed 3 years ago

misrak commented 3 years ago

Hello,

I was wondering if we can add INFO tags to an object on the basis of an already existing INFO tag. For example, a vcf contains an INFO tag such as 'GENEINFO' on the basis what 'GENEINFO' equals I want to add more INFO tags such as 'Gene_aliases' or 'ExAC_pLI' mostly the information present in dbNSFP${version}_gene.gz

In your examples, I can filter a file but adding something I could not see. Do you think vcfR can perform such a thing?

Thank you.

knausb commented 3 years ago

Hello,

I'm not familiar with this sort of data. So perhaps I do not understand. But I think this can be done with paste().

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.12.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
data("vcfR_test")
vcfR_test@fix[, "INFO"]
#> [1] "NS=3;DP=14;AF=0.5;DB;H2"           "NS=3;DP=11;AF=0.017"              
#> [3] "NS=2;DP=10;AF=0.333,0.667;AA=T;DB" "NS=3;DP=13;AA=T"                  
#> [5] "NS=3;DP=9;AA=G"
vcfR_test@fix[, "INFO"] <- paste(vcfR_test@fix[, "INFO"], ";Gene_aliases=", 1:5, sep = "")
vcfR_test@fix[, "INFO"]
#> [1] "NS=3;DP=14;AF=0.5;DB;H2;Gene_aliases=1"          
#> [2] "NS=3;DP=11;AF=0.017;Gene_aliases=2"              
#> [3] "NS=2;DP=10;AF=0.333,0.667;AA=T;DB;Gene_aliases=3"
#> [4] "NS=3;DP=13;AA=T;Gene_aliases=4"                  
#> [5] "NS=3;DP=9;AA=G;Gene_aliases=5"

Created on 2021-03-23 by the reprex package (v1.0.0)

Does that help?

misrak commented 3 years ago

It does. As we adding the information to the INFO column we have to update the header also. Is it done the same by using paste?

knausb commented 3 years ago

In VCF data there is a "meta" section where you can document your flags. In a vcfR object this region is a character vector. So you should add a new element for each flag.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.12.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
data("vcfR_test")
class(vcfR_test@meta)
#> [1] "character"
vcfR_test@meta
#>  [1] "##fileformat=VCFv4.3"                                                                                                  
#>  [2] "##fileDate=20090805"                                                                                                   
#>  [3] "##source=myImputationProgramV3.1"                                                                                      
#>  [4] "##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta"                                                      
#>  [5] "##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species=\"Homo sapiens\",taxonomy=x>"
#>  [6] "##phasing=partial"                                                                                                     
#>  [7] "##INFO=<ID=NS,Number=1,Type=Integer,Description=\"Number of Samples With Data\">"                                      
#>  [8] "##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Total Depth\">"                                                      
#>  [9] "##INFO=<ID=AF,Number=A,Type=Float,Description=\"Allele Frequency\">"                                                   
#> [10] "##INFO=<ID=AA,Number=1,Type=String,Description=\"Ancestral Allele\">"                                                  
#> [11] "##INFO=<ID=DB,Number=0,Type=Flag,Description=\"dbSNP membership, build 129\">"                                         
#> [12] "##INFO=<ID=H2,Number=0,Type=Flag,Description=\"HapMap2 membership\">"                                                  
#> [13] "##FILTER=<ID=q10,Description=\"Quality below 10\">"                                                                    
#> [14] "##FILTER=<ID=s50,Description=\"Less than 50% of samples have data\">"                                                  
#> [15] "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">"                                                        
#> [16] "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">"                                               
#> [17] "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Read Depth\">"                                                     
#> [18] "##FORMAT=<ID=HQ,Number=2,Type=Integer,Description=\"Haplotype Quality\">"

Created on 2021-03-26 by the reprex package (v1.0.0)

misrak commented 3 years ago

Okay thank you!