knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

can I use masker on INFO field values? #148

Closed splaisan closed 4 years ago

splaisan commented 4 years ago

I would like to imitate the GATK genotyper filtering using vcfR but masker seems to be tailored to FORMAT fields. Is there a way to filter on INFO fields and if yes could you please share a demo command Thanks

here is the GATK hard filtering logic and the corresponding INFO fields are present QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 3.0 || QUAL < 30

knausb commented 4 years ago

Hi @splaisan ,

I'm not sure I understand your goal here. The GATK is a dynamic project, so their recommendations change through time and I'm not sure where they currently are. We can build an example though.

Below is an example of working with the INFO data

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.8.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
data("vcfR_test")
vcfR_test
#> ***** Object of Class vcfR *****
#> 3 samples
#> 1 CHROMs
#> 5 variants
#> Object size: 0 Mb
#> 0 percent missing data
#> *****        *****         *****

getINFO(vcfR_test)
#> [1] "NS=3;DP=14;AF=0.5;DB;H2"           "NS=3;DP=11;AF=0.017"              
#> [3] "NS=2;DP=10;AF=0.333,0.667;AA=T;DB" "NS=3;DP=13;AA=T"                  
#> [5] "NS=3;DP=9;AA=G"
vcf2 <- vcfR_test[as.numeric(extract.info(vcfR_test, element = "DP")) > 10, ]
vcf2
#> ***** Object of Class vcfR *****
#> 3 samples
#> 1 CHROMs
#> 3 variants
#> Object size: 0 Mb
#> 0 percent missing data
#> *****        *****         *****
vcf2@fix
#>      CHROM POS       ID          REF ALT QUAL FILTER
#> [1,] "20"  "14370"   "rs6054257" "G" "A" "29" "PASS"
#> [2,] "20"  "17330"   NA          "T" "A" "3"  "q10" 
#> [3,] "20"  "1230237" NA          "T" NA  "47" "PASS"
#>      INFO                     
#> [1,] "NS=3;DP=14;AF=0.5;DB;H2"
#> [2,] "NS=3;DP=11;AF=0.017"    
#> [3,] "NS=3;DP=13;AA=T"

Created on 2019-12-09 by the reprex package (v0.3.0)

Personally, I don't work with the INFO column much. This is because these are summaries over all the samples. If you have a few low quality samples they will drive down these summaries. I prefer to work with the data associated with each individual genotype.

Please let me know if that gets you what you need. Or if we need to expand the example.

splaisan commented 4 years ago

your command syntax worked out just fine, thanks a lot for your help

knausb commented 4 years ago

Thank you for reporting your outcome!