knausb / vcfR

Tools to work with variant call format files
240 stars 54 forks source link

Filtering for minor allele Frequency #204

Closed mattbareno closed 1 year ago

mattbareno commented 1 year ago

Hey! I was curious if there was an option to filter for minor allele frequency. Using the masker function i successfully filtered for a read depth and mapping quality threshold. The manual doesn't say anything about a maf option, even though a publication claimed to use vcfR for maf filtering (https://apsjournals.apsnet.org/doi/10.1094/MPMI-05-19-0131-R ; see methods).

Could you help me through this? Thank you,

knausb commented 1 year ago

You could try ?vcfR::maf() to get a vector of frequencies and filter on that. Does that accomplish your goal?

mattbareno commented 1 year ago

In order to filter using the maf function, i had to convert the maf(chrom) to a dataframe, but how do i replace the values of the original object with this filtered dataframe?

For example, to attempt to filter based on 3 conditions (minimum read depth, mapping quality, and frequency) i used the following code:

image

But i cannot figure out this last part.

p.s. im relatively new to R so sorry for obvious things i may be confusing

Best,

knausb commented 1 year ago

Hi, when in doubt, simplify. Try one thing at a time. How about the following for MAF?

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.14.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
data(vcfR_example)
vcf
#> ***** Object of Class vcfR *****
#> 18 samples
#> 1 CHROMs
#> 2,533 variants
#> Object size: 3.2 Mb
#> 8.497 percent missing data
#> *****        *****         *****

my_maf <- maf(vcf)
my_maf[1:3, ]
#>                      nAllele NA Count  Frequency
#> Supercontig_1.50_2        24  6     1 0.04166667
#> Supercontig_1.50_246      22  7     5 0.22727273
#> Supercontig_1.50_549       6 15     2 0.33333333
vcf <- vcf[my_maf[, 'Frequency'] > 0.014, ]
vcf
#> ***** Object of Class vcfR *****
#> 18 samples
#> 1 CHROMs
#> 2,511 variants
#> Object size: 3.2 Mb
#> 8 percent missing data
#> *****        *****         *****

Created on 2023-03-21 with reprex v2.0.2

For DP and MQ I'd suggest something like the following.

https://knausb.github.io/vcfR_documentation/censoring_data.html

This will allow you to censor individual genotypes where the masker function works on row (variant)-wise summaries. I prefer working on the individual genotypes.

Does this provide you with a solution?

mattbareno commented 1 year ago

Yes thank you so much!