vatlab / varianttools

software tool for the manipulation, annotation, selection, and analysis of variants in the context of next-gen sequencing analysis
https://vatlab.github.io/vat-docs/
GNU General Public License v3.0

Importing individual samples without 0/0 GT #63

Closed kromanenkov closed 6 years ago

kromanenkov commented 6 years ago

Hello!

I have some VCF files (one sample per file) that contain only 0/1 and 1/1 GT entries (no 0/0 GT). When I import them into VariantTools and run association tests, the resulting genotype matrix consists only of 1, 2 and NA. Also, many samples are discarded during association testing due to missing genotype info (AFAIU because 0/0 GT entries are not explicitly present in the VCF files), which drastically reduces the dimensions of the genotype matrix.

So is there a rule of thumb for dealing with such VCF files? I noticed that the datasets used in the VariantTools tutorials contain 0/0 GT entries. How can I convert my data to that format? Or is there an option in VariantTools to avoid treating such variants as missing genotype info?

Thanks

BoPeng commented 6 years ago

There is an option, treating_missing_as_wildtype, for this. Please check the documentation for details.

kromanenkov commented 6 years ago

Thanks for your answer, Prof. Peng!

I have another question about this option: the documentation says that, besides converting missing genotypes, it also converts removed low-quality genotypes. Does that mean that previously removed variants (using vtools remove variants ...) would be converted too? Or does it only concern variants filtered and selected through the vtools associate options?

gaow commented 6 years ago

@kromanenkov vtools remove variants will indeed remove the entire variant site from all samples in the data, and its filtering criterion is applied at the variant level. For low-quality genotype calls, vtools remove genotype will mark them as missing, but they will still be present in the data unless the entire variant site is missing. There are 2 things you can do from here:

  1. To permanently remove these sites, you need to additionally compute the missing rate after vtools remove genotype, then use vtools remove variants conditional on the missing rate.
  2. Or, vtools associate provides an on-the-fly method to filter variants / samples at the gene level based on the degree of missing genotypes. Unlike option 1, it does not change the original data.
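The missing-rate filtering described in both options can be sketched conceptually as follows. This is plain Python, not VariantTools code: the function names, the 0.5 threshold, and the sample matrix are hypothetical, and None stands for a genotype marked as missing. Like option 2, the sketch builds a filtered copy and leaves the original matrix untouched.

```python
# Conceptual sketch only; names, data, and threshold are illustrative assumptions.
# Rows are samples, columns are variant sites; None marks a missing genotype.

def missing_rate_per_variant(matrix):
    """Fraction of samples with a missing genotype at each variant site."""
    n_samples = len(matrix)
    n_variants = len(matrix[0])
    return [
        sum(1 for row in matrix if row[j] is None) / n_samples
        for j in range(n_variants)
    ]

def drop_variants(matrix, max_missing):
    """Keep only variant columns whose missing rate is <= max_missing."""
    rates = missing_rate_per_variant(matrix)
    keep = [j for j, rate in enumerate(rates) if rate <= max_missing]
    return keep, [[row[j] for j in keep] for row in matrix]

genotypes = [
    [0, None, 2, None],
    [1, None, 0, 0],
    [0, 1, None, None],
    [2, None, 1, 0],
]

rates = missing_rate_per_variant(genotypes)
kept, filtered = drop_variants(genotypes, max_missing=0.5)
print(rates)
print(kept)
```

The same idea applies to samples by computing the rate across each row instead of each column.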