Open EveTC opened 3 years ago
Hi @EveTC, I suspect the issue is how missing data are handled. I've built the following example based on your code above.
> library(vcfR)
> data(vcfR_example)
> vcf
***** Object of Class vcfR *****
18 samples
1 CHROMs
2,533 variants
Object size: 3.2 Mb
8.497 percent missing data
***** ***** *****
> getPOS(vcf)[1:10]
[1] 2 246 549 668 765 780 989 1670 1692 1775
> gt <- extract.gt(vcf)
> table(gt[3, ])
0|0 1|1
2 1
We see that position (POS) 549 is the third variant in the file, so we can extract the genotypes and query variant (row) three. We see two distinct genotypes, 0|0 and 1|1, both phased and diploid. They are diploid because there are two integer alleles for each genotype. They are phased because the alleles are delimited with "|" instead of "/". We see a total of three genotypes even though we have 18 samples. This means the other 15 samples were missing genotypes. Because we have 2 of the 0|0 genotype, we have a total of 4 of the 0 allele. Because we have one of the 1|1 genotype, we have a total of two of the 1 allele. I believe adegenet handles missing data as another allelic state, but I suggest you consult its documentation. How to handle missing data is one of those important details that's easy to overlook.
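The allele arithmetic above can be reproduced in base R. This is a minimal sketch and not vcfR's internal code: the vector gt_row is made-up data that mimics row three (three called genotypes, fifteen missing), count_alleles is a hypothetical helper, and it assumes missing genotypes are stored as NA.

```r
# Hypothetical genotypes for one variant across 18 samples:
# three called genotypes, the remaining 15 missing (NA).
gt_row <- c("0|0", "0|0", "1|1", rep(NA_character_, 15))

# Hypothetical helper: split each called genotype on the phased ("|") or
# unphased ("/") delimiter and tabulate the alleles, dropping missing calls.
count_alleles <- function(gt) {
  called <- gt[!is.na(gt)]
  alleles <- unlist(strsplit(called, "[|/]"))
  table(alleles)
}

counts <- count_alleles(gt_row)
print(counts)       # 4 copies of allele 0, 2 copies of allele 1
print(sum(counts))  # 6 allele copies in total, not 36 (18 samples x 2)
```

The total is the number of allele copies actually observed, which is why it can be much smaller than twice the sample size when many genotypes are missing.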
Does that make sense? Brian
Hi Brian,
Thank you for your explanation, I think I follow what you are saying.
How does vcfR handle missing data when calculating Hs in genetic_diff()? How do I know that the missing data have been read in correctly?
Thanks again, Eve
Hi @EveTC, vcfR::genetic_diff() simply ignores missing data. That's why the number of alleles is different for some variants even though the sample size is constant. This can be validated by looking at the "n_*" columns. Does this address your question?
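To illustrate why the allele count can vary between variants while the sample size stays constant, here is a small base R sketch. It is not the implementation of genetic_diff(): the genotype matrix, the population labels, and the helper n_alleles are all invented for illustration, and it assumes diploid genotypes with missing calls stored as NA.

```r
# Made-up genotype matrix: 2 variants (rows) x 6 samples (columns).
gt <- matrix(
  c("0/0", "0/1", NA,  "1/1", "0/1", "0/0",
    "0/0", NA,    NA,  "0/1", NA,    "1/1"),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("var1", "var2"), paste0("s", 1:6))
)

# Hypothetical population assignment: three samples in each population.
pops <- factor(c("a", "a", "a", "b", "b", "b"))

# Hypothetical helper: allele copies observed per variant in a set of samples.
# Each called diploid genotype contributes two; missing calls contribute none.
n_alleles <- function(gt_sub) {
  apply(gt_sub, 1, function(x) {
    length(unlist(strsplit(x[!is.na(x)], "[|/]")))
  })
}

# One column per population, one row per variant.
n_by_pop <- sapply(levels(pops), function(p) {
  n_alleles(gt[, pops == p, drop = FALSE])
})
print(n_by_pop)
```

With three diploid samples per population the maximum is 6 allele copies per variant, and each missing genotype lowers the count by two. By the same reasoning, a population of 20 fully genotyped diploid samples would show 40.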
Hi Brian,
Yes it does - thank you for your explanation. I have a better understanding now. Eve
Hello
I am somewhat confused by the output of genetic_diff(), in particular the number of alleles in each population. Given the example data below:
To get more information about the object vcf, I converted it to a genind object.
From this additional information I can see that there is a range of 1-5 alleles per locus and that the data are diploid. Therefore, from my understanding, Supercontig_1.50:549 (CHROM:POS) has 2 alleles, and the reported value is 4 because 2 x 2 (diploid) = 4. Is this description correct?
If so, I am confused by my vcf output.
If my vcf has a maximum of 2 alleles per locus and the organism is diploid, then surely the maximum number of alleles should be 4? However, the number of alleles ranges from 0 to 40.
My apologies if this is a simplistic question; I am new to this sort of analysis. Any advice to shed light on this would be greatly appreciated.
Thank you