knausb / vcfR

Tools to work with variant call format files
240 stars 54 forks source link

sex determination #182

Open AnjaliC4 opened 3 years ago

AnjaliC4 commented 3 years ago

Hi, Thanks for this very cool tool!

Could you please suggest a way to determine sex of every sample in vcfR? One approach I know is heterozygosity rate on chrX. However, not sure how to calclaute that in vcfR- would appreciate your advise.

In addition, could you please tell me the behavior of extract.gt(element="PL", as.numeric="TRUE"). I get numeric values but not sure how they go from e.g. 0:20:30 to one number. Thanks a lot!

knausb commented 3 years ago

Hi Anjali,

I'm afraid I do not know how to respond to your question about sex determination. In mammals we discuss X/Y sex chromosomes. In fungi we discuss mating type loci, and in oomycetes the genetic determination loci are uncharacterized. So I do not know what you are looking for. Do you work in human?

I also do not understand you second question. VCF data varies from the software that creates it. I would think that "PL" would be "phred scaled likelihood" but I don't know because I do not have access to your file. If I'm correct then I think that a numeric matrix would be correct.

Thanks! Brian

On Sat, Apr 10, 2021 at 3:43 PM Anjali Chawla @.***> wrote:

Hi, Thanks for this very cool tool!

Is there a way to determine sex of every sample in vcfR? For example, how can I calculate heterozygosity rate on chrX?

In addition, could you please tell me the what extract.gt(element="PL", as.numeric="TRUE") would go from PL values to numeric.

Thanks a lot!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/knausb/vcfR/issues/182, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABLQWXJFA4NHXGHHNBCXCUTTIDICLANCNFSM42W67ORA .

AnjaliC4 commented 3 years ago

Hi Brian,

I am working with humans. I am trying to figure out a way to determine chrX heterozygosity using "PL" which as you guessed, is absolutely correct, is the "Phred scaled likelihood." Yes, it does produce a numeric matrix, however, PL values are usually in for exmaple 0:20:40 format, corresponding to 0/0:0/1:1/1 genotypes. I am unsure how "extract.gt(element="PL", as.numeric="TRUE")" converts 0:20:40 to a number? Is it the lowest number that gets extracted as that represents the genotype likelihood?

Thank you Anjali

knausb commented 3 years ago

Hi AnjaliC4,

I think you want something like the following.

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.12.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****

data("vcfR_example")
vcf
#> ***** Object of Class vcfR *****
#> 18 samples
#> 1 CHROMs
#> 2,533 variants
#> Object size: 3.2 Mb
#> 8.497 percent missing data
#> *****        *****         *****
vcf@gt[1:3, 1:4]
#>      FORMAT           BL2009P4_us23               DDR7602                  
#> [1,] "GT:AD:DP:GQ:PL" "0|0:62,0:62:99:0,190,2835" "0|0:12,0:12:39:0,39,585"
#> [2,] "GT:AD:DP:GQ:PL" "1|0:5,5:10:99:111,0,114"   NA                       
#> [3,] "GT:AD:DP:GQ:PL" NA                          NA                       
#>      IN2009T1_us22              
#> [1,] "0|0:37,0:37:99:0,114,1709"
#> [2,] "0|1:2,1:3:16:16,0,48"     
#> [3,] "0|0:2,0:2:6:0,6,51"
pl <- extract.gt(vcf, element = "PL")
pl[1:4, 1:6]
#>                      BL2009P4_us23 DDR7602    IN2009T1_us22 LBUS5      NL07434 
#> Supercontig_1.50_2   "0,190,2835"  "0,39,585" "0,114,1709"  "0,39,585" NA      
#> Supercontig_1.50_246 "111,0,114"   NA         "16,0,48"     NA         NA      
#> Supercontig_1.50_549 NA            NA         "0,6,51"      NA         NA      
#> Supercontig_1.50_668 "0,3,44"      NA         "25,3,0"      NA         "0,3,28"
#>                      P10127    
#> Supercontig_1.50_2   "0,24,360"
#> Supercontig_1.50_246 "0,3,25"  
#> Supercontig_1.50_549 "99,12,0" 
#> Supercontig_1.50_668 "0,6,54"

pl1 <- masplit(pl, record = 1, sort = 0)
pl1[1:3, 1:4]
#>                      BL2009P4_us23 DDR7602 IN2009T1_us22 LBUS5
#> Supercontig_1.50_2               0       0             0     0
#> Supercontig_1.50_246           111      NA            16    NA
#> Supercontig_1.50_549            NA      NA             0    NA

Created on 2021-04-13 by the reprex package (v1.0.0)

This should give you the first PL element as a numeric.

But is the question you're trying to address is if chromX is heterozygous? Or are you trying to assess the quality of the genotypes? Because these are scaled, "0" is always the lowest. Just curious.

AnjaliC4 commented 3 years ago

Hi Brian,

Thanks for the example! I am trying to address if chrX is heterozygous. I believe in that case, I should extract the second element using mapsplit? Since, PL values are usually in 0/0:0/1:1/1 format. Please let me know what you think. Thank you.

knausb commented 3 years ago

Hi AnjaliC4,

PL is a comma delimited list of integers, one for each potential genotype. You could use this information to explore the confidence of the genotype call. Only one genotype is called per sample and per variant. So I feel your last statement was incorrect. If you just want the genotypes you could use the following.

gt <- extract.gt(vcf, element = "GT")
AnjaliC4 commented 3 years ago

Hi Brian, Thanks for the clarification. You are right. I am wondering in that case, how could I assess heterozygosity for a sample? Since GT may not tell anything about the quality of the genotype call. I also work with very sparse data. Is it possible to extract SNPs above a threshold for PL, to get genotypes called with high confidence? I could use only those above the threshold for estimating heterozygosity?

knausb commented 3 years ago

Hi AnjaliC4,

It appears to me that you have two questions here. First, is X heterozygous. Second, What is the quality of the genotypes.

If you ask your variant caller to call diploid variants on X you would expect some level of heterozygosity in a female because there are two copies of X. In males X is haploid so we would expect it to be homozygous. However, in genomic data we deal with lots of data so we would expect false positives. One way to manage this is to count the number of heterozygous positions on each chromosome and divide by it's length, giving us the rate of heterozygosity. I would expect that the autosomes would have a similar rate of heterozygosity. A female would have a rate of heterozygosity that is similar to the autosomes while a male X should have a decreased level of heterozygosity.

The "quality" of a genotype is much more difficult to estimate. I like to use depth and you can see how I use this at the below link.

https://knausb.github.io/vcfR_documentation/quick_intro.html

You seem to be interested in PL. One way of doing this would be to compare the best likelihood to the second best. A large difference would indicate confidence while a small difference would suggest ambiguity. With "phred scaled likelihoods" the "best' likelihood is always zero. So you can just use the value of the second lowest value. It has been my experience that repetitive regions that are poorly assembled have high coverage, which can lead to good likelihoods. My interpretation is that maximum likelihood does not understand copy number variation. So you'll have to make a judgement call here.

Good luck! Brian

AnjaliC4 commented 3 years ago

Thank you very much, Brian for your opinions and advice.