knausb / vcfR

Tools to work with variant call format files
246 stars 55 forks source link

Heterozygous SNP calling #197

Closed ettorefedele closed 2 years ago

ettorefedele commented 2 years ago

Hello,

I have recently generated a vcf file containing autosomal SNP data for diploid organisms using the samtools mpileup | bcftools call pipeline. Because these individuals were genotyped in the past, I know what I expect to see in the vcf file. Although bcftools predicts genotypes, I was wondering whether there is a way in vcfR to exploit the "AD" information to assess site heterozygosity? I would like to set a threshold of 0.2, above which a SNP site can be considered heterozygous. So far I have imported my vcf and extracted "AD" information:

ad<-extract.gt(vcf,element="AD") ad1 <- masplit(ad, record = 1) ad2 <- masplit(ad, record = 2)

Now, I would like to know if there is a way to tell the software to call heterozygous all those sites where ad1/ad2 or ad2/ad1 > 0.20?

Thank you very much.

knausb commented 2 years ago

Hi, we've put some similar information on the ploidy section of the documentation. Please take a look and let me know if this helps. If not, perhaps start with a mre. Thanks!

ettorefedele commented 2 years ago

Hi Brian,

many thanks for your quick response. I studied the ploidy section of the documentation, but this doesn't appear to address my specific issue. In fact, it seems it focuses more on determining ploidy rather than explaining how to call genotypes based on allele depth ratio (i.e. ad1/ad2 and ad2/ad1). I understand this may be a question you don't get asked very often, so I really appreciate your help this time.

Thanks!

On Mon, Mar 14, 2022 at 9:11 PM Brian Knaus @.***> wrote:

Hi, we've put some similar information on the ploidy section https://knausb.github.io/vcfR_documentation/determining_ploidy_1.html of the documentation. Please take a look and let me know if this helps. If not, perhaps start with a mre https://knausb.github.io/vcfR_documentation/reporting_issue.html. Thanks!

— Reply to this email directly, view it on GitHub https://github.com/knausb/vcfR/issues/197#issuecomment-1067294655, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUHO32D62X6A3HJO4QMGKYLU76TQ3ANCNFSM5QWBCHSQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you authored the thread.Message ID: @.***>

knausb commented 2 years ago

Hi,

Is the following what you're looking for?

# Load libraries
library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.12.0.9999 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
library(pinfsc50)

# Determine file locations
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz",
                        package = "pinfsc50")

# Read data into memory
vcf <- read.vcfR(vcf_file, verbose = FALSE)
vcf
#> ***** Object of Class vcfR *****
#> 18 samples
#> 1 CHROMs
#> 22,031 variants
#> Object size: 22.4 Mb
#> 7.929 percent missing data
#> *****        *****         *****

ad <- extract.gt(vcf, element = 'AD')
allele1 <- masplit(ad, record = 1)
allele2 <- masplit(ad, record = 2)

ad1 <- allele1 / (allele1 + allele2)
ad2 <- allele2 / (allele1 + allele2)

gt <- extract.gt(vcf)

table(gt[ ad1 > 0.2 & ad1 < 0.8 ])
#> 
#>   0|0   0|1   0|2   0|3   1|0   1|1   1|2   1|3   2|0   2|1   2|2   2|3   2|4 
#>    47 26194   317     8 25838    26   110     1   275    86     1     9     2 
#>   3|0   3|1   3|2   3|4   4|1   4|2 
#>     6     7     7     1     1     1

gt[ ad1 > 0.2 & ad1 < 0.8 & !is.na(ad1) ] <- "0/1" # Note that we do not know phase here.
gt[ ad2 > 0.2 & ad2 < 0.8 & !is.na(ad2) ] <- "0/1" # Note that we do not know phase here.
table(gt[ ad1 > 0.2 & ad1 < 0.8 ])
#> 
#>   0/1 
#> 52937

Created on 2022-03-15 by the reprex package (v2.0.1)

If you're looking to paste the genotypes back to your VCF data you might try this page.

ettorefedele commented 2 years ago

Hi,

many thanks once again for the reply.

Sorry, I have just realised my question was a bit confusing. I am trying to find a way to tweak vcfR to call heterozygous those SNPs for which the ratio allele1/allele2 and allele2/allele1 is >0.2 So instead of looking at ad1 = allele1/(allele1+allele2) and ad2 = allel2/(allele1+allele2), I would rather look at allele1/allele2 and allele2/allele1 and then paste the genotypes back to my VCF.

knausb commented 2 years ago

Why don't you try the suggestion of creating a minimal reproducible example? You can use my above code as a starting point.

ettorefedele commented 2 years ago

Hi,

I gave it a go and this is what I managed to do:

Load libraries

library(vcfR)

>

> ** vcfR * ***

> This is vcfR 1.12.0.9999

> browseVignettes('vcfR') # Documentation

> citation('vcfR') # Citation

>

library(pinfsc50)

Determine file locations

vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz",

gt<-extract.gt(vcf)

Now, I would need to tell vcfR to call heterozygous all those sites where ad1_over_ad2 and ad2_over_ad1 are both > 0.2, and call the others homozygous either for the REF or for the ALT.

On Wed, Mar 16, 2022 at 9:33 PM Brian Knaus @.***> wrote:

Why don't you try the suggestion of creating a minimal reproducible example? You can use my above code as a starting point.

— Reply to this email directly, view it on GitHub https://github.com/knausb/vcfR/issues/197#issuecomment-1069657237, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUHO32GYLE6HYTTHNBF5Q3DVAJHR5ANCNFSM5QWBCHSQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you authored the thread.Message ID: @.***>

knausb commented 2 years ago

Can you modify the lines of code that I've already provided to get the result you want?

ettorefedele commented 2 years ago

Hi,

it works fine. Many thanks for your help. In the end I did exactly as suggested and computed ad1 and ad2 as described above.

I have one final question though. How can I do a similar thing when my SNPs are phased? I am referring to the following bit in particular:

gt[ ad1 > 0.2 & ad1 < 0.8 & !is.na(ad1) ] <- "0/1" # Note that we do not know phase here. gt[ ad2 > 0.2 & ad2 < 0.8 & !is.na(ad2) ] <- "0/1" # Note that we do not know phase here.

What if we knew phase?

On Fri, Mar 18, 2022 at 5:01 PM Brian Knaus @.***> wrote:

Can you modify the lines of code that I've already provided to get the result you want?

— Reply to this email directly, view it on GitHub https://github.com/knausb/vcfR/issues/197#issuecomment-1072607501, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUHO32E4WQVAU5T4XRB7ORTVASZEXANCNFSM5QWBCHSQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you authored the thread.Message ID: @.***>

knausb commented 2 years ago

In the VCF specification unphased genotypes are coded with the forward slash (0/1) while phased data is treated with the pipe (0|1). We do not know the order of unphased data so we can use either 0/1 or 1/0 to code for heterozygotes. I haven't really worked with phased data before, but I think the short answer is that it will be more complicated. If your genotyping updating efforts result in you replacing a heterozygote with a heterozygote I think you do nothing. But if you want to change a homozygote to a heterozygote then I think that might be something to think about. A phased homozygote (0|0) is actually uninformative about phase. If you reverse the alleles of a homozygote you still have a homozygote. So I don't think you'd have the information necessary to call phase. So I think you'll have to use an unphased heterozygote as your only option. Perhaps you can rephase the data after your update? Does this make sense?

ettorefedele commented 2 years ago

Hi Brian,

sorry I have another question that I can't seem to solve.

I have been using the scripts above to assess the heterozygosity of additional samples - and they work fine except for one sample in particular. See example below:

vcf <- read.vcfR(vcf_file, verbose = FALSE) vcf

ad <- extract.gt(vcf, element = 'AD') allele1 <- masplit(ad, record = 1) allele2 <- masplit(ad, record = 2)

ad1 <- allele1 / (allele1 + allele2) ad2 <- allele2 / (allele1 + allele2)

gt <- extract.gt(vcf)

table(gt[ ad1 > 0.2 & ad1 < 0.8 ])

gt[ ad1 > 0.2 & ad1 < 0.8 & !is.na(ad1) ] <- "0/1" gt[ ad2 > 0.2 & ad2 < 0.8 & !is.na(ad2) ] <- "0/1" table(gt[ ad1 > 0.2 & ad1 < 0.8 ])

I then paste the "new" genotypes back to my vcf file using the following commands:

gt <- matrix( paste(gt, sep=":"), nrow=nrow(gt), dimnames=dimnames(gt) )

is.na(gt[gt == "NA:NA"]) <- TRUE

@.***[,-1] <- gt

So, according to the scripts, all loci for which "ad1 > 0.2 & ad1 < 0.8" and "ad2 > 0.2 & ad2 < 0.8" should have genotype "0/1". However, one sample, which has ad1 = 0.857 and ad2 =0.143, has genotype "0/1" instead of "0/0" in the new version of the vcf file. How is this possible? The coverage for allele 1 is 559 and for allele 2 is 93 - from these I then derived ad1 and ad2. Is there something wrong in my scripts?

Thank you very much!

On Wed, Mar 23, 2022 at 12:23 AM Brian Knaus @.***> wrote:

In the VCF specification unphased genotypes are coded with the forward slash (0/1) while phased data is treated with the pipe (0|1). We do not know the order of unphased data so we can use either 0/1 or 1/0 to code for heterozygotes. I haven't really worked with phased data before, but I think the short answer is that it will be more complicated. If your genotyping updating efforts result in you replacing a heterozygote with a heterozygote I think you do nothing. But if you want to change a homozygote to a heterozygote then I think that might be something to think about. A phased homozygote (0|0) is actually uninformative about phase. If you reverse the alleles of a homozygote you still have a homozygote. So I don't think you'd have the information necessary to call phase. So I think you'll have to use an unphased heterozygote as your only option. Perhaps you can rephase the data after your update? Does this make sense?

— Reply to this email directly, view it on GitHub https://github.com/knausb/vcfR/issues/197#issuecomment-1075778912, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUHO32F52WK3AMDLEEUO5BLVBJQALANCNFSM5QWBCHSQ . You are receiving this because you authored the thread.Message ID: @.***>

knausb commented 2 years ago

Hi, I'm not sure I follow those last few lines of code. Perhaps you should review them to make sure you understand what they're attempting to accomplish. Without a complete example, including data, it's hard to tell. The code appears to change genotypes that are within your specification, ad1 > 0.2 & ad1 < 0.8 and "ad2 > 0.2 & ad2 < 0.8, but does not manipulate genotypes outside this specification. Is it possible that you have heterozygotes in the data you began with?

ettorefedele commented 2 years ago

Hi, thanks for the reply. Yes - that is exactly what the code is supposed to do. The locus is indeed heterozygous to begin with, but I thought that the script could somehow be used to convert also heterozygous sites back to homozygous in case they do not fall within the specified values, ad1 > 0.2 & ad1 < 0.8 and ad2 >0.2 & ad2 < 0.8. Is there a way to achieve this?

I thought of adding the following two lines to the scripts but it doesn't solve the problem: gt[ ad1 < 0.2 & ad1 > 0.8 & !is.na(ad1) ] <- "0/0" gt[ ad2 < 0.2 & ad2 > 0.8 & !is.na(ad2) ] <- "0/0"

What I am trying to do is basically telling vcfR to call heterozygous all the loci for which ad1 > 0.2 & ad1 < 0.8 and ad2 >0.2 & ad2 < 0.8, and homozygous if ad1 and ad2 do not fall within the range. Thank you very much again for your help!

On Thu, Jun 23, 2022 at 8:32 PM Brian Knaus @.***> wrote:

Hi, I'm not sure I follow those last few lines of code. Perhaps you should review them to make sure you understand what they're attempting to accomplish. Without a complete example, including data, it's hard to tell. The code appears to change genotypes that are within your specification, ad1 > 0.2 & ad1 < 0.8 and "ad2 > 0.2 & ad2 < 0.8, but does not manipulate genotypes outside this specification. Is it possible that you have heterozygotes in the data you began with?

— Reply to this email directly, view it on GitHub https://github.com/knausb/vcfR/issues/197#issuecomment-1164790490, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUHO32G5PJCZGMDPSUQTFMTVQS3T3ANCNFSM5QWBCHSQ . You are receiving this because you authored the thread.Message ID: @.***>

knausb commented 2 years ago

Can you initialize the matrix as homozygous

gt[!is.na(gt)] <- "0/0"

and then add your heterozygotes?

ettorefedele commented 2 years ago

Thanks, that works for that specific locus that is now called as homozygous - as it should be. So I first homogenise my matrix using the script you have just provided and then add my heterozygotes:

gt <- extract.gt(vcf) gt[!is.na(gt)] <- "0/0"

gt[ ad1 > 0.2 & ad1 < 0.8 & !is.na(ad1) ] <- "0/1" gt[ ad2 > 0.2 & ad2 < 0.8 & !is.na(ad2) ] <- "0/1" table(gt[ ad1 > 0.2 & ad1 < 0.8 ])

However, the problem now is that when I initialize the matrix as homozygous (i.e. gt[!is.na(gt)] <- "0/0") I should also add the option to have homozygous for the ALT (i.e. "1/1"). In fact, loci can be homozygous for the REF and for the ALT. How can I achieve that?

On Fri, Jun 24, 2022 at 8:45 PM Brian Knaus @.***> wrote:

Can you initialize the matrix as homozygous

gt[!is.na(gt)] <- "0/0"

and then add your heterozygotes?

— Reply to this email directly, view it on GitHub https://github.com/knausb/vcfR/issues/197#issuecomment-1165885028, or unsubscribe https://github.com/notifications/unsubscribe-auth/AUHO32HVQASSHNMK4XSJZFLVQYF5BANCNFSM5QWBCHSQ . You are receiving this because you authored the thread.Message ID: @.***>

knausb commented 2 years ago

Hi @ettorefedele , I'm not going to write your code for you. That's your responsibility. You appear to be trying to create your own genotype caller. I suggest you look into how authors of other genotype callers have addressed these issues.