knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

filter on both 'fix' and 'gt' lists #116

Closed tderrien closed 6 years ago

tderrien commented 6 years ago

Hello,

First of all, thank you for providing such a great package!

My question is probably naive but I cannot find the answer on the tutorial or previous issues. I have a vcf object that I've converted into 'tibble' called vcft using the dedicated vcfR2tidy() function (I provide an example of the vcft at the end).

My question is how can I filter variants on both the $fix and $gt lists e.g. finding variants that are both a duplication and heterozygous for the 2 individuals (fix list SVTYPE == "DUP" and gt list gt_GT="0/1")?

I read on the tutorial : Note that the fix and gt elements have a ‘ChromKey’ to help coordinate the variants in both structures but I'm not sure how to use this ChromKey.

Thank you for your help,

Best,

Thomas

Here is an overview of the vcft object:

str(vcft)
List of 3
 $ fix :Classes 'tbl_df', 'tbl' and 'data.frame':   122899 obs. of  25 variables:
  ..$ ChromKey : int [1:122899] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ CHROM    : chr [1:122899] "1" "1" "1" "1" ...
  ..$ POS      : int [1:122899] 7187 164530 166902 167967 168136 168159 235666 236839 237929 294575 ...
  ..$ ID       : chr [1:122899] "DUP00000000" "DEL00000001" "DEL00000002" "INV00000003" ...
  ..$ REF      : chr [1:122899] "A" "GAATCGAATCGAATTGAAAAGAATCGAATCGAAAAGAATTGAACTGAATGCAATCAAATCGAAAAACA" "TGAAAAGATTCGAATTGAATCGAATCGAATCACAAAGAATTGAATCGTATCAAATCGAAAAGAATCAAATC" "A" ...
  ..$ ALT      : chr [1:122899] "<DUP>" "G" "T" "<INV>" ...
  ..$ QUAL     : num [1:122899] 0 NA NA NA NA NA NA NA NA NA ...
  ..$ FILTER   : chr [1:122899] "PASS" "PASS" "LowQual" "LowQual" ...
  ..$ CIEND    : chr [1:122899] "-56,56" "-33,33" "-4,4" "-294,294" ...
  ..$ CIPOS    : chr [1:122899] "-56,56" "-33,33" "-4,4" "-294,294" ...
  ..$ CHR2     : chr [1:122899] "1" "1" "1" "1" ...
  ..$ END      : int [1:122899] 8704 164598 166973 7425684 168187 168235 25768879 26101159 25141788 294894 ...
  ..$ PE       : int [1:122899] 10 0 0 2 0 0 5 10 2 2 ...
  ..$ MAPQ     : int [1:122899] 26 60 5 5 28 12 7 46 8 60 ...
  ..$ SR       : int [1:122899] NA 10 9 NA 10 10 NA NA NA NA ...
  ..$ SRQ      : num [1:122899] NA 1 0.966 NA 0.955 ...
  ..$ CONSENSUS: chr [1:122899] NA "TCGAAAAGAATCAAATCAAATCGAATCGAAAAGAATAGAATTGTATTGAAAAGAATCGAATCGAATTGAAAAGAATCGAATCGAATTGATTGAATTGAATCGCAT" "GATCGAATTGAATCGATTCGAAAAGCAACAAATCAAAAAGAATCGAATCAAATGGAAAGAATCAAATCGAATCGAAAAGATCGAATTGAATCGAATTCAAAAGAATAGAATTGTAT" NA ...
  ..$ CE       : num [1:122899] NA 1.79 1.78 NA 1.75 ...
  ..$ CT       : chr [1:122899] "5to3" "3to5" "3to5" "3to3" ...
  ..$ IMPRECISE: logi [1:122899] TRUE FALSE FALSE TRUE FALSE FALSE ...
  ..$ PRECISE  : logi [1:122899] TRUE TRUE TRUE TRUE TRUE TRUE ...
  ..$ SVTYPE   : chr [1:122899] "DUP" "DEL" "DEL" "INV" ...
  ..$ SVMETHOD : chr [1:122899] "EMBL.DELLYv0.7.8" "EMBL.DELLYv0.7.8" "EMBL.DELLYv0.7.8" "EMBL.DELLYv0.7.8" ...
  ..$ INSLEN   : int [1:122899] NA 0 0 NA 0 0 NA NA NA NA ...
  ..$ HOMLEN   : int [1:122899] NA 33 6 NA 23 1 NA NA NA NA ...
 $ gt  :Classes 'tbl_df', 'tbl' and 'data.frame':   245798 obs. of  16 variables:
  ..$ ChromKey     : int [1:245798] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ POS          : int [1:245798] 7187 164530 166902 167967 168136 168159 235666 236839 237929 294575 ...
  ..$ Indiv        : chr [1:245798] "CL100086620_L1_DOGwhbRAAAA.merge" "CL100086620_L1_DOGwhbRAAAA.merge" "CL100086620_L1_DOGwhbRAAAA.merge" "CL100086620_L1_DOGwhbRAAAA.merge" ...
  ..$ gt_GT        : chr [1:245798] "0/1" "0/0" "0/1" "0/0" ...
  ..$ gt_GL        : chr [1:245798] "-4.19695,0,-3.09938" "0,-3.12396,-86.1956" "-11.3732,0,-128.15" "0,-15.7646,-254.01" ...
  ..$ gt_GQ        : int [1:245798] 31 31 114 157 10000 10000 10000 10000 140 102 ...
  ..$ gt_FT        : chr [1:245798] "PASS" "PASS" "PASS" "PASS" ...
  ..$ gt_RC        : int [1:245798] 90 5259 5409 1565242 8732 12233 5409867 5477430 5283390 9678 ...
  ..$ gt_RCL       : int [1:245798] 27 3044 3049 0 5050 7131 0 0 0 5113 ...
  ..$ gt_RCR       : int [1:245798] 55 1776 3575 48579 3988 8065 48579 48579 48579 8310 ...
  ..$ gt_CN        : int [1:245798] 2 2 2 64 2 2 223 226 218 1 ...
  ..$ gt_DR        : int [1:245798] 1 0 0 53 0 0 62 53 47 34 ...
  ..$ gt_DV        : int [1:245798] 2 0 0 0 0 0 2 13 0 0 ...
  ..$ gt_RR        : int [1:245798] 0 27 43 0 33 60 0 0 0 0 ...
  ..$ gt_RV        : int [1:245798] 0 2 8 0 11 30 0 0 0 0 ...
  ..$ gt_GT_alleles: chr [1:245798] "A/<DUP>" "GAATCGAATCGAATTGAAAAGAATCGAATCGAAAAGAATTGAACTGAATGCAATCAAATCGAAAAACA/GAATCGAATCGAATTGAAAAGAATCGAATCGAAAAGAATTGA"| __truncated__ "TGAAAAGATTCGAATTGAATCGAATCGAATCACAAAGAATTGAATCGTATCAAATCGAAAAGAATCAAATC/T" "A/A" ...
 $ meta:Classes 'tbl_df', 'tbl' and 'data.frame':   29 obs. of  5 variables:
  ..$ Tag        : chr [1:29] "INFO" "INFO" "INFO" "INFO" ...
  ..$ ID         : chr [1:29] "CIEND" "CIPOS" "CHR2" "END" ...
  ..$ Number     : chr [1:29] "2" "2" "1" "1" ...
  ..$ Type       : chr [1:29] "Integer" "Integer" "String" "Integer" ...
  ..$ Description: chr [1:29] "PE confidence interval around END" "PE confidence interval around POS" "Chromosome for END coordinate in case of a translocation" "End position of the structural variant" ...
knausb commented 6 years ago

Hi tderrien,

I have not seen VCF data like this before. Could you share what software you used to create it? I'm not saying there's anything wrong with it, just curious.

I'm not sure I follow your question. So let's try to create a minimal reproducible example. And I'm not sure I understand where the 'SVTYPE' data came from, perhaps the INFO column? If we work on the vcfR object the @fix matrix and the @gt matrix will have the same number of rows. So perhaps that would be a good place to start.

library(vcfR)
data("vcfR_test")
# Fabricate an example
vcfR_test@fix[,"INFO"] <- paste(getINFO(vcfR_test), paste("SVTYPE", c("DUP", "DEL", "DUP", "INV", "DUP"), sep = "="), sep = ";")
# Isolate duplicattions
vcfR_test <- vcfR_test[grep("SVTYPE=DUP", getINFO(vcfR_test))]
# Get the genotypes
gt <- extract.gt(vcfR_test)
# Add a homozygous variant for the example
gt[2,] <- "0/0"
gt <- is.het(gt)
# Isolate heterozygous variants
vcf <- vcfR_test[rowSums(gt) > 0,]
vcf
vcft <- vcfR2tidy(vcf)

Note that this doesn't really look like your example. I have no 'SVTYPE' in the fix list. So I'm missing some information. Could you take a look at this example and see if it helps? If not, could you try to help us get an example that matches your data better?

Thanks! Brian

tderrien commented 6 years ago

Hi Brian,

Thank you for your prompt reply! Actually, the .vcf has been produced by a Structural Variant (SV) caller named Delly (version v0.7.8). The SVType information comes from the vcft$fix$SVTYPE field while the genotype is in the vcft$gt so I got some issues combining both. If I understand correctly, you recommend to first work/filter on vcfR object rather than filter on the converted "vcfR tidy" object.

In any case, here is a link to get an example of the .vcf.

Thank you again!

knausb commented 6 years ago

Hi tderrien,

I've heard of delly, just never used it. Maybe I should?

Thanks for sharing your data. After inspecting it I believe we were very close. Although I appear to have missed a critical comma in the line containing the grep statement. I believe the following should get you what you need.

library(vcfR)
vcf <- read.vcfR("CL100086620_CL100090508.merged.vcf.gz")
# Isolate duplications
vcf <- vcf[grep("SVTYPE=DUP", getINFO(vcf)),]
# Get the genotypes
gt <- extract.gt(vcf)
# Add a homozygous variant for the example
gt <- is.het(gt)
# Isolate heterozygous variants
vcf <- vcf[rowSums(gt) > 0,]
vcf
vcft <- vcfR2tidy(vcf)

Is that what you were looking for?

tderrien commented 6 years ago

Hi Brian,

Yes, that's great! Thank you!