knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

vcfR2tidy: behavior when > 1 record at same chrom/POS #214

Open gevro opened 2 months ago

gevro commented 2 months ago

Hi, When a vcf has > 1 record with the same chrom and POS, it would be impossible to join back the gt and the fix data frames to each other. ChromKey doesn't help in this situation because Chromkey + POS is no longer unique for a specific record in the original VCF.

knausb commented 2 months ago

Hi Gevro, thanks for the post, but I'm afraid I do not understand. I don't see any reason why one would convert back and forth between 'tidy' and 'VCF'. A major issue I see is that I feel that VCF data is 'wide' while tidy data is 'long'. There are many options to look this up online. Here's one.

http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/

Converting between wide and long data will create a large computational burden, particularly with large data sets. If there is something specific you think we should address here please post a minimal reproducible example.

Thanks! Brian

gevro commented 2 months ago

Sure, happy to explain.

The need is a way to link between the 'fix' and the 'gt' tables after vcfR2tidy. Normally, that is enabled by the 'ChromKey' + POS fields.

The issue is that this relationship becomes ambiguous when the original vcf has > 1 record with the same chrom+POS. For example, a vcf could have one SNP and one Indel, both with the same chrom+POS, but they are different records. Another situation is where there are multiple SNP alleles at the same chrom+POS, but they are listed in separate rows rather than as a single multi-allelic record.

One of my students has found a work-around by creating a new 'variant_ID' column for the gt table that combines chromkey+POS+row_number() after grouping by 'Indiv', and for the fix table simply chromkey+POS+row_number().

knausb commented 2 months ago

Hi @gevro , thanks for the clarification! I've tried to create a minimal reproducible example below that I think demonstrates your issue. Could you please let me know if you agree? If you agree, could you please provide a use case to illustrate why someone would want to link between the $fix and $gt slots? I've never thought of this. But there may be an existing solution? But I feel I need to understand the issue better.

Thanks! Brian

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.15.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
data("vcfR_test")
vcfR_test@fix
#>      CHROM POS       ID          REF   ALT      QUAL FILTER
#> [1,] "20"  "14370"   "rs6054257" "G"   "A"      "29" "PASS"
#> [2,] "20"  "17330"   NA          "T"   "A"      "3"  "q10" 
#> [3,] "20"  "1110696" "rs6040355" "A"   "G,T"    "67" "PASS"
#> [4,] "20"  "1230237" NA          "T"   NA       "47" "PASS"
#> [5,] "20"  "1234567" "microsat1" "GTC" "G,GTCT" "50" "PASS"
#>      INFO                               
#> [1,] "NS=3;DP=14;AF=0.5;DB;H2"          
#> [2,] "NS=3;DP=11;AF=0.017"              
#> [3,] "NS=2;DP=10;AF=0.333,0.667;AA=T;DB"
#> [4,] "NS=3;DP=13;AA=T"                  
#> [5,] "NS=3;DP=9;AA=G"
# Fabricate a new variant at an existing position.
vcfR_test@fix <- rbind(vcfR_test@fix, c("20", "1234567", "snp1", "G", "C", "50", "PASS", NA))
vcfR_test@gt <- rbind( vcfR_test@gt, vcfR_test@gt[2, ] )
vcfR_test@fix
#>      CHROM POS       ID          REF   ALT      QUAL FILTER
#> [1,] "20"  "14370"   "rs6054257" "G"   "A"      "29" "PASS"
#> [2,] "20"  "17330"   NA          "T"   "A"      "3"  "q10" 
#> [3,] "20"  "1110696" "rs6040355" "A"   "G,T"    "67" "PASS"
#> [4,] "20"  "1230237" NA          "T"   NA       "47" "PASS"
#> [5,] "20"  "1234567" "microsat1" "GTC" "G,GTCT" "50" "PASS"
#> [6,] "20"  "1234567" "snp1"      "G"   "C"      "50" "PASS"
#>      INFO                               
#> [1,] "NS=3;DP=14;AF=0.5;DB;H2"          
#> [2,] "NS=3;DP=11;AF=0.017"              
#> [3,] "NS=2;DP=10;AF=0.333,0.667;AA=T;DB"
#> [4,] "NS=3;DP=13;AA=T"                  
#> [5,] "NS=3;DP=9;AA=G"                   
#> [6,] NA
# Convert to tidy.
tidy_vcf <- vcfR2tidy(vcfR_test)
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element DP
#> Extracting gt element HQ
lapply(tidy_vcf, head)
#> $fix
#> # A tibble: 6 × 14
#>   ChromKey CHROM     POS ID     REF   ALT    QUAL FILTER    NS    DP AF    AA   
#>      <int> <chr>   <int> <chr>  <chr> <chr> <dbl> <chr>  <int> <int> <chr> <chr>
#> 1        1 20      14370 rs605… G     A        29 PASS       3    14 0.5   <NA> 
#> 2        1 20      17330 <NA>   T     A         3 q10        3    11 0.017 <NA> 
#> 3        1 20    1110696 rs604… A     G,T      67 PASS       2    10 0.33… T    
#> 4        1 20    1230237 <NA>   T     <NA>     47 PASS       3    13 <NA>  T    
#> 5        1 20    1234567 micro… GTC   G,GT…    50 PASS       3     9 <NA>  G    
#> 6        1 20    1234567 snp1   G     C        50 PASS      NA    NA <NA>  <NA> 
#> # ℹ 2 more variables: DB <lgl>, H2 <lgl>
#> 
#> $gt
#> # A tibble: 6 × 8
#>   ChromKey     POS Indiv   gt_GT gt_GQ gt_DP gt_HQ gt_GT_alleles
#>      <int>   <int> <chr>   <chr> <int> <int> <chr> <chr>        
#> 1        1   14370 NA00001 0|0      48     1 51,51 G|G          
#> 2        1   17330 NA00001 0|0      49     3 58,50 T|T          
#> 3        1 1110696 NA00001 1|2      21     6 23,27 G|T          
#> 4        1 1230237 NA00001 0|0      54     7 56,60 T|T          
#> 5        1 1234567 NA00001 0/1      35     4 <NA>  GTC/G        
#> 6        1 1234567 NA00001 0|0      49     3 58,50 G|G          
#> 
#> $meta
#> # A tibble: 6 × 5
#>   Tag   ID    Number Type    Description                
#>   <chr> <chr> <chr>  <chr>   <chr>                      
#> 1 INFO  NS    1      Integer Number of Samples With Data
#> 2 INFO  DP    1      Integer Total Depth                
#> 3 INFO  AF    A      Float   Allele Frequency           
#> 4 INFO  AA    1      String  Ancestral Allele           
#> 5 INFO  DB    0      Flag    dbSNP membership, build 129
#> 6 INFO  H2    0      Flag    HapMap2 membership
# Subset to position with two variants.
tidy_vcf$gt[ tidy_vcf$gt$ChromKey == 1 & tidy_vcf$gt$POS == 1234567, ]
#> # A tibble: 6 × 8
#>   ChromKey     POS Indiv   gt_GT gt_GQ gt_DP gt_HQ gt_GT_alleles
#>      <int>   <int> <chr>   <chr> <int> <int> <chr> <chr>        
#> 1        1 1234567 NA00001 0/1      35     4 <NA>  GTC/G        
#> 2        1 1234567 NA00001 0|0      49     3 58,50 G|G          
#> 3        1 1234567 NA00002 0/2      17     2 <NA>  GTC/GTCT     
#> 4        1 1234567 NA00002 0|1       3     5 65,3  G|C          
#> 5        1 1234567 NA00003 1/1      40     3 <NA>  G/G          
#> 6        1 1234567 NA00003 0/0      41     3 <NA>  G/G

Created on 2024-09-05 with reprex v2.1.1

gevro commented 2 months ago

Yup that seems like the right toy example. The next step would be to join the tidy_vcf$fix with tidy_vcf$gt based on ChromKey + POS columns, and that would cause mispairing of variants between fix and gt tables because there is > 1 variant with the same ChromKey + POS combination.

The use case would be if we had some set of filters we want to apply based on the fix table, and then we want to filter the gt table based on those filters. The way we would do it is to see which rows pass the filters in the fix table, then get the IDs of those variants, and then filter those out from the gt table. But if ChromKey + POS cannot be reliably used together to do that, that is where the issue happens.

knausb commented 2 months ago

Let's refer to the 'toy example' as a reproducible example. I think that's a good concept! Thank for agreeing that this example appears relevant. I think we could benefit from some more specifics on the 'use case'. For example, if you want to filter on the fix table why not perform this operation on the vcfR object prior to using vcfR2tidy()?

gevro commented 2 months ago

The reason is that creating complex filters and doing manipulations of the data that can then be filtered is much easier to do with tidy code.

knausb commented 2 months ago

Hi @gevro , I really need some specifics here, such as a 'reproducible example.' There is an entire 'tidyverse' of code out there, so simply asking for something 'tidy' is very vague. I can propose that an 'index' or 'primary key' may help. But because I have no idea of what you want to do, I have no way to address this. If you help to build a reproducible example it will provide specifics such as what function you may want to use, what the function's output is, and how that output may be used for subsequent steps. Will one function be used or will a series of functions be used? I feel I've provided plenty of examples, such as vcfR_documentation, so I think you have plenty of examples to build on. Think about an example that I can use to validate that any changes I may make actually address your desired functionality. Please help build a reproducible example with as many specifics as you can provide so that I might be able to help you.

tobyaicher commented 2 months ago

Hi,

I specific example would be if you have 3 samples in your tidy_vcf$gt object and there is a column "score". One function I might want is to sum the "score" column for each site across the samples in the gt object and then add this combined score to the tidy_vcf$fix object. I could then filter the fix sites based on this score. Here is an example of this:

`library(vcfR) library(tidyverse) data("vcfR_test")

tidy_vcf <- vcfR2tidy(vcfR_test)

tidy_vcf$gt$score = 1

tidy_vcf$gt %>% select("POS", "Indiv", "score") %>% pivot_wider(names_from = Indiv, values_from = score) %>% mutate(combined_score = rowSums(select(.,-POS))) %>% select(POS, combined_score) %>% left_join(tidy_vcf$fix, by = "POS")`

You can imagine more complex functions i.e. run fischer method to combine p-values from all the samples from a gt object and add this combined p-value to the tidy_vcf$fix object to then filter the fix object.

To do this however you have to resolve the fact that when you add the combined score from the gt object to the fix object you currently need to have a unique chrom/pos combination. (i.e. left_join using chrom and pos). Ideally there would be a unique variant ID that doesn't just rely on chrom and pos.

Thank you, let me know if this could use any clarification.

Best, Toby

knausb commented 2 months ago

Hi Toby, thank you for your example! That helps so much that I'm going to build on it! I have added 'VariantKey' to the vcfR tidy functions. Does this provides the functionality you're looking for?

library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.15.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
library(tidyverse)
data("vcfR_test")
tidy_vcf <- vcfR2tidy(vcfR_test)
#> Extracting gt element GT
#> Extracting gt element GQ
#> Extracting gt element DP
#> Extracting gt element HQ
tidy_vcf$gt$score = 1
#tidy_vcf$gt[1:3, ]
tidy_vcf$gt %>% 
  select("POS", "Indiv", "score") %>% 
  pivot_wider(names_from = Indiv, values_from = score) %>% 
  mutate(combined_score = rowSums(select(.,-POS))) %>% 
  select(POS, combined_score) %>% 
  left_join(tidy_vcf$fix, by = "POS")
#> # A tibble: 5 × 16
#>      POS combined_score ChromKey CHROM ID    REF   ALT    QUAL FILTER VariantKey
#>    <int>          <dbl>    <int> <chr> <chr> <chr> <chr> <dbl> <chr>       <int>
#> 1 1.44e4              3        1 20    rs60… G     A        29 PASS            1
#> 2 1.73e4              3        1 20    <NA>  T     A         3 q10             2
#> 3 1.11e6              3        1 20    rs60… A     G,T      67 PASS            3
#> 4 1.23e6              3        1 20    <NA>  T     <NA>     47 PASS            4
#> 5 1.23e6              3        1 20    micr… GTC   G,GT…    50 PASS            5
#> # ℹ 6 more variables: NS <int>, DP <int>, AF <chr>, AA <chr>, DB <lgl>,
#> #   H2 <lgl>

tidy_vcf$gt %>% select("VariantKey", "Indiv", "gt_DP") %>%
  group_by(Indiv) %>%
  summarize(gt_DP_sum = sum(gt_DP, na.rm = TRUE)) 
#> # A tibble: 3 × 2
#>   Indiv   gt_DP_sum
#>   <chr>       <int>
#> 1 NA00001        21
#> 2 NA00002        19
#> 3 NA00003        17

Created on 2024-09-16 with reprex v2.1.1

I've changed your choice to create and use 'score' with my choice to use 'DP' because it is already existing in the example dataset. I have also avoided the use of pivot_wider() because we already used vcfR2tidy() to make the data 'long' and then made it wide again. I am concerned that this will create a performance cost.

If you are interested in testing this, you should be able to build a development version of the package with the below command.

devtools::install_github(repo="knausb/vcfR@varkey")

Please let me know what you think. Thanks!

knausb commented 2 months ago

I was concerned about performance in the previous post. So I decided to do some benchmarking.

library(microbenchmark)
library(tidyverse)
library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.15.0 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
data(vcfR_test)

tidy_v1 <- function(vcf){
  tidy_vcf <- vcfR2tidy(vcf, verbose = FALSE)
  tidy_vcf$gt$score = 1
  tidy_result <- tidy_vcf$gt %>% 
    select("POS", "Indiv", "score") %>% 
    pivot_wider(names_from = Indiv, values_from = score) %>% 
    mutate(combined_score = rowSums(select(.,-POS))) %>% 
    select(POS, combined_score) %>% left_join(tidy_vcf$fix, by = "POS")
  tidy_result
}

tidy_v2 <- function(vcf){
  tidy_vcf <- vcfR2tidy(vcf, verbose = FALSE)
  tidy_result <- tidy_vcf$gt %>% 
    select("VariantKey", "Indiv", "gt_DP") %>%
    group_by(VariantKey) %>%
    summarize(gt_DP_mean = sum(gt_DP, na.rm = TRUE)) %>% 
    left_join(tidy_vcf$fix, by = "VariantKey")
  tidy_result
}

untidy_v1 <- function(vcf){
  dp <- extract.gt(vcf, as.numeric = TRUE)
  my_ret <- data.frame(vcf@fix, VariantSum = apply(dp, MARGIN = 1, sum, na.rm = T))
  my_ret
}

res <- microbenchmark(tidy_v1(vcfR_test), tidy_v2(vcfR_test), untidy_v1(vcfR_test),
                      times=100L)
print(res)
#> Unit: microseconds
#>                  expr       min         lq       mean     median         uq
#>    tidy_v1(vcfR_test) 62315.041 64936.2715 68311.0446 67061.0520 70810.0165
#>    tidy_v2(vcfR_test) 56027.712 58039.4035 61309.5122 59641.6935 62310.3335
#>  untidy_v1(vcfR_test)   383.944   458.9475   560.4468   536.8405   554.5365
#>         max neval cld
#>   96067.258   100 a  
#>  127709.252   100  b 
#>    4764.383   100   c

library(ggplot2)
p <- ggplot(data = res, mapping = aes( x = expr, y = time))
p + geom_violin() + theme_bw()

Created on 2024-09-16 with reprex v2.1.1

The version 'tidy_v1' includes a 'pivot_wider' call to convert our long data to wide data. The version 'tidy_v2' avoids the call to 'pivot_wider'. The version 'untidy_v1' may be considered a 'base R' flavor. By avoiding the use of 'pivot_wider' we see a decrease in execution time, leading to the suggestion that this should be considered 'unrecommended' if an alternate path is an option. The version 'untidy_v1' is 2 orders of magnitude faster. If folks are interested inn tidy options, we're trying to provide those! But if performance is a priority the base options are still best. Please let me know if you have opinions here!

tobyaicher commented 2 months ago

Thank you so much for looking into this! Yes VariantKey is exactly the functionality I was looking for - I downloaded your development version and tested it and it gave the same results as my own solution (where I used CHROM,POS, and rownumbers to create unique variant ID's).

I agree pivot wider is actually not necessary and using group_by before summarize is the best way to do this.

Noted that base R is faster - in my case performance is not much of an issue so prefer to use tidy because more concise and easier code so appreciate the tidy option.

Thanks again!

Toby

knausb commented 2 months ago

Thank you @tobyaicher ! I will try to use the reproducible example you helped create as a unit test to make sure this functionality remains in the future. I believe I need to update the documentation and do some more tests still. But I think we can consider this to be a part of the next release to CRAN.

Thanks! Brian