knausb / vcfR

Tools to work with variant call format files
247 stars 55 forks source link

INFO2df cuts strings with "=" sign #130

Closed thedam closed 5 years ago

thedam commented 5 years ago

INFO2df(vcf) so when I have an info like this: CSQ=C|downstream_gene_variant|MODIFIER|NOC2L|26155|Transcript|NM_015658.3|protein_coding||||||||||rs6672356|1|1752|-1||EntrezGene|||T|T|OK|||1|||0.9999|0.9994|0.9999|1|1|1|0.9999|1|0.9997|||||rs6672356|9.99804e-01,C|synonymous_variant|LOW|SAMD11|148398|Transcript|NM_152486.2|protein_coding|10/14||NM_152486.2:c.1027T>C|NP_689699.2:p.Arg343=|1107|1027|343|R|Cgg/Cgg|rs6672356|1||1||EntrezGene|||T|C|OK|||1|||0.9999|0.9994|0.9999|1|1|1|0.9999|1|0.9997|||||rs6672356|9.99804e-01

only this part is imported: C|downstream_gene_variant|MODIFIER|NOC2L|26155|Transcript|NM_015658.3|protein_coding||||||||||rs6672356|1|1752|-1||EntrezGene|||T|T|OK|||1|||0.9999|0.9994|0.9999|1|1|1|0.9999|1|0.9997|||||rs6672356|9.99804e-01,C|synonymous_variant|LOW|SAMD11|148398|Transcript|NM_152486.2|protein_coding|10/14||NM_152486.2:c.1027T>C|NP_689699.2:p.Arg343

Example vcf file (uploaded as txt, as git doesn't allow other extension) ct.txt

knausb commented 5 years ago

Hi @thedam, thanks for the feedback! I see the following.


library(vcfR)
#> 
#>    *****       ***   vcfR   ***       *****
#>    This is vcfR 1.8.0.9000 
#>      browseVignettes('vcfR') # Documentation
#>      citation('vcfR') # Citation
#>    *****       *****      *****       *****
getwd()
#> [1] "/tmp/Rtmp6288mO/reprex223011edf9f0"
list.files()
#> [1] "reprex_reprex.R"        "reprex_reprex.spin.R"  
#> [3] "reprex_reprex.spin.Rmd"
# Import data.
# reprex() apparently goes to a tempdir.
vcf <- read.vcfR("/home/local/USDA-ARS/knausb/Desktop/sandbox/thedam/ct.vcf")
#> Scanning file to determine attributes.
#> File attributes:
#>   meta lines: 122
#>   header_line: 123
#>   variant count: 1
#>   column count: 10
#> 
Meta line 122 read in.
#> All meta lines processed.
#> gt matrix initialized.
#> Character matrix gt created.
#>   Character matrix gt rows: 1
#>   Character matrix gt cols: 10
#>   skip: 0
#>   nrows: 1
#>   row_num: 0
#> 
Processed variant: 1
#> All variants processed
# Check the INFO column.
vcf@fix[,"INFO"]
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          INFO 
#> "FS=0;MQ=60;QD=33.61;SOR=0.941;DP=16;AD=16;RD=0;CSQ=C|downstream_gene_variant|MODIFIER|NOC2L|26155|Transcript|NM_015658.3|protein_coding||||||||||rs6672356|1|1752|-1||EntrezGene|||T|T|OK|||1|||0.9999|0.9994|0.9999|1|1|1|0.9999|1|0.9997|||||rs6672356|9.99804e-01,C|synonymous_variant|LOW|SAMD11|148398|Transcript|NM_152486.2|protein_coding|10/14||NM_152486.2:c.1027T>C|NP_689699.2:p.Arg343=|1107|1027|343|R|Cgg/Cgg|rs6672356|1||1||EntrezGene|||T|C|OK|||1|||0.9999|0.9994|0.9999|1|1|1|0.9999|1|0.9997|||||rs6672356|9.99804e-01"
INFO2df(vcf)
#>   RD AD DP REP ADrat BaseQRankSum MQ MQRankSum    QD ReadPosRankSum   SOR
#> 1  0 16 16  NA    NA           NA 60        NA 33.61             NA 0.941
#>   FS
#> 1  0
#>                                                                                                                                                                                                                                                                                                                                                CSQ
#> 1 C|downstream_gene_variant|MODIFIER|NOC2L|26155|Transcript|NM_015658.3|protein_coding||||||||||rs6672356|1|1752|-1||EntrezGene|||T|T|OK|||1|||0.9999|0.9994|0.9999|1|1|1|0.9999|1|0.9997|||||rs6672356|9.99804e-01,C|synonymous_variant|LOW|SAMD11|148398|Transcript|NM_152486.2|protein_coding|10/14||NM_152486.2:c.1027T>C|NP_689699.2:p.Arg343
#>   gnomADg gnomADg_AF
#> 1      NA         NA

Created on 2019-04-02 by the reprex package (v0.2.1)

The string in the INFO column is a semi-colon delimits set of key-value pairs that are delimited by the equals sign. The function INFO2df() Uses the keys to make columns for each key and then populates each column with the values. So I see this as expected behaviour.

Perhaps you could provide a bit of context. What are you trying to accomplish. Why do you consider this undesired behaviour?

thedam commented 5 years ago

you are loosing this information in the output table: |1107|1027|343|R|Cgg/Cgg|rs6672356|1||1||EntrezGene|||T|C|OK|||1|||0.9999|0.9994|0.9999|1|1|1|0.9999|1|0.9997|||||rs6672356|9.99804e-01

This is the VCF file annotated with VEP Ensembl tag (CSQ). CSQ tag is a "list" of some values separated by a pipe "|". Synonymous changes written in HGVSp are marekd by equal sign "=". This is correct notation: NP_689699.2:p.Arg343=

so the CSQ tag, describing TWO TRANSCIRPTS, like in my example: CSQ=C|foo|bar|NP_689699.2:p.Arg343Leu|boo|boo2|boo3,C|foo|bar|NP_689699.2:p.Arg343=|boo|boo2|boo3 will be separated incorrectly. With INFO2df() I'll obtain columns:

C   foo   bar   NP_689699.2:p.Arg343Leu   boo  boo2  boo3 
C   foo   bar   NP_689699.2:p.Arg343

instead of

C   foo   bar   NP_689699.2:p.Arg343Leu   boo  boo2  boo3
C   foo   bar   NP_689699.2:p.Arg343=   boo  boo2  boo3

Finally what I want to obtain is the VCF file transformed to tabular form keeping all information separated. Equal "=" sign is fully qualified HGVS notation, so in this case it should be broken in this position.

This CSQ is a very long string, that at the end has to be separated to different rows (separating by commas ","), and to different columns (sperarated by a pipe "|")

:)

thedam commented 5 years ago

So here is what I wanted to do. Before vcfR operation, I had to "manually" replace "=" sign in VCF's CSQ tag. Try it to run with commenting "scenario 1" and uncommenting "scenario 2"

library(vcfR, quietly = T, verbose=F)
library(dplyr, quietly = T, verbose = F)
library(tidyr)
library(stringr)

vcf_file_path <- "ct2.vcf"

########### **scenario 1** ##################
#### replace "=" in "CSQ" tag ##############
######################################
vcf_text  <- readLines(vcf_file_path)
vcf_text_replaced  <- gsub(pattern = "=\\|", replace = "___\\|", x = vcf_text)
vcf_new_file_path=str_replace(vcf_file_path,".vcf",".R.vcf")
writeLines(vcf_text_replaced, con=vcf_new_file_path)
######################################
############# **scenario 2** ################
# vcf_new_file_path <- vcf_file_path 
######################################

vcf<-read.vcfR(vcf_new_file_path, verbose = FALSE)
vcfFix <- as.data.frame(getFIX(vcf), stringsAsFactors = F)

## if VCF has only one variant, than we obtian transposed DF.
if (dim(vcfFix)[2] ==1 ){
  vcfFix<-t(vcfFix)
  rownames(vcfFix)<-1
}

vcf_raw_DF <- cbind(vcfFix, INFO2df(vcf)) #a new temporary DF

gt<-extract_gt_tidy(vcf,verbose = F) #extracting GT 

vcf_raw_DF["gt"]<-gt$gt_GT # adding gt column do our DF

## split CSQ to separate rows
vcf_raw_csq_DF <- vcf_raw_DF %>% separate_rows(CSQ, sep = ",")

### info header 
csqDesc   <- queryMETA(vcf, element="INFO=<ID=CSQ")
csqDesc   <- csqDesc[[1]][4]
csqDesc   <- str_split(csqDesc, ": ")[[1]][2]
csqHeader_list <- str_split(csqDesc, "\\|")[[1]]

vcf_DF <- vcf_raw_csq_DF %>% separate(CSQ, into=csqHeader_list, sep="\\|")
vcf_DF <- vcf_DF %>% arrange(CHROM,POS,REF,ALT)

write.table(vcf_DF, "out.tsv", quote=F,row.names = F, na="", sep = "\t")

ct2.txt

knausb commented 5 years ago

Hi @thedam, Thanks for the detailed response! I think I'm beginning to understand. The catch is that I do not study human, so I know nothing about this. My interpretation of the VCF specification is that in section 1.6.1 subsection 8 says that no equals signs are not permitted. That said, this sounds reasonable to accommodate. And because its human the rest of the field may head this direction.

Is there any documentation for this? Before your post it was my experience was strictly key value pairs. You've provided one example, and that's all I know about this. So I don't really know how general of a solution we need. I have found some info at VEP data formats but did not find an example similar to yours. I've also found Human Genome Variation Soiciety's variant nomenclature but was unsuccessful at finding relevant info. (Again, I do not study human so I'm not familiar with some of these resources.) If you could point me towards some documentation on this it would help determine if I'm addressing your one case or whether I need a more general solution. Thanks!

thedam commented 5 years ago

Hmm, indeed the VCF specification says: INFO - additional information: (String, no white-space, semi-colons, or equals-signs permitted, so you are right:)

The VEP guys follow HGVS nomenclature, so they are tied too.

About silent changes is written here: http://varnomen.hgvs.org/recommendations/protein/variant/substitution/

silent (no change)
NP_003997.1:p.Cys188=
amino acid Cys188 is not changed (DNA level change ..TGC.. to ..TGT..)
NOTE: the description p.= means the entire protein coding region was analysed and no variant was found that changes (or is predicted to change) the protein sequence.

But yes, VCF standards should be first, so you are right. I'll note this issue to VEP guys, I guess they should change notation from: NP_003997.1:p.Cys188= to NP_003997.1:p.Cys188Cys

Thanks for making me aware of such specification's details :) I've posted the new issue under VEP git, they should handle it: https://github.com/Ensembl/ensembl-vep/issues/430