knausb / vcfR

Tools to work with variant call format files
248 stars 54 forks source link

Keep only the highest AF and its information from Ion Torrent data #112

Closed Tato14 closed 6 years ago

Tato14 commented 6 years ago

Hi,

I have some vcf file like the one attached. As you may see, I have some detected variants that, due to the fixed output format of Ion Torrent, present all variants for the detected position (Number=A). I would like to transform in to Number=1 and keep only the most frequent variant looking at AF.

Any ideas how to do this? Thanks!

41150_sorted_NC.vcf.zip

knausb commented 6 years ago

Hi Tato14, I'm afraid I do not understand what you're trying to do. Can you elaborate? I've gzipped your file and see this is VCF data.

> vcf <- read.vcfR("41150_sorted_NC.vcf.gz", verbose = FALSE)
> vcf@meta[1]
[1] "##fileformat=VCFv4.2"
> queryMETA(vcf, "=AF")
[[1]]
[1] "FORMAT=ID=AF"                                                           
[2] "Number=A"                                                               
[3] "Type=Float"                                                             
[4] "Description=Allele frequency based on Flow Evaluator observation counts"

[[2]]
[1] "INFO=ID=AF"                                                             
[2] "Number=A"                                                               
[3] "Type=Float"                                                             
[4] "Description=Allele frequency based on Flow Evaluator observation counts"

You have AF both in the INFO column as well as in the GT section. Which are you interested in? The VCF 4.2 specification in section 1.2.2 states that Number=number is supposed to be an integer yet in your file it is a character. Is this a malformed file? You can access the AF data from the GT section as follows.

> af <- extract.gt(vcf, element = "AF")
> af1 <- masplit(af, record = 1, sort = 1)
> head(af1)
                                                                             41150
COSM29313;COSM94984;COSM94985;COSM773                                    0.0322581
chr5_176517985                                                           0.3840000
chr7_128845555                                                           0.1633840
chr8_38308297                                                            0.5486110
COSM39615;COSM4898;COSM4899;COSM4894;COSM4916;COSM4903;COSM4958;COSM4943 0.0365854
chr10_123350073                                                          0.5401070

I hope that helps!

Tato14 commented 6 years ago

Hi @knausb! Thanks for the comment. You guessed more or less right what I was asking. What I would like to obtain is a vcf file with only the highest AF (not best AF - edited) and the rest of the GT associated values.

masplit function kind of solve the problem, but I am not sure if I can do it for the whole GT section at once.

I would like to get a vcf file with entries like (took the first one of the attached file):

chr3    178952074   COSM29313;COSM94984;COSM94985;COSM773   G   A   15.21   PASS    AF=0.0322581;AO=3;DP=93;FAO=3;FDP=93;FDVR=10;FR=.REALIGNEDx0.03226;FRO=90;FSAF=2;FSAR=1;FSRF=44;FSRR=46;FUNC=[{'origPos':'178952074','origRef':'G','normalizedRef':'G','gene':'PIK3CA','normalizedPos':'178952074','normalizedAlt':'A','polyphen':'0.138','gt':'pos','codon':'ATA','coding':'c.3129G>A','grantham':'10.0','transcript':'NM_006218.3','function':'missense','protein':'p.Met1043Ile','location':'exonic','origAlt':'A','exon':'21','CLNACC1':'RCV000155959','CLNSIG1':'Likely_pathogenic','CLNREVSTAT1':'single','CLNID1':'rs121913283','CLNACC2':'RCV000201237','CLNSIG2':'Pathogenic','CLNREVSTAT2':'no_criteria','CLNID2':'rs121913283','CLNACC3':'RCV000420209','CLNSIG3':'Likely_pathogenic','CLNREVSTAT3':'no_criteria','CLNID3':'rs121913283','CLNACC4':'RCV000420901','CLNSIG4':'Likely_pathogenic','CLNREVSTAT4':'no_criteria','CLNID4':'rs121913283','CLNACC5':'RCV000423694','CLNSIG5':'Likely_pathogenic','CLNREVSTAT5':'no_criteria','CLNID5':'rs121913283','CLNACC6':'RCV000426516','CLNSIG6':'Likely_pathogenic','CLNREVSTAT6':'no_criteria','CLNID6':'rs121913283','CLNACC7':'RCV000430907','CLNSIG7':'Likely_pathogenic','CLNREVSTAT7':'no_criteria','CLNID7':'rs121913283','CLNACC8':'RCV000431600','CLNSIG8':'Likely_pathogenic','CLNREVSTAT8':'no_criteria','CLNID8':'rs121913283','CLNACC9':'RCV000433300','CLNSIG9':'Likely_pathogenic','CLNREVSTAT9':'no_criteria','CLNID9':'rs121913283','CLNACC10':'RCV000433967','CLNSIG10':'Likely_pathogenic','CLNREVSTAT10':'no_criteria','CLNID10':'rs121913283','CLNACC11':'RCV000438783','CLNSIG11':'Likely_pathogenic','CLNREVSTAT11':'no_criteria','CLNID11':'rs121913283','CLNACC12':'RCV000441596','CLNSIG12':'Likely_pathogenic','CLNREVSTAT12':'no_criteria','CLNID12':'rs121913283','CLNACC13':'RCV000442493','CLNSIG13':'Likely_pathogenic','CLNREVSTAT13':'no_criteria','CLNID13':'rs121913283'}];FWDB=-0.00672587;FXX=0;HRUN=1;HS;HS_ONLY=0;LEN=1;MLLD=187.855;OALT=A,C,T,C;OID=COSM29313,COSM94984,COSM773,COSM94985;OMAPALT=A,C,T,C;OPOS=178952074,178952074,178952074,178952074;OREF=G,G,G,G;PB=.;PBP=.;QD=0.654146;RBI=0.0724719;REFB=-0.00290631;REVB=0.0721592;RO=90;SAF=2;SAR=1;SRF=44;SRR=46;SSEN=0;SSEP=0;SSSB=0.128262;STB=0.670297;STBP=0.555;TYPE=snp;VARB=0.0678794   GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR   0/1:15:93:93:90:90:3:3:0.0322581:1:2:44:46:1:2:44:46

Thanks!

P.S.:

The VCF 4.2 specification in section 1.2.2 states that Number=number is supposed to be an integer yet in your file it is a character.

If I understand your comment right, I would like to comment that some special cases specify Number as a character (see Section 1.2.2).

Tato14 commented 6 years ago

I modify the GT section with the following script, in case anyone find it useful:

`

vcf <-read.vcfR("41150_sorted_NC.vcf") sample_name <- colnames(vcf@gt)[2] output_name <- paste(sample_name, "_FINAL.vcf.gz", sep = "") gt <- extract.gt(vcf, element = "GT") gq <- extract.gt(vcf, element = "GQ") dp <- extract.gt(vcf, element = "DP") fdp <- extract.gt(vcf, element = "FDP") ro <- extract.gt(vcf, element = "RO") fro <- extract.gt(vcf, element = "FRO") ao <- extract.gt(vcf, element = "AO") fao <- extract.gt(vcf, element = "FAO") af <- extract.gt(vcf, element = "AF") sar <- extract.gt(vcf, element = "SAR") saf <- extract.gt(vcf, element = "SAF") srf <- extract.gt(vcf, element = "SRF") srr <- extract.gt(vcf, element = "SRR") fsar <- extract.gt(vcf, element = "FSAR") fsaf <- extract.gt(vcf, element = "FSAF") fsrf<- extract.gt(vcf, element = "FSRF") fsrr <- extract.gt(vcf, element = "FSRR") m <- cbind(gq,dp,fdp,ro,fro,ao,fao,af,sar,saf,srf,srr,fsar,fsaf,fsrf,fsrr) m2 <- masplit(as.matrix(m), record = 1, sort = 1) m2 <- cbind(gt,m2) FORMAT <- rep(paste("GT","GQ","DP","FDP","RO","FRO","AO","FAO","AF","SAR","SAF","SRF","SRR","FSAR","FSAF","FSRF","FSRR",sep=':'), each=length(rownames(m2))) all_gt <- apply(m2[,1:17], 1, paste, collapse=":") prova <- cbind(FORMAT,all_gt) colnames(prova) <- c("FORMAT", sample_name) rownames(prova) <- NULL vcf@gt <- prova write.vcf(vcf, file = output_name)

`

knausb commented 6 years ago

Hi, thanks for pointing out the exception for the Number field, I overlooked that. And if you have what you need, let's close this. You should be able to use masplit() on large matrices of data. Also, extract.gt() has a parameter extract which you can set to FALSE and it will return everything except the specified element. This may be faster than your example. But it also may require you to reformat your FORMAT column, in which case your example may be better.

Thanks!