Closed Tato14 closed 6 years ago
Hi Tato14, I'm afraid I do not understand what you're trying to do. Can you elaborate? I've gzipped your file and see this is VCF data.
> vcf <- read.vcfR("41150_sorted_NC.vcf.gz", verbose = FALSE)
> vcf@meta[1]
[1] "##fileformat=VCFv4.2"
> queryMETA(vcf, "=AF")
[[1]]
[1] "FORMAT=ID=AF"
[2] "Number=A"
[3] "Type=Float"
[4] "Description=Allele frequency based on Flow Evaluator observation counts"
[[2]]
[1] "INFO=ID=AF"
[2] "Number=A"
[3] "Type=Float"
[4] "Description=Allele frequency based on Flow Evaluator observation counts"
You have AF
both in the INFO
column as well as in the GT
section. Which are you interested in? The VCF 4.2 specification in section 1.2.2 states that Number=number
is supposed to be an integer yet in your file it is a character. Is this a malformed file? You can access the AF
data from the GT
section as follows.
> af <- extract.gt(vcf, element = "AF")
> af1 <- masplit(af, record = 1, sort = 1)
> head(af1)
41150
COSM29313;COSM94984;COSM94985;COSM773 0.0322581
chr5_176517985 0.3840000
chr7_128845555 0.1633840
chr8_38308297 0.5486110
COSM39615;COSM4898;COSM4899;COSM4894;COSM4916;COSM4903;COSM4958;COSM4943 0.0365854
chr10_123350073 0.5401070
I hope that helps!
Hi @knausb! Thanks for the comment. You guessed more or less right what I was asking. What I would like to obtain is a vcf file with only the highest AF
(not best AF
- edited) and the rest of the GT
associated values.
masplit
function kind of solve the problem, but I am not sure if I can do it for the whole GT
section at once.
I would like to get a vcf file with entries like (took the first one of the attached file):
chr3 178952074 COSM29313;COSM94984;COSM94985;COSM773 G A 15.21 PASS AF=0.0322581;AO=3;DP=93;FAO=3;FDP=93;FDVR=10;FR=.REALIGNEDx0.03226;FRO=90;FSAF=2;FSAR=1;FSRF=44;FSRR=46;FUNC=[{'origPos':'178952074','origRef':'G','normalizedRef':'G','gene':'PIK3CA','normalizedPos':'178952074','normalizedAlt':'A','polyphen':'0.138','gt':'pos','codon':'ATA','coding':'c.3129G>A','grantham':'10.0','transcript':'NM_006218.3','function':'missense','protein':'p.Met1043Ile','location':'exonic','origAlt':'A','exon':'21','CLNACC1':'RCV000155959','CLNSIG1':'Likely_pathogenic','CLNREVSTAT1':'single','CLNID1':'rs121913283','CLNACC2':'RCV000201237','CLNSIG2':'Pathogenic','CLNREVSTAT2':'no_criteria','CLNID2':'rs121913283','CLNACC3':'RCV000420209','CLNSIG3':'Likely_pathogenic','CLNREVSTAT3':'no_criteria','CLNID3':'rs121913283','CLNACC4':'RCV000420901','CLNSIG4':'Likely_pathogenic','CLNREVSTAT4':'no_criteria','CLNID4':'rs121913283','CLNACC5':'RCV000423694','CLNSIG5':'Likely_pathogenic','CLNREVSTAT5':'no_criteria','CLNID5':'rs121913283','CLNACC6':'RCV000426516','CLNSIG6':'Likely_pathogenic','CLNREVSTAT6':'no_criteria','CLNID6':'rs121913283','CLNACC7':'RCV000430907','CLNSIG7':'Likely_pathogenic','CLNREVSTAT7':'no_criteria','CLNID7':'rs121913283','CLNACC8':'RCV000431600','CLNSIG8':'Likely_pathogenic','CLNREVSTAT8':'no_criteria','CLNID8':'rs121913283','CLNACC9':'RCV000433300','CLNSIG9':'Likely_pathogenic','CLNREVSTAT9':'no_criteria','CLNID9':'rs121913283','CLNACC10':'RCV000433967','CLNSIG10':'Likely_pathogenic','CLNREVSTAT10':'no_criteria','CLNID10':'rs121913283','CLNACC11':'RCV000438783','CLNSIG11':'Likely_pathogenic','CLNREVSTAT11':'no_criteria','CLNID11':'rs121913283','CLNACC12':'RCV000441596','CLNSIG12':'Likely_pathogenic','CLNREVSTAT12':'no_criteria','CLNID12':'rs121913283','CLNACC13':'RCV000442493','CLNSIG13':'Likely_pathogenic','CLNREVSTAT13':'no_criteria','CLNID13':'rs121913283'}];FWDB=-0.00672587;FXX=0;HRUN=1;HS;HS_ONLY=0;LEN=1;MLLD=187.855;OALT=A,C,T,C;OID=COSM29313,COSM94984,COSM773,COSM94985;OMAPALT=A,C,T,C;OPOS=178952074,178952074,178952074,178952074;OREF=G,G,G,G;PB=.;PBP=.;QD=0.654146;RBI=0.0724719;REFB=-0.00290631;REVB=0.0721592;RO=90;SAF=2;SAR=1;SRF=44;SRR=46;SSEN=0;SSEP=0;SSSB=0.128262;STB=0.670297;STBP=0.555;TYPE=snp;VARB=0.0678794 GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR 0/1:15:93:93:90:90:3:3:0.0322581:1:2:44:46:1:2:44:46
Thanks!
P.S.:
The VCF 4.2 specification in section 1.2.2 states that Number=number is supposed to be an integer yet in your file it is a character.
If I understand your comment right, I would like to comment that some special cases specify Number
as a character (see Section 1.2.2).
I modify the GT
section with the following script, in case anyone find it useful:
`
vcf <-read.vcfR("41150_sorted_NC.vcf") sample_name <- colnames(vcf@gt)[2] output_name <- paste(sample_name, "_FINAL.vcf.gz", sep = "") gt <- extract.gt(vcf, element = "GT") gq <- extract.gt(vcf, element = "GQ") dp <- extract.gt(vcf, element = "DP") fdp <- extract.gt(vcf, element = "FDP") ro <- extract.gt(vcf, element = "RO") fro <- extract.gt(vcf, element = "FRO") ao <- extract.gt(vcf, element = "AO") fao <- extract.gt(vcf, element = "FAO") af <- extract.gt(vcf, element = "AF") sar <- extract.gt(vcf, element = "SAR") saf <- extract.gt(vcf, element = "SAF") srf <- extract.gt(vcf, element = "SRF") srr <- extract.gt(vcf, element = "SRR") fsar <- extract.gt(vcf, element = "FSAR") fsaf <- extract.gt(vcf, element = "FSAF") fsrf<- extract.gt(vcf, element = "FSRF") fsrr <- extract.gt(vcf, element = "FSRR") m <- cbind(gq,dp,fdp,ro,fro,ao,fao,af,sar,saf,srf,srr,fsar,fsaf,fsrf,fsrr) m2 <- masplit(as.matrix(m), record = 1, sort = 1) m2 <- cbind(gt,m2) FORMAT <- rep(paste("GT","GQ","DP","FDP","RO","FRO","AO","FAO","AF","SAR","SAF","SRF","SRR","FSAR","FSAF","FSRF","FSRR",sep=':'), each=length(rownames(m2))) all_gt <- apply(m2[,1:17], 1, paste, collapse=":") prova <- cbind(FORMAT,all_gt) colnames(prova) <- c("FORMAT", sample_name) rownames(prova) <- NULL vcf@gt <- prova write.vcf(vcf, file = output_name)
`
Hi, thanks for pointing out the exception for the Number
field, I overlooked that. And if you have what you need, let's close this. You should be able to use masplit()
on large matrices of data. Also, extract.gt()
has a parameter extract
which you can set to FALSE
and it will return everything except the specified element. This may be faster than your example. But it also may require you to reformat your FORMAT
column, in which case your example may be better.
Thanks!
Hi,
I have some vcf file like the one attached. As you may see, I have some detected variants that, due to the fixed output format of Ion Torrent, present all variants for the detected position (
Number=A
). I would like to transform in toNumber=1
and keep only the most frequent variant looking atAF
.Any ideas how to do this? Thanks!
41150_sorted_NC.vcf.zip