sr320 / course-fish546-2016

Summarizing fields from .vcf file #45

jldimond closed 7 years ago

jldimond commented 7 years ago

I'd like to be able to summarize the following fields. I was trying to use VCFtools to do this, but it does not accept the way this .vcf file is formatted. I think just extracting the columns into a new text file would be fine. I feel like I am getting there, but am posting the issue as we discussed yesterday.

[Screenshot: 2016-10-19 at 7:41:02 AM]

An example file is located here:

https://github.com/jldimond/jldimond-fish546-2016/blob/master/analyses/data1_all.vcf

sr320 commented 7 years ago

Here is what you have....

##fileformat=VCFv4.0
##fileDate=2016/09/22
##source=ipyrad_v.0.3.41
##reference=past.fasta
##phasing=unphased
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=CATG,Number=1,Type=String,Description="Base Counts (CATG)">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  101_ddr 102_epi 103_ddr 103_epi 104_ddr 104_epi 105_ddr 105_epi 106_ddr 106_epi 107_ddr 107_epi 108_ddr 108_epi 109_ddr 109_epi 110_ddr 110_epi 111_ddr 111_epi 112_ddr 112_epi 113_epi 114_ddr 114_epi 115_ddr 115_epi 116_ddr 116_epi 117_ddr 117_epi 118_ddr 118_epi 120_epi 121_ddr 121_epi 122_ddr 122_epi 123_ddr 123_epi 124_ddr 124_epi 125_ddr 125_epi 126_ddr 126_epi 127_ddr 127_epi 128_ddr 128_epi 129_ddr 129_epi 130_ddr 130_epi 131_ddr 131_epi 80_ddr  80_epi  81_ddr  81_epi  82_ddr  82_epi  84_ddr  84_epi  85_ddr  85_epi  86_ddr  86_epi  87_ddr  87_epi  88b_ddr 88b_epi 89_ddr  89_epi  90_ddr  90_epi  91_ddr  91_epi  95_ddr  95_epi  96_ddr  96_epi  98_ddr  98_epi  99_ddr  99_epi  w11_ddr w11_epi w1_ddr  w1_epi  w3_ddr  w3_epi
7   0   .   T   .   13  PASS    NS=37;DP=380    GT:CATG 0/0:0,0,10,0    ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,0,10,0    0/0:0,0,23,0    0/0:0,8,0,0 0/0:0,0,12,0    ./.:0,0,0,0 0/0:0,0,7,0 ./.:0,0,0,0 0/0:0,0,6,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,0,6,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,0,14,0    0/0:0,0,6,0 0/0:0,0,6,0 0/0:0,0,18,0    ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,0,7,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,6,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,0,8,0 0/0:0,0,11,0    0/0:0,0,8,0 0/0:0,0,7,0 0/0:0,0,6,0 ./.:0,0,0,0 0/0:0,0,7,0 0/0:0,6,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,0,7,0 0/0:0,0,32,0    0/0:0,0,11,0    ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,0,8,0 ./.:0,0,0,0 0/0:0,0,18,0    0/0:0,0,26,0    ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,0,8,0 0/0:0,6,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,0,11,0    0/0:0,0,10,0    ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,0,9,0 0/0:0,0,8,0 ./.:0,0,0,0 0/0:0,0,10,0    0/0:0,0,6,0 0/0:0,0,8,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,0,10,0    ./.:0,0,0,0 ./.:0,0,0,0
7   1   .   A   .   13  PASS    NS=37;DP=380    GT:CATG 0/0:0,10,0,0    ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,10,0,0    0/0:0,23,0,0    0/0:0,8,0,0 0/0:0,12,0,0    ./.:0,0,0,0 0/0:0,7,0,0 ./.:0,0,0,0 0/0:0,6,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,6,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,14,0,0    0/0:0,6,0,0 0/0:0,6,0,0 0/0:0,18,0,0    ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,7,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,6,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,8,0,0 0/0:0,11,0,0    0/0:0,8,0,0 0/0:0,7,0,0 0/0:0,6,0,0 ./.:0,0,0,0 0/0:0,7,0,0 0/0:0,6,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,7,0,0 0/0:0,32,0,0    0/0:0,11,0,0    ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,8,0,0 ./.:0,0,0,0 0/0:0,18,0,0    0/0:0,26,0,0    ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,8,0,0 0/0:0,0,0,6 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,11,0,0    0/0:0,10,0,0    ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,9,0,0 0/0:0,8,0,0 ./.:0,0,0,0 0/0:0,10,0,0    0/0:0,6,0,0 0/0:0,8,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,10,0,0    ./.:0,0,0,0 ./.:0,0,0,0
7   2   .   A   G   13  PASS    NS=37;DP=378    GT:CATG 0/0:0,10,0,0    ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,10,0,0    0/0:0,23,0,0    0/0:0,0,8,0 0/0:0,12,0,0    ./.:0,0,0,0 0/0:0,7,0,0 ./.:0,0,0,0 1/0:0,3,0,3 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,6,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,13,0,0    0/0:0,6,0,0 0/0:0,6,0,0 0/0:0,17,0,1    ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,7,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,0,6,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,8,0,0 0/0:0,11,0,0    0/0:0,8,0,0 0/0:0,7,0,0 0/0:0,6,0,0 ./.:0,0,0,0 0/0:0,7,0,0 0/0:0,0,6,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,7,0,0 0/0:0,32,0,0    0/0:0,11,0,0    ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 0/0:0,8,0,0 ./.:0,0,0,0 1/1:0,0,0,18    1/1:0,0,0,26    ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 1/0:0,2,0,6 1/1:0,0,5,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 1/1:0,0,0,11    1/1:0,0,0,10    ./.:0,0,0,0 ./.:0,0,0,0 1/1:0,0,0,9 1/1:0,0,0,8 ./.:0,0,0,0 1/1:0,0,0,10    1/1:0,0,0,6 1/1:0,0,0,8 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 ./.:0,0,0,0 1/1:0,0,0,10    ./.:0,0,0,0 ./.:0,0,0,0

Provide an example of what you want to get to for these three lines.

jldimond commented 7 years ago

Ideally it would look something like this for the first three lines and first two samples:

CHROM    101_ddr    102_epi
7        10         0
7        10         0
7        10         0

Really, I only need the first line for each record. Note that for some records the counts at a position are spread across more than one base, so this field needs to be summed. Example: for 0/0:0,1,9,0, I need the sum of 0,1,9,0 = 10.
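One way to sketch the summing described above is a few lines of plain Python. This is only a rough illustration, not the notebook workflow. The function name `summarize_read_counts` and the `first_per_chrom` option are made up for this example. It assumes the sample columns start at the 10th field, that `CATG` is listed in the FORMAT column, and that "first line for each record" means the first line seen for each CHROM value. The demo text is a two-sample excerpt modelled on the file above, with the 0,1,9,0 example swapped in on the second line.

```python
import io


def summarize_read_counts(handle, first_per_chrom=False):
    """Yield [CHROM, depth_1, depth_2, ...] rows from an open VCF text stream.

    Each depth is the sum of the four CATG base counts for that sample.
    """
    seen = set()
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        # Assumption: no field in this file contains spaces, so splitting on
        # any whitespace works for both tab- and space-delimited copies.
        fields = line.split()
        if line.startswith("#CHROM"):
            # Header row: sample names start at the 10th column.
            yield ["CHROM"] + fields[9:]
            continue
        chrom = fields[0]
        if first_per_chrom and chrom in seen:
            continue  # keep only the first line seen for each CHROM value
        seen.add(chrom)
        # Locate CATG within the FORMAT column (e.g. "GT:CATG").
        catg_index = fields[8].split(":").index("CATG")
        row = [chrom]
        for sample in fields[9:]:
            counts = sample.split(":")[catg_index]  # e.g. "0,1,9,0"
            row.append(sum(int(n) for n in counts.split(",")))
        yield row


# Illustrative two-sample excerpt, not the full data1_all.vcf.
DEMO_VCF = (
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t101_ddr\t102_epi\n"
    "7\t0\t.\tT\t.\t13\tPASS\tNS=37;DP=380\tGT:CATG\t0/0:0,0,10,0\t./.:0,0,0,0\n"
    "7\t1\t.\tA\t.\t13\tPASS\tNS=37;DP=380\tGT:CATG\t0/0:0,1,9,0\t./.:0,0,0,0\n"
)

rows = list(summarize_read_counts(io.StringIO(DEMO_VCF)))
for row in rows:
    print("\t".join(str(value) for value in row))
```

For the real file, the same function could be given `open("data1_all.vcf")` instead of the `io.StringIO` demo, and the printed rows redirected to a new text file. Passing `first_per_chrom=True` keeps a single row per CHROM value, if that is what "first line for each record" is meant to be.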

jldimond commented 7 years ago

Here's the workflow I worked on today. I did not push to course repo because ipyrad is running and gitignores are causing desktop to freeze.

https://github.com/jldimond/ipython-notebooks/blob/master/VCF_readcounts.ipynb