ksiewert / BetaScan

Genome-wide scan for balancing selection using beta statistic

awk code for generating input files #4

Closed by josieparis 5 years ago

josieparis commented 5 years ago

Hey Katie!

I'm using your awk code (the one you emailed me earlier) to generate the input files from the VCFs, and I think there's a typo:

awk -F "\t|:" '(NR>1) && ($6!=0) && ($6!=1) && ($3==2) {OFS="\t"; print $2,$6*$4,$4}' chr1.AA.APHP.out.frq

should be:

awk -F "\t|:" '(NR>1) && ($6!=0) && ($6!=1) && ($3==2) {OFS="\t"; print $2,$8*$4,$4}' chr1.AA.APHP.out.frq

just to demonstrate: the vcf line looks like this: chr1 4827 chr1_4827 G A 38420.2 PASS AA=G

the vcftools freqs line looks like this: chr1 4827 2 24 G 0.416667 A 0.583333

So A is the derived allele (the ancestral allele is G, per AA=G), and therefore the code should multiply column 8 by the sample size in column 4? That gives:

4827 14 24

not 4827 10 24
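
A quick way to check this on the single record above is sketched below. It is only an illustration: the `frq_line` variable and the `derived_count` helper are made-up names, and it assumes the raw .frq file stores each allele as ALLELE:FREQ separated by tabs, which is why the field separator is tab or colon.

```shell
# Hypothetical one-line .frq record rebuilt from the values quoted above.
# Assumption: vcftools writes each allele as ALLELE:FREQ, tab-separated.
frq_line=$(printf 'chr1\t4827\t2\t24\tG:0.416667\tA:0.583333')

# After splitting on tab or colon:
#   $2 = position, $4 = number of chromosomes sampled,
#   $6 = frequency of the first allele, $8 = frequency of the second allele.
# derived_count RECORD COLUMN prints: position, count from COLUMN, sample size.
derived_count() {
  printf '%s\n' "$1" |
    awk -F '\t|:' -v col="$2" '{ OFS = "\t"; print $2, $col * $4, $4 }'
}

derived_count "$frq_line" 8   # second allele (A, derived here): 4827 14 24
derived_count "$frq_line" 6   # first allele (G, ancestral here): 4827 10 24
```

The products are 13.999992 and 10.000008 before awk's default output formatting rounds them to six significant digits, so the printed counts are 14 and 10.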

thanks! josie

ksiewert commented 5 years ago

Hi Josie,

Yes, if the second allele is the derived allele, then $8*$4 is correct. Good catch! I've updated the FAQ page with the revised command and added a mention of the --derived flag, so that the derived allele is listed second when people use vcftools.

Thank you! Katie

josieparis commented 5 years ago

No worries! Thanks!