tgen / CovGen

Creates a target specific exome_full192.coverage.txt file required by MutSig
MIT License

Malformed step1 BED file due to awk's handling of the OFS variable #13

Closed ning-y closed 3 years ago

ning-y commented 3 years ago

The last awk script of the bash chunk that generates the step1 BED file produces a malformed BED file: the first line is wrongly space-delimited, while the second through last lines are correctly tab-delimited. An MWE is provided at the end of this issue.

https://github.com/tgen/CovGen/blob/fc0fff88b21d5bc2fb5e549da012055231b6a67a/CovGen#L275-L285

I am not experienced with awk at all, but I believe this is due to awk behavior around when the OFS variable is set relative to when the fields are modified. In particular, if I set OFS in a BEGIN block, every line of the output is correctly tab-delimited.

gawk -F'\t' '{ OFS = "\t" ; print $1,$2 - 125,$3 + 125 }' ${TARGETS} | \
    sort -k1,1 -k2,2n | \
    gawk -F'\t' '$2 <= 0 { $2 = 0 } ; { OFS = "\t" ; print $0 }' | \
    bedtools merge -i - | \
    gawk -F'\t' 'BEGIN{OFS = "\t"} { if($2 != 0) { $2 = $2+25 } ;
    $3 = $3-25 ;
    print $0 }' | cat -t
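
A minimal sketch of the suspected cause, assuming the behavior POSIX describes for awk: assigning to a field rebuilds $0 using whatever OFS holds at that moment, so an OFS assignment that comes after the field assignment is too late for the first record. The single input record is made up for illustration, and plain `awk` stands in for `gawk`.

```shell
# One tab-separated input record is enough to show the effect (made-up data).
record=$(printf 'chr1\t100\t200')

# OFS assigned AFTER the field assignment: $0 has already been rebuilt with
# the default OFS (a single space), so this record comes out space-delimited.
late=$(printf '%s\n' "$record" |
    awk -F'\t' '{ $3 = $3 - 25 ; OFS = "\t" ; print $0 }')

# OFS assigned in BEGIN, before any record is read: $0 is rebuilt with tabs.
early=$(printf '%s\n' "$record" |
    awk -F'\t' 'BEGIN { OFS = "\t" } { $3 = $3 - 25 ; print $0 }')

# cat -t renders tabs as ^I, as in the transcripts in this issue.
printf '%s\n' "$late" | cat -t
printf '%s\n' "$early" | cat -t
```

From the second record onward OFS is already a tab in both variants, which would explain why only the first line of the BED file is affected.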

Below is an MWE. The first two commands set up and display a small BED file as test input. The third and fourth commands demonstrate the error described in this issue. The last is the proposed fix, for which I will open a pull request soon.

$ TARGETS=test.bed

$ head ${TARGETS}
1       12080   12251   ref|DDX11L1,ref|LOC102725121,ref|NR_148357,ref|NR_046018,ens|ENST00000515242,ens|ENST00000518655,ens|ENST00000450305,ens|ENST00000456328
1       12595   12802   ref|DDX11L1,ref|LOC102725121,ref|NR_148357,ref|NR_046018,ens|ENST00000515242,ens|ENST00000518655,ens|ENST00000450305,ens|ENST00000456328
1       13163   13658   ref|DDX11L1,ref|LOC102725121,ref|NR_148357,ref|NR_046018,ens|ENST00000515242,ens|ENST00000518655,ens|ENST00000450305,ens|ENST00000456328
1       14620   15015   ref|WASH7P,ref|NR_024540,ens|ENST00000488147,ens|ENST00000538476,ens|ENST00000438504,ens|ENST00000541675,ens|ENST00000423562
1       15795   15914   ref|WASH7P,ref|NR_024540,ens|ENST00000488147,ens|ENST00000438504,ens|ENST00000538476,ens|ENST00000541675,ens|ENST00000423562

$ gawk -F'\t' '{ OFS = "\t" ; print $1,$2 - 125,$3 + 125 }' ${TARGETS} | \
    sort -k1,1 -k2,2n | \
    gawk -F'\t' '$2 <= 0 { $2 = 0 } ; { OFS = "\t" ; print $0 }' | \
    bedtools merge -i - | \
    gawk -F'\t' '{ if($2 != 0) { $2 = $2+25 } ;
    $3 = $3-25 ;
    OFS = "\t" ;
    print $0 }' > out.bed

$ cat -t out.bed
1 11980 12351
1^I12495^I12902
1^I13063^I13758
1^I14520^I15115
1^I15695^I16014
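
A quick way to spot this kind of malformed line without reading `cat -t` output by eye is to count tab-separated fields per line. This is only a sketch: the `bed` variable is a stand-in for the first three lines of the out.bed shown above, and the message format is my own.

```shell
# Stand-in for the malformed out.bed above: first line is space-delimited.
bed=$(printf '1 11980 12351\n1\t12495\t12902\n1\t13063\t13758')

# Report every line that has fewer than three tab-separated fields.
report=$(printf '%s\n' "$bed" |
    awk -F'\t' 'NF < 3 { printf "line %d has %d tab-separated field(s)\n", NR, NF }')

printf '%s\n' "$report"
```

On a well-formed three-column BED file the report is empty.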

$ gawk -F'\t' '{ OFS = "\t" ; print $1,$2 - 125,$3 + 125 }' ${TARGETS} | \
    sort -k1,1 -k2,2n | \
    gawk -F'\t' '$2 <= 0 { $2 = 0 } ; { OFS = "\t" ; print $0 }' | \
    bedtools merge -i - | \
    gawk -F'\t' 'BEGIN{OFS = "\t"} { if($2 != 0) { $2 = $2+25 } ;
    $3 = $3-25 ;
    print $0 }' | cat -t
1^I11980^I12351
1^I12495^I12902
1^I13063^I13758
1^I14520^I15115
1^I15695^I16014
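
An alternative to the BEGIN block, shown only as a sketch and not what the proposed fix uses, is to pass OFS on the command line with `-v`, which awk assigns before it reads any input. The two input records below are typed in by hand to stand in for the output of `bedtools merge` on the first two test intervals, so the example runs without bedtools.

```shell
# Hand-written stand-in for the `bedtools merge` output (padded intervals).
merged=$(printf '1\t11955\t12376\n1\t12470\t12927')

# -v OFS='\t' sets the output separator before the first record is processed,
# so every rebuilt $0 is tab-delimited, including the first one.
out=$(printf '%s\n' "$merged" |
    awk -F'\t' -v OFS='\t' '{ if ($2 != 0) { $2 = $2 + 25 } ;
    $3 = $3 - 25 ;
    print $0 }')

printf '%s\n' "$out" | cat -t
```

The second gawk stage of the pipeline assigns `$2 = 0` before it sets OFS, so it looks open to the same problem if the very first sorted record has a non-positive start. The test input here does not exercise that case, so I have not confirmed it.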

For reference, I am on:

$ gawk --version
GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)