Creation of ${GTF_FILE_BASE}.gene.bed and ${GTF_FILE_BASE}.exon.bed needs to operate off column matching

https://github.com/tgen/jetstream_resources/blob/6ffe2681fef962a8ba0881842aaf1aafd9b4469b/shared_resource_creation_scripts/create_gene_model.sh#L247-L258

We grab the value in column 10, e.g. $10, but this value is not always the gene_id value. For example we have the following for canfam3.1 ensemble 98:

X   ensembl gene    21422   25435   .   +   .   gene_source "ensembl"; gene_biotype "lncRNA"; transcript_id "ENSCAFG00000039510"; transcript_name "ENSCAFG00000039510"; gene_version "2"; gene_name "ENSCAFG00000039510"; gene_id "ENSCAFG00000039510"

Instead we might be able to grab the column for gene_id and add 1. For example:

awk -F '[\t"]' '$1 !~ /^#/ { for (i=1; i<=NF; i++) { f[$i] = i } if (a[$(f["gene_id"]+1)] == "" ) { a[$(f["gene_id"]+1)] = $1 ; b[$(f["gene_id"]+1)] = $4 ; c[$(f["gene_id"]+1)] = $5 ; next } ;
        if ($4 < b[$(f["gene_id"]+1)]) { b[$(f["gene_id"]+1)] = $4 } ;
        if ($5 > c[$(f["gene_id"]+1)]) { c[$(f["gene_id"]+1)] = $5 }
} END {
for (i in a) {
        OFS = "\t" ; print a[i], b[i], c[i], i
}
}' ${GTF_FILE} | sort -k1,1V -k2,2n -k3,3n > ${GTF_FILE_BASE}.gene.bed

tgen / jetstream_resources

Creation of ${GTF_FILE_BASE}.gene.bed and ${GTF_FILE_BASE}.exon.bed needs to operate off column matching #58