tgen / jetstream_resources

Collection of scripts and README files tracking the source and generation of reference genomes and annotation files
MIT License
3 stars 1 forks source link

Creation of ${GTF_FILE_BASE}.gene.bed and ${GTF_FILE_BASE}.exon.bed needs to operate off column matching #58

Open bryce-turner opened 2 years ago

bryce-turner commented 2 years ago

https://github.com/tgen/jetstream_resources/blob/6ffe2681fef962a8ba0881842aaf1aafd9b4469b/shared_resource_creation_scripts/create_gene_model.sh#L247-L258

We grab the value in column 10, e.g. $10, but this value is not always the gene_id value. For example we have the following for canfam3.1 ensemble 98:

X   ensembl gene    21422   25435   .   +   .   gene_source "ensembl"; gene_biotype "lncRNA"; transcript_id "ENSCAFG00000039510"; transcript_name "ENSCAFG00000039510"; gene_version "2"; gene_name "ENSCAFG00000039510"; gene_id "ENSCAFG00000039510"

Instead we might be able to grab the column for gene_id and add 1. For example:

awk -F '[\t"]' '$1 !~ /^#/ { for (i=1; i<=NF; i++) { f[$i] = i } if (a[$(f["gene_id"]+1)] == "" ) { a[$(f["gene_id"]+1)] = $1 ; b[$(f["gene_id"]+1)] = $4 ; c[$(f["gene_id"]+1)] = $5 ; next } ;
        if ($4 < b[$(f["gene_id"]+1)]) { b[$(f["gene_id"]+1)] = $4 } ;
        if ($5 > c[$(f["gene_id"]+1)]) { c[$(f["gene_id"]+1)] = $5 }
} END {
for (i in a) {
        OFS = "\t" ; print a[i], b[i], c[i], i
}
}' ${GTF_FILE} | sort -k1,1V -k2,2n -k3,3n > ${GTF_FILE_BASE}.gene.bed
PedalheadPHX commented 2 years ago

This should really be based on a program that can read a GTF, maybe look if we can adopt the container we were working to sort out, it has lots of functions to parse gff and gtf files that might simplify this, otherwise you are correct we need to discover the position