lh3 / bioawk

BWK awk modified for biological data
590 stars 120 forks source link

Make array of VCF info and GFF group field. #6

Open ghuls opened 12 years ago

ghuls commented 12 years ago

GTF file:

$ zcat ./test.gtf.gz
chr21   hg19_knownGene  exon    33026871    33027740    0.000000    -   .   gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
chr21   hg19_knownGene  exon    33030247    33030540    0.000000    -   .   gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
chr21   hg19_knownGene  exon    33031710    33031813    0.000000    -   .   gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
chr21   hg19_knownGene  start_codon 33032083    33032085    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33032083    33032154    0.000000    +   0   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33031935    33032154    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33036103    33036199    0.000000    +   0   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33036103    33036199    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33038762    33038831    0.000000    +   2   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33038762    33038831    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33039571    33039688    0.000000    +   1   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33039571    33039688    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  CDS 33040784    33040888    0.000000    +   0   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  stop_codon  33040889    33040891    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
chr21   hg19_knownGene  exon    33040784    33041243    0.000000    +   .   gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";

List some fields of GTF with bioawk:

$ ./bioawk -c gff '{ print $feature,$start,$group }' ./test.gtf.gz
exon    33026871    gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
exon    33030247    gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
exon    33031710    gene_id "uc002yoz.1"; transcript_id "uc002yoz.1"; 
start_codon 33032083    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33032083    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33031935    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33036103    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33036103    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33038762    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33038762    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33039571    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33039571    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
CDS 33040784    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
stop_codon  33040889    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3"; 
exon    33040784    gene_id "uc002ypa.3"; transcript_id "uc002ypa.3";

It would be nice if for the GFF group field and the VCF info field, $group and $info, is an array which used each subfeature as key:

$ ./bioawk -c gff '{ print $feature,$start,$group[gene_id],$group[transcript_id] }' ./test.gtf.gz
exon    33026871    uc002yoz.1  uc002yoz.1
exon    33030247    uc002yoz.1  uc002yoz.1
exon    33031710    uc002yoz.1  uc002yoz.1
start_codon 33032083    uc002ypa.3  uc002ypa.3
CDS 33032083    uc002ypa.3  uc002ypa.3
exon    33031935    uc002ypa.3  uc002ypa.3
CDS 33036103    uc002ypa.3  uc002ypa.3
exon    33036103    uc002ypa.3  uc002ypa.3
CDS 33038762    uc002ypa.3  uc002ypa.3
exon    33038762    uc002ypa.3  uc002ypa.3
CDS 33039571    uc002ypa.3  uc002ypa.3
exon    33039571    uc002ypa.3  uc002ypa.3
CDS 33040784    uc002ypa.3  uc002ypa.3
stop_codon  33040889    uc002ypa.3  uc002ypa.3
exon    33040784    uc002ypa.3  uc002ypa.3

At the moment it is not supported and I get the following error:

./bioawk: can't assign to group; it's an array name.
 source line number 1
ghuls commented 12 years ago

This:

$ ./bioawk -c gff '{ print $feature,$start,$group[gene_id],$group[transcript_id] }' ./test.gtf.gz

should be:

$ ./bioawk -c gff '{ print $feature,$start,$group["gene_id"],$group["transcript_id"] }' ./test.gtf.gz

of course.