loosolab / UROPA

Universal RObust Peak Annotator
https://uropa-manual.readthedocs.io/
MIT License

Unable to annotate the region #12

Closed RadPa closed 2 years ago

RadPa commented 2 years ago

Hi, I am unable to annotate my BED regions, even though I am not receiving any error messages. I am using a RefSeq .gtf, enhanced by AGAT, and the gene_id is in LOC format. UROPA generated the all_hits and final_hits output, but the regions were not annotated. The command and the first output lines were:

uropa -b merged.bed -g .agat.sort.gtf --show_attributes gene_id gene_name --feature_anchor start --distance 20000 10000 --feature gene

XXn_047627.1    6990    9390    peak_1  .   .   gene    16299   18150   -   start   9960    Downstream  0.0 0.0 NA  NA  query_1
XXn_047627.1    16740   17030   peak_2  .   .   gene    16299   18150   -   start   1265    PeakInsideFeature   1.0 0.157   NA  NA  query_1

Can you please help? Thank you Radhika

msbentsen commented 2 years ago

Hi Radhika, It looks like the peaks were correctly annotated, but the attributes given in --show_attributes were not found and are therefore shown as NA in the results. Can you share a few lines of your .agat.sort.gtf? Then I can look into why they are not found. Thanks!

RadPa commented 2 years ago

Hi, Thank you for looking into it. This is the .gtf file I used.

NC_047627.1 Gnomon  mRNA    16300   18139   .   -   .   ID=XM_034259881.1;Parent=LOC117575606;db_xref=GeneID:117575606;gbkey=Gene;gene=LOC117575606;gene_biotype=protein_coding;gene_id=LOC117575606
NC_047627.1 Gnomon  exon    16300   17237   .   -   .   ID=exon-5;Parent=XM_034259881.1;db_xref=GeneID:117575606;exon_number=2;gbkey=mRNA;gene=LOC117575606;gene_id=LOC117575606;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 5 samples with support for all annotated introns;product=uncharacterized LOC117575606%2C transcript variant X1;transcript_id=XM_034259881.1
NC_047627.1 Gnomon  three_prime_UTR 16300   17216   .   -   .   ID=nbis-three_prime_utr-17249;Parent=XM_034259881.1;db_xref=GeneID:117575606;exon_number=1;gbkey=mRNA;gene=LOC117575606;gene_id=LOC117575606;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 5 samples with support for all annotated introns;product=uncharacterized LOC117575606%2C transcript variant X1;transcript_id=XM_034259881.1
NC_047627.1 Gnomon  RNA 16300   18150   .   -   .   ID=XR_004572888.1;Parent=LOC117575606;db_xref=GeneID:117575606;gbkey=Gene;gene=LOC117575606;gene_biotype=protein_coding;gene_id=LOC117575606
NC_047627.1 Gnomon  exon    16300   17087   .   -   .   ID=exon-3;Parent=XR_004572888.1;db_xref=GeneID:117575606;exon_number=3;gbkey=misc_RNA;gene=LOC117575606;gene_id=LOC117575606;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 5 samples with support for all annotated introns;product=uncharacterized LOC117575606%2C transcript variant X2;transcript_id=XR_004572888.1
NC_047627.1 Gnomon  exon    17156   17237   .   -   .   ID=exon-2;Parent=XR_004572888.1;db_xref=GeneID:117575606;exon_number=2;gbkey=misc_RNA;gene=LOC117575606;gene_id=LOC117575606;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 5 samples with support for all annotated introns;product=uncharacterized LOC117575606%2C transcript variant X2;transcript_id=XR_004572888.1
NC_047627.1 Gnomon  CDS 17217   17237   .   -   0   ID=cds-2;Parent=XM_034259881.1;db_xref=GeneID:117575606;exon_number=2;gbkey=CDS;gene=LOC117575606;gene_id=LOC117575606;product=uncharacterized protein LOC117575606;protein_id=XP_034115772.1;transcript_id=XM_034259881.1
NC_047627.1 Gnomon  stop_codon  17217   17219   .   -   0   ID=stop_codon-1;Parent=XM_034259881.1;db_xref=GeneID:117575606;exon_number=2;gbkey=CDS;gene=LOC117575606;gene_id=LOC117575606;product=uncharacterized protein LOC117575606;protein_id=XP_034115772.1;transcript_id=XM_034259881.1
NC_047627.1 Gnomon  exon    17306   18139   .   -   .   ID=exon-4;Parent=XM_034259881.1;db_xref=GeneID:117575606;exon_number=1;gbkey=mRNA;gene=LOC117575606;gene_id=LOC117575606;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 5 samples with support for all annotated introns;product=uncharacterized LOC117575606%2C transcript variant X1;transcript_id=XM_034259881.1

Good Day Radhika

msbentsen commented 2 years ago

Hi Radhika,

Great, thanks for the file - I see the issue. The file you posted is in GFF format (as opposed to GTF), which UROPA is unfortunately not able to read. There is a reference here on the differences between the formats: https://m.ensembl.org/info/website/upload/gff.html. The main difference is in how the 'attribute' column (the ninth column) is formatted: GFF3 uses key=value pairs separated by semicolons, whereas GTF uses a key followed by a quoted value, each pair ending in a semicolon.
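To make the attribute-column difference concrete, here is a minimal Python sketch that guesses whether a ninth-column string is written GFF3-style or GTF-style. The function name `attribute_style` and the regular expressions are assumptions made for illustration; they are not part of UROPA or AGAT, and they do not handle semicolons inside quoted values.

```python
import re

# GTF-style pair: a key, whitespace, then a quoted (or bare) value,
# e.g.  gene_id "LOC117575606"
GTF_PAIR = re.compile(r'^\s*[^\s=]+\s+("[^"]*"|\S+)\s*$')
# GFF3-style pair: key=value, e.g.  gene_id=LOC117575606
GFF3_PAIR = re.compile(r'^\s*[^=\s]+=.*$')


def attribute_style(attributes: str) -> str:
    """Return 'gtf', 'gff3', or 'unknown' for one attribute-column string.

    Hypothetical helper: it splits naively on ';', so it is only a rough
    check and not a validator for either format.
    """
    fields = [f for f in attributes.strip().split(";") if f.strip()]
    if not fields:
        return "unknown"
    if all(GTF_PAIR.match(f) for f in fields):
        return "gtf"
    if all(GFF3_PAIR.match(f) for f in fields):
        return "gff3"
    return "unknown"


if __name__ == "__main__":
    # Attribute string shortened from the file lines posted above
    print(attribute_style("ID=XM_034259881.1;Parent=LOC117575606;gene_id=LOC117575606"))
    # What a GTF reader would expect instead
    print(attribute_style('gene_id "LOC117575606"; gene_name "LOC117575606";'))
```

Running a check like this on the ninth column of a few lines is a quick way to confirm which format a file really is, regardless of its extension.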

I just did a quick search and found this list of options for conversion from GFF to GTF - I didn't try them myself, but maybe they are of help: https://github.com/NBISweden/GAAS/blob/master/annotation/knowledge/gff_to_gtf.md
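For illustration only, the attribute-column part of such a conversion could be sketched in Python as below. This is not how any of the tools in that list work internally, and the function name `gff3_attrs_to_gtf` is hypothetical. It rewrites a single attribute string and nothing else: a real converter also has to deal with the feature hierarchy (Parent links), feature types, and values that contain quotes or semicolons.

```python
from urllib.parse import unquote


def gff3_attrs_to_gtf(attributes: str) -> str:
    """Rewrite 'key=value;key=value' as 'key "value"; key "value";'.

    Hypothetical helper, attribute column only. Values are URL-decoded
    (e.g. %2C becomes a comma); decoded quotes or semicolons are NOT escaped.
    """
    pairs = []
    for field in attributes.strip().split(";"):
        if "=" not in field:
            continue  # skip empty or malformed fields
        key, value = field.split("=", 1)
        pairs.append((key.strip(), unquote(value.strip())))
    # GTF expects gene_id and transcript_id first; the sort is stable,
    # so all other attributes keep their original order.
    rank = {"gene_id": 0, "transcript_id": 1}
    pairs.sort(key=lambda kv: rank.get(kv[0], 2))
    return " ".join(f'{key} "{value}";' for key, value in pairs)


if __name__ == "__main__":
    print(gff3_attrs_to_gtf("ID=cds-2;Parent=XM_1;gene_id=LOC1;transcript_id=XM_1"))
```

For real data, one of the dedicated converters from the linked list is the safer choice; the sketch only shows what changes in the ninth column.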

Hope this helps you out!

BR Mette

RadPa commented 2 years ago

Thank you for the resources and Uropa. Both of them helped.

Good Day Radhika