smithlabcode / ribotricer

A tool for accurately detecting actively translating ORFs from Ribo-seq data
http://doi.org/djv4
GNU General Public License v3.0

ORF_ID naming confusion #135

Closed Tim-Yu closed 1 year ago

Tim-Yu commented 1 year ago

Hi saketkc,

I hope all is well.

I want to trace an ORF back to its coordinates on the genome, and I thought the ORF_ID and chrom columns would be enough for that. I assumed the ORF_ID is the combination of tx_id, start, end and the ORF length, but the naming does not seem to work that way. May I ask what the naming strategy for ORF_ID is, and where I can find the ORF coordinates? For example:

ENST00000703342.1_1_130854336_130854428_93| overlap_dORF| translating | 0.8277447| 33| 93| 8| 0.25806452| 1.06451613| ENST00000703342.1_1| protein_coding| ENSG00000153310.22_12| CYRIB| protein_coding| chr8| -|

But chr8:130854336-130854428 is FAM49B? [image]

Many many thanks,

Tim
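
For readers with the same question: the example row is consistent with an ORF_ID laid out as transcript ID, start, end, and length, joined by underscores. That layout is inferred from the row in this thread, not taken from the ribotricer documentation, so treat it as an assumption. A minimal sketch (`OrfId` and `parse_orf_id` are hypothetical helper names) that splits from the right, because the transcript ID itself can contain underscores (here `ENST00000703342.1_1`):

```python
from typing import NamedTuple


class OrfId(NamedTuple):
    transcript_id: str
    start: int
    end: int
    length: int


def parse_orf_id(orf_id: str) -> OrfId:
    """Split an ORF_ID into its parts.

    Assumed layout (inferred from the example row, not from ribotricer docs):
    <transcript_id>_<start>_<end>_<length>
    Splitting from the right keeps underscores inside the transcript ID intact.
    """
    transcript_id, start, end, length = orf_id.rsplit("_", 3)
    return OrfId(transcript_id, int(start), int(end), int(length))


orf = parse_orf_id("ENST00000703342.1_1_130854336_130854428_93")
print(orf.transcript_id)
# The chromosome is not part of the ID; it comes from the chrom column of the row.
print(f"chr8:{orf.start}-{orf.end}")
```

In this row, end - start + 1 equals the length field (93), which fits an ORF that lies within a single exon. For an ORF that spans several exons the start and end would presumably be outer bounds only, so the exon structure would still have to come from the annotation.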

saketkc commented 1 year ago

What version of the GTF are you using? Using v108 from Ensembl, I can confirm that ENST00000703342.1 is indeed on chromosome 8:

$ cat Homo_sapiens.GRCh38.108.chr.gtf | grep ENST00000703342
[..truncated..]
8   havana  CDS 129842145   129842205   .   -   1   gene_id "ENSG00000153310"; gene_version "22"; transcript_id "ENST00000703342"; transcript_version "1"; exon_number "14"; gene_name "CYRIB"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CYRIB-231"; transcript_source "havana"; transcript_biotype "protein_coding"; protein_id "ENSP00000515265"; protein_version "1"; tag "basic";
8   havana  stop_codon  129842142   129842144   .   -   0   gene_id "ENSG00000153310"; gene_version "22"; transcript_id "ENST00000703342"; transcript_version "1"; exon_number "14"; gene_name "CYRIB"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CYRIB-231"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "basic";
8   havana  five_prime_utr  130016607   130016727   .   -   .   gene_id "ENSG00000153310"; gene_version "22"; transcript_id "ENST00000703342"; transcript_version "1"; gene_name "CYRIB"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CYRIB-231"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "basic";
8   havana  five_prime_utr  129970943   129970995   .   -   .   gene_id "ENSG00000153310"; gene_version "22"; transcript_id "ENST00000703342"; transcript_version "1"; gene_name "CYRIB"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CYRIB-231"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "basic";
8   havana  five_prime_utr  129904499   129904585   .   -   .   gene_id "ENSG00000153310"; gene_version "22"; transcript_id "ENST00000703342"; transcript_version "1"; gene_name "CYRIB"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CYRIB-231"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "basic";
8   havana  five_prime_utr  129903312   129903350   .   -   .   gene_id "ENSG00000153310"; gene_version "22"; transcript_id "ENST00000703342"; transcript_version "1"; gene_name "CYRIB"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CYRIB-231"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "basic";
8   havana  five_prime_utr  129880449   129880458   .   -   .   gene_id "ENSG00000153310"; gene_version "22"; transcript_id "ENST00000703342"; transcript_version "1"; gene_name "CYRIB"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CYRIB-231"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "basic";
8   havana  three_prime_utr 129839595   129842141   .   -   .   gene_id "ENSG00000153310"; gene_version "22"; transcript_id "ENST00000703342"; transcript_version "1"; gene_name "CYRIB"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; transcript_name "CYRIB-231"; transcript_source "havana"; transcript_biotype "protein_coding"; tag "basic";

A search for FAM49B returns nothing:

$ cat Homo_sapiens.GRCh38.108.chr.gtf | grep FAM49B

I also could not locate it on the UCSC genome browser.
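
A plain `grep` matches the ID anywhere on the line and is sensitive to how the version is written: the Ensembl records above store `transcript_id` and `transcript_version` as separate attributes, while GENCODE appends the version to the ID. A field-aware lookup avoids that difference. The following is a minimal sketch, not part of ribotricer; `parse_attributes` and `transcript_features` are hypothetical helper names, and the parser assumes attribute values contain no semicolons:

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_attributes(field: str) -> Dict[str, str]:
    """Parse the ninth GTF column into a dict (a repeated key keeps its last value)."""
    attrs = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def transcript_features(
    lines: Iterable[str], transcript_id: str
) -> Iterator[Tuple[str, str, int, int, str]]:
    """Yield (chrom, feature, start, end, strand) for one transcript, ignoring versions."""
    base_id = transcript_id.split(".")[0]
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = parse_attributes(cols[8])
        # Compare only the part before the first dot, so "ENST...1" and "ENST..." match.
        if attrs.get("transcript_id", "").split(".")[0] == base_id:
            yield cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6]


# Two records from this thread, with the attribute column shortened.
demo = [
    '8\thavana\tstop_codon\t129842142\t129842144\t.\t-\t0\t'
    'gene_id "ENSG00000153310"; transcript_id "ENST00000703342"; '
    'transcript_version "1"; gene_name "CYRIB";',
    'chr8\tHAVANA\texon\t131028853\t131028898\t.\t-\t.\t'
    'gene_id "ENSG00000153310.14"; transcript_id "ENST00000523514.1"; '
    'gene_name "FAM49B"; exon_number 1;',
]
for record in transcript_features(demo, "ENST00000703342.1"):
    print(record)
```

With a real annotation, an open file handle can be passed in place of `demo`.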

Tim-Yu commented 1 year ago

I am using GENCODE release 19. I wonder whether the ORF_ID is named after the start and end positions, since chr8:130854336-130854428 is outside the annotated ENST00000703342.1.

I want to get the ORF region at the genome level.

Thanks

Tim-Yu commented 1 year ago

No worries, it seems the GTF file is somehow shifted. Thanks for your time.

saketkc commented 1 year ago

Sorry, I am a bit confused. Using v19 gencode:

$ cat gencode.v19.annotation.gtf | grep FAM49B
[..truncated..]
chr8    HAVANA  exon    131028853   131028898   .   -   .   gene_id "ENSG00000153310.14"; transcript_id "ENST00000523514.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM49B"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "FAM49B-013"; exon_number 1; exon_id "ENSE00002100074.1"; level 2; havana_gene "OTTHUMG00000164805.3"; havana_transcript "OTTHUMT00000380405.1";
chr8    HAVANA  exon    130982755   130983241   .   -   .   gene_id "ENSG00000153310.14"; transcript_id "ENST00000523514.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM49B"; transcript_type "retained_intron"; transcript_status "KNOWN"; transcript_name "FAM49B-013"; exon_number 2; exon_id "ENSE00002097805.1"; level 2; havana_gene "OTTHUMG00000164805.3"; havana_transcript "OTTHUMT00000380405.1";
chr8    HAVANA  transcript  130982922   131028802   .   -   .   gene_id "ENSG00000153310.14"; transcript_id "ENST00000518285.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM49B"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "FAM49B-014"; level 2; havana_gene "OTTHUMG00000164805.3"; havana_transcript "OTTHUMT00000380406.1";
chr8    HAVANA  exon    131028616   131028802   .   -   .   gene_id "ENSG00000153310.14"; transcript_id "ENST00000518285.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM49B"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "FAM49B-014"; exon_number 1; exon_id "ENSE00002118237.1"; level 2; havana_gene "OTTHUMG00000164805.3"; havana_transcript "OTTHUMT00000380406.1";
chr8    HAVANA  exon    130982922   130983241   .   -   .   gene_id "ENSG00000153310.14"; transcript_id "ENST00000518285.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM49B"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "FAM49B-014"; exon_number 2; exon_id "ENSE00002101537.1"; level 2; havana_gene "OTTHUMG00000164805.3"; havana_transcript "OTTHUMT00000380406.1";

and there is no transcript with ID ENST00000703342:

$ cat gencode.v19.annotation.gtf | grep ENST00000703342
(no results)

Also, there is no gene named CYRIB:

$ cat gencode.v19.annotation.gtf | grep CYRIB
(no results)

Can you point me to the GTF and FASTA files you used to create the ribotricer index? Ideally these should be the same files you used for mapping (with STAR or any other aligner).
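
One quick sanity check on such a pair of files is to compare the sequence names they use, for example `8` in the Ensembl GTF above versus `chr8` in the GENCODE one. The sketch below is only an illustration with hypothetical helper names (`fasta_names`, `gtf_names`, `compare_names`). It catches naming mismatches only: two assemblies that both call the chromosome `chr8` would pass even though their coordinates differ.

```python
from typing import Dict, Iterable, Set


def fasta_names(lines: Iterable[str]) -> Set[str]:
    """Sequence names from FASTA headers: text after '>' up to the first whitespace."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and len(line.strip()) > 1
    }


def gtf_names(lines: Iterable[str]) -> Set[str]:
    """Values of the first GTF column, skipping comment and blank lines."""
    return {
        line.split("\t", 1)[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def compare_names(fasta_lines: Iterable[str], gtf_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Report which sequence names are shared and which appear in only one file."""
    fa, gtf = fasta_names(fasta_lines), gtf_names(gtf_lines)
    return {"shared": fa & gtf, "gtf_only": gtf - fa, "fasta_only": fa - gtf}


# Toy input: an Ensembl-style FASTA header against a GENCODE-style GTF record.
fasta_demo = [">8 dna:chromosome\n", "ACGT\n"]
gtf_demo = ['chr8\tHAVANA\texon\t131028853\t131028898\t.\t-\t.\tgene_name "FAM49B";\n']
report = compare_names(fasta_demo, gtf_demo)
print(report)
```

With real data, open file handles can be passed instead of the toy lists. An empty `shared` set, as in this toy input, suggests the annotation and the genome sequence come from different sources.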

Tim-Yu commented 1 year ago

I have figured out the problem: I did not point to the same FASTA file throughout the workflow I generated. Thanks