smithlabcode / ribotricer

A tool for accurately detecting actively translating ORFs from Ribo-seq data
http://doi.org/djv4
GNU General Public License v3.0

What does the ORF_ID represent? #147

Closed carajj closed 6 months ago

carajj commented 6 months ago

Hello,

I have run Ribotricer version 1.3.3 and created the index using Gencode v35. Below is the result from the test1_translating_ORFs.tsv file:

ORF_ID: ENST00000420190.6_924432_939291_1074
ORF_type: annotated
status: translating
phase_score: 0.5150787536377128
read_count: 7
length: 1074
valid_codons: 7
valid_codons_ratio: 0.019553072625698324
read_density: 0.019553072625698324
transcript_id: ENST00000420190.6
transcript_type: protein_coding
gene_id: ENSG00000187634.11
gene_name: SAMD11
gene_type: protein_coding
chrom: chr1
strand: +
start_codon: ATG
profile: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, .......]

What does the ORF_ID represent? Is it made up of four parts: the transcript ID, the transcript start, the transcript end, and the ORF length? How can I obtain the genomic start and end positions of the ORF? Is there any information that indicates whether the ORF lies in an intergenic region of the genome?

Thanks a lot,

saketkc commented 6 months ago

Hi @carajj

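Thanks for the question. The ORF_ID has the form transcriptid_firststart_lastend_totallength, where transcriptid is the transcript_id from your annotation GTF (GENCODE v35 in your case), firststart is the start of the first exon within this ORF, lastend is the end of the last exon that is part of this ORF, and totallength is the length of the ORF in nucleotides. In your example these are ENST00000420190.6, 924432, 939291, and 1074. The last value matches the length column, so it is the ORF's own length rather than the genomic distance between firststart and lastend.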
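If you want to pull these four parts out programmatically, here is a minimal sketch. The names `OrfIdParts` and `parse_orf_id` are made up for illustration and are not part of ribotricer. It splits from the right, on the assumption that the last three fields are always integers while the transcript ID itself may contain underscores.

```python
from typing import NamedTuple


class OrfIdParts(NamedTuple):
    """The four fields encoded in an ORF_ID (hypothetical helper type)."""
    transcript_id: str
    first_start: int
    last_end: int
    total_length: int


def parse_orf_id(orf_id: str) -> OrfIdParts:
    # Split on the last three underscores only, so that any underscore
    # inside the transcript ID stays part of the transcript ID.
    transcript_id, first_start, last_end, total_length = orf_id.rsplit("_", 3)
    return OrfIdParts(
        transcript_id, int(first_start), int(last_end), int(total_length)
    )


parts = parse_orf_id("ENST00000420190.6_924432_939291_1074")
print(parts.transcript_id, parts.first_start, parts.last_end, parts.total_length)
```

Note that first_start and last_end only bound the ORF on the genome; they do not tell you where the individual exon parts are.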

You can get the chromosome and the ORF's coordinates from the ribotricer index: look up the row with your ORF_ID and read the chromosome and coordinates columns. See https://github.com/smithlabcode/ribotricer/blob/34bcf7f7c4a19e42b5225641e5eec638376d1eb2/ribotricer/prepare_orfs.py#L357
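As a rough sketch of that lookup in plain Python: the code below assumes the index is a tab-separated file with a header row that includes an ORF_ID column. The column names `chrom` and `coordinate`, the comma-separated `start-end` layout of the coordinate field, and the helper names are all assumptions on my side here, so please check them against the header of your own index file before relying on this.

```python
import csv
from typing import Dict, Iterable, List, Optional, Tuple


def find_orf_row(
    lines: Iterable[str], orf_id: str, id_column: str = "ORF_ID"
) -> Optional[Dict[str, str]]:
    """Return the first row whose id_column equals orf_id, or None."""
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        if row.get(id_column) == orf_id:
            return row
    return None


def parse_coordinate(coordinate: str) -> List[Tuple[int, int]]:
    """Parse 'start-end,start-end,...' into (start, end) pairs.

    The 'start-end' layout is an assumption; adjust it if your index differs.
    """
    blocks = []
    for block in coordinate.split(","):
        start, end = block.split("-")
        blocks.append((int(start), int(end)))
    return blocks


# Example usage (the file name is a placeholder for your own index file):
#
# with open("my_index_candidate_orfs.tsv", newline="") as handle:
#     row = find_orf_row(handle, "ENST00000420190.6_924432_939291_1074")
# if row is not None:
#     print(row["chrom"], parse_coordinate(row["coordinate"]))
```

The scan is linear, which is fine for a handful of ORFs. For many lookups it would be better to load the index once into a dictionary keyed by ORF_ID.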

Hope this helps! Please feel free to reopen with any follow up questions.