smithlabcode / ribotricer

A tool for accurately detecting actively translating ORFs from Ribo-seq data
http://doi.org/djv4
GNU General Public License v3.0

Confusion about orf information #92

Closed lbwfff closed 1 year ago

lbwfff commented 2 years ago

Hi saketkc,

Excuse me again, I am a little confused about the ORF information given by ribotricer. I found that the length of some ORFs is not an integer multiple of 3. For example, the following line:

ENST00000600966.1_58350594_58353129_917 | annotated | translating | 0.616830824 | 2987 | 917 | 214 | 0.701639 | 9.793443

I don't quite understand such an ORF. Why does this happen?

Thanks, LeeLee

saketkc commented 2 years ago

Thanks for reporting! I understand the gtf is from ensembl/gencode. Can you help me with the exact gtf version for this?

lbwfff commented 2 years ago

Hi saketkc, I use GENCODE v35 (GRCh38) as my annotation file.

saketkc commented 2 years ago

I'll look into this deeper (I can confirm this is a bug), but since the start codon is a non-canonical one, I think it is safe to ignore such ORFs from your downstream analysis given this is human data.

annebresciani commented 2 years ago

Hi, I had the same issue with Ensembl GTF GRCh38 v.104. Here is one example:

ENST00000646156_8868027_8871900_100 annotated ENST00000646156 protein_coding ENSG00000074800 ENO1 protein_coding 1 - ATG 8868027-8868057,8870452-8870510,8871891-8871900

Thanks, Anne
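For readers who want to check an entry like this by hand, here is a minimal Python sketch. It assumes the last column is a comma-separated list of 1-based, inclusive `start-end` blocks (which matches the `_100` suffix of the ORF ID above); the helper name `orf_length` is hypothetical and not part of ribotricer.

```python
def orf_length(coordinate: str) -> int:
    """Sum the lengths of comma-separated, 1-based inclusive start-end blocks."""
    total = 0
    for block in coordinate.split(","):
        start, end = (int(x) for x in block.split("-"))
        total += end - start + 1  # inclusive on both ends
    return total


# Coordinate string copied from the ENO1 example above.
coordinate = "8868027-8868057,8870452-8870510,8871891-8871900"
length = orf_length(coordinate)
remainder = length % 3
print(length, remainder)
```

The three blocks are 31 + 59 + 10 = 100 nt, which leaves a remainder of 1 when divided by 3.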

polklin commented 1 year ago

Hello,

I have the same issue here using GENCODE (hg19) human release. https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz

I installed ribotricer via conda (https://anaconda.org/bioconda/ribotricer), release 1.3.2.

For example, I have an annotated ORF for a protein-coding gene (canonical start codon ATG) with a length of 44: [image]

Thanks, Paul

polklin commented 1 year ago

Here are the files I am using to do the ORF preparation with Ribotricer:

I am using hg19 genome: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/

The command:

ribotricer prepare-orfs --gtf gtf_file \
  --fasta fasta_file \
  --prefix $PROJECT_DIR/ \
  --min_orf_length 30 \
  --start_codons ATG,CTG,GTG

I tried with and without --longest and see the problem in both cases. Tested ribotricer versions: 1.3.2 and 1.3.3.

polklin commented 1 year ago

I have tried with latest GENCODE human GTF and fasta file (v42):

wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/GRCh38.p13.genome.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/gencode.v42.annotation.gtf.gz

I still see 12,031 ORF candidates whose length is not a multiple of 3. They all come from annotated CDS.

Here is one example:

Here is a screenshot taken from IGV.

image
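A tally like the one reported above could be reproduced along these lines. This is only a sketch: it assumes the candidate ORF file is tab-separated with header columns named `ORF_type` and `coordinate` (check the header of your own file first), and the three rows below are made-up toy data, not real ribotricer output.

```python
import csv
import io
from collections import Counter


def orf_length(coordinate):
    """Total length of comma-separated, 1-based inclusive start-end blocks."""
    return sum(
        int(end) - int(start) + 1
        for start, end in (block.split("-") for block in coordinate.split(","))
    )


def count_out_of_frame(handle):
    """Count ORFs whose length is not a multiple of 3, grouped by ORF_type."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        if orf_length(row["coordinate"]) % 3 != 0:
            counts[row["ORF_type"]] += 1
    return counts


# Toy input (hypothetical IDs and coordinates); replace with open("your_file.tsv").
example = io.StringIO(
    "ORF_ID\tORF_type\tcoordinate\n"
    "orf_a\tannotated\t100-129\n"        # 30 nt, in frame
    "orf_b\tannotated\t200-243\n"        # 44 nt, not a multiple of 3
    "orf_c\tuORF\t300-311,400-420\n"     # 12 + 21 = 33 nt, in frame
)
counts = count_out_of_frame(example)
print(dict(counts))
```

Grouping by ORF type makes it easy to confirm whether the affected entries really are limited to annotated CDS.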

saketkc commented 1 year ago

Thanks for bumping this issue, I will take a look.

polklin commented 1 year ago

Hi @saketkc,

Did you have a chance to take a look? Let me know if I can provide any additional useful information.

Best, Paul

saketkc commented 1 year ago

Thanks for the reminder @polklin. I unfortunately haven't had a chance to take a look - I will take a look in the next couple of days.

saketkc commented 1 year ago

Hi @polklin, I finally had a chance to look, and the explanation for your observation is simpler than I had first imagined.

This is the annotated CDS for the ENST00000480643.1 transcript:

chr1    HAVANA  CDS 1091500 1091543 .   -   0   gene_id "ENSG00000131591.18"; transcript_id "ENST00000480643.1"; gene_type "protein_coding"; gene_name "C1orf159"; transcript_type "protein_coding"; transcript_name "C1orf159-217"; exon_number 6; exon_id "ENSE00001952624.1"; level 2; protein_id "ENSP00000464657.1"; transcript_support_level "4"; hgnc_id "HGNC:26062"; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000000745.9"; havana_transcript "OTTHUMT00000327126.3";

When performing the ORF search, ribotricer takes annotated CDS at face value: if a region is annotated as CDS in the GTF, ribotricer does not perform a de novo search for it (based on the provided start and stop codons). In this case the 44 nt region (1091500-1091543, inclusive) is annotated as a CDS, hence the output. Hope this helps. Please let me know if you have any other questions.
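To make the arithmetic for the GTF record above concrete, here is a small Python sketch that parses a CDS line, computes its length from the 1-based inclusive coordinates, and lists its `tag` attributes. The helper name `parse_cds_line` is hypothetical and the attribute string is shortened for readability. In GENCODE, the `cds_start_NF` and `cds_end_NF` tags mark a coding region whose start or end could not be confirmed; treating those tags as a hint that a CDS may be incomplete is my reading, not something stated in this thread.

```python
import re


def parse_cds_line(line):
    """Parse one GTF line into feature type, transcript ID, length, and tags."""
    # GTF is tab-separated; splitting on any whitespace also handles
    # copies where tabs were converted to spaces, as in the record above.
    fields = line.split(None, 8)
    if len(fields) < 9:
        raise ValueError("expected 9 GTF columns")
    start, end = int(fields[3]), int(fields[4])
    attributes = fields[8]
    match = re.search(r'transcript_id "([^"]+)"', attributes)
    return {
        "feature": fields[2],
        "transcript_id": match.group(1) if match else None,
        "length": end - start + 1,  # GTF coordinates are inclusive
        "tags": re.findall(r'\btag "([^"]+)"', attributes),
    }


# Shortened copy of the CDS record quoted above.
line = (
    'chr1 HAVANA CDS 1091500 1091543 . - 0 '
    'gene_id "ENSG00000131591.18"; transcript_id "ENST00000480643.1"; '
    'tag "mRNA_end_NF"; tag "cds_end_NF";'
)
record = parse_cds_line(line)
possibly_incomplete = any(t in ("cds_start_NF", "cds_end_NF") for t in record["tags"])
print(record["length"], record["tags"], possibly_incomplete)
```

A transcript with several CDS records would need the lengths summed per `transcript_id` before checking divisibility by 3.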

polklin commented 1 year ago

Hello @saketkc, thank you very much for your investigation.

I'll check a few other odd examples I noticed when generating candidate ORFs and keep you posted if I have other questions.