Closed lbwfff closed 1 year ago
Thanks for reporting! I understand the gtf is from ensembl/gencode. Can you help me with the exact gtf version for this?
Hi saketkc, I use gencode.v35 (GRCh38) as my annotation file.
I'll look into this deeper (I can confirm this is a bug), but since the start codon is a non-canonical one, I think it is safe to ignore such ORFs from your downstream analysis given this is human data.
Hi, I had the same issue with Ensembl GTF GRCh38 v.104. Here is one example:
ENST00000646156_8868027_8871900_100 annotated ENST00000646156 protein_coding ENSG00000074800 ENO1 protein_coding 1 - ATG 8868027-8868057,8870452-8870510,8871891-8871900
Thanks, Anne
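The length at the end of the ORF ID can be cross-checked against the coordinate column. Below is a minimal Python sketch, assuming the coordinates are comma-separated, 1-based inclusive `start-end` intervals as in the ENO1 row above. `orf_length` is a hypothetical helper written for this illustration, not part of ribotricer.

```python
def orf_length(coordinate: str) -> int:
    """Sum the lengths of comma-separated, 1-based inclusive start-end intervals."""
    total = 0
    for interval in coordinate.split(","):
        start, end = (int(value) for value in interval.split("-"))
        total += end - start + 1  # inclusive on both ends (assumption)
    return total


# Coordinate string copied from the ENO1 example in this thread.
coords = "8868027-8868057,8870452-8870510,8871891-8871900"
length = orf_length(coords)
print(length, length % 3)  # 100 1
```

The three intervals contribute 31 + 59 + 10 = 100 nt, which matches the `_100` suffix of the ORF ID and leaves a remainder of 1 when divided by 3.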
Hello,
I have the same issue here using GENCODE (hg19) human release. https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
I installed ribotricer via conda (https://anaconda.org/bioconda/ribotricer), release 1.3.2.
For example, I have an annotated ORF for a protein-coding gene (canonical start codon ATG) with a length of 44:
Thanks, Paul
Here are the files I am using to do the ORF preparation with Ribotricer:
I am using hg19 genome: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/
fasta_file: GRCh37.p13.genome.fa.gz
gtf_file: gencode.v19.annotation.gtf.gz
The command:
ribotricer prepare-orfs --gtf gtf_file \
    --fasta fasta_file \
    --prefix $PROJECT_DIR/ \
    --min_orf_length 30 \
    --start_codons ATG,CTG,GTG
I tried with and without --longest; I have the problem in both cases.
Tested Ribotricer versions: 1.3.2 and 1.3.3
I have tried with latest GENCODE human GTF and fasta file (v42):
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/GRCh38.p13.genome.fa.gz
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_42/gencode.v42.annotation.gtf.gz
I still notice 12,031 ORF candidates whose length is not a multiple of 3.
They all come from annotated CDS.
Here is one example:
ENST00000480643.1_1091500_1091543_44 C1orf159 - chr1 ATG
Here is a screenshot taken from IGV.
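To list such candidates for exclusion from downstream analysis, one option is to read the length from the ORF ID itself. This is a rough Python sketch, assuming every ID follows the `<transcript>_<start>_<end>_<length>` pattern seen in this thread. Both function names are hypothetical and not part of ribotricer.

```python
def length_from_orf_id(orf_id):
    """Return the trailing length field of an ID shaped like <transcript>_<start>_<end>_<length>."""
    return int(orf_id.rsplit("_", 1)[1])


def not_multiple_of_three(orf_ids):
    """Keep only the IDs whose encoded length is not divisible by 3."""
    return [orf_id for orf_id in orf_ids if length_from_orf_id(orf_id) % 3 != 0]


orf_ids = [
    "ENST00000480643.1_1091500_1091543_44",     # from this thread
    "ENST00000600966.1_58350594_58353129_917",  # from this thread
    "HYPOTHETICAL.1_100_399_300",               # made-up in-frame ID for contrast
]
flagged = not_multiple_of_three(orf_ids)
print(flagged)
```

Only the two IDs taken from this thread are flagged (44 and 917 both leave a remainder of 2), while the made-up 300 nt entry passes.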
Thanks for bumping this issue, I will take a look.
Hi @saketkc,
Did you have a chance to take a look? Let me know if I can provide any additional useful information.
Best, Paul
Thanks for the reminder @polklin. I unfortunately haven't had a chance to take a look - I will take a look in the next couple of days.
Hi @polklin, I finally had a chance to look, and the explanation for your observation is simpler than I had first imagined.
This is the annotated CDS for the ENST00000480643.1 transcript:
chr1 HAVANA CDS 1091500 1091543 . - 0 gene_id "ENSG00000131591.18"; transcript_id "ENST00000480643.1"; gene_type "protein_coding"; gene_name "C1orf159"; transcript_type "protein_coding"; transcript_name "C1orf159-217"; exon_number 6; exon_id "ENSE00001952624.1"; level 2; protein_id "ENSP00000464657.1"; transcript_support_level "4"; hgnc_id "HGNC:26062"; tag "mRNA_end_NF"; tag "cds_end_NF"; havana_gene "OTTHUMG00000000745.9"; havana_transcript "OTTHUMT00000327126.3";
When performing the ORF search, ribotricer takes annotated CDS at face value: if a region is annotated as CDS in the GTF, ribotricer does not perform a de novo search for it (based on the start and stop codons provided). In this case the 44 nt region (1091500-1091543, inclusive) is annotated as a CDS, hence the output. Hope this helps. Please let me know if you have any other questions.
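The record above carries the GENCODE tags `cds_end_NF` and `mRNA_end_NF`, which mark a coding region or mRNA whose end could not be confirmed, so an incomplete CDS like this one need not be a multiple of 3. As a hedged illustration, the Python sketch below sums CDS lengths per transcript from GTF lines and reports those not divisible by 3 together with their `*_NF` tags. `incomplete_cds` is a hypothetical helper, not part of ribotricer, and it assumes standard 9-column, tab-separated GTF with 1-based inclusive coordinates.

```python
import re
from collections import defaultdict

TRANSCRIPT_RE = re.compile(r'transcript_id "([^"]+)"')
TAG_RE = re.compile(r'tag "([^"]+)"')


def incomplete_cds(gtf_lines):
    """Map transcript_id -> (total CDS length, sorted *_NF tags) when length % 3 != 0."""
    lengths = defaultdict(int)
    tags = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue  # only CDS features are of interest
        match = TRANSCRIPT_RE.search(fields[8])
        if match is None:
            continue
        tx = match.group(1)
        lengths[tx] += int(fields[4]) - int(fields[3]) + 1
        tags[tx].update(t for t in TAG_RE.findall(fields[8]) if t.endswith("_NF"))
    return {tx: (n, sorted(tags[tx])) for tx, n in lengths.items() if n % 3 != 0}


# The CDS record quoted above, with the attribute column abbreviated.
record = "\t".join([
    "chr1", "HAVANA", "CDS", "1091500", "1091543", ".", "-", "0",
    'gene_id "ENSG00000131591.18"; transcript_id "ENST00000480643.1"; '
    'tag "mRNA_end_NF"; tag "cds_end_NF";',
])
result = incomplete_cds([record])
print(result)
```

For a full annotation, the same function can be given an open text handle, for example one returned by `gzip.open(path, "rt")`.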
Hello @saketkc, thank you very much for your investigation.
I'll check a few other weird examples I noticed when generating candidate ORFs and keep you posted if I have other questions.
Hi saketkc, excuse me again, I am a little confused about the ORF information given by ribotricer. I found that the length of some ORFs is not an integer multiple of 3. For example, the following line:
ENST00000600966.1_58350594_58353129_917 | annotated | translating | 0.616830824 | 2987 | 917 | 214 | 0.701639 | 9.793443
I don't quite understand such an ORF. Why does this happen? Thanks, LeeLee