Hi there! I re-ran RiboTricer with a TE annotation (gtf). All the samples ran up until this point. I'm having trouble making sense of this error message; please let me know if you could help! The "index" file warning is not an issue, because the same samples ran with the same warning when the gtf was different. It seems the program fails after "WARNING: no periodic read length found... using cutoff 0.418", but no other warnings or errors are reported. In addition, all the reported files and graphs (except for read_length_dist) are empty. The metagene_profile tables are all also exclusively populated with zeroes.
(ribotricer_env) $ ribotricer detect-orfs --bam results/squire/Aza3-A/Aza3-A.bam --ribotricer_index results/ribotricer/ribotricer_candidate_orfs_TE.tsv --prefix results/ribotricer/Aza3-A_TE/Aza3-A_ribotricer --phase_score_cutoff 0.418
May 02 11:28:16 ..... started ribotricer detect-orfs
May 02 11:28:16 ... started parsing ribotricer index file
May 02 11:28:16 ... started inferring experimental design
[W::hts_idx_load3] The index file is older than the data file: results/squire/Aza3-A/Aza3-A.bam.bai
May 02 11:34:20 ... started reading bam file
[W::hts_idx_load3] The index file is older than the data file: results/squire/Aza3-A/Aza3-A.bam.bai
0%| | 0/196983294 [00:00<?, ?reads/s][W::hts_idx_load3] The index file is older than the data file: results/squire/Aza3-A/Aza3-A.bam.bai
May 02 11:44:47 ... started plotting read length distribution
May 02 11:44:48 ... started calculating metagene profiles. This may take a long time...
May 02 11:44:48 ... started plotting metagene profiles
May 02 11:44:48 ... started inferring P-site offsets
WARNING: no periodic read length found... using cutoff 0.418
Hi @singhbhavya,
Problem: When I run ribotricer prepare-orfs, all the candidate ORFs correspond with gene names, which confuses me. I don't see many unannotated regions that don't have names.
How are your TE elements annotated in the gtf?
Question: Does RiboTricer predict ALL candidate ORFs (including repeat elements)? If so, how does it name them? As long as all the predicted ORFs have a start and end site, that's all I need, since I
Yes, but it is dependent on how you generate the index (and thereby what was your input gtf). It would be helpful to know what your annotation (gtf) looks like.
Hi Sir! Please see here - https://labshare.cshl.edu/shares/mhammelllab/www-data/TEtranscripts/TE_GTF/mm39_rmsk_TE.gtf.gz
Hi @singhbhavya,
When I generate the index using the gtf you shared, it looks okay to me:
ORF_ID ORF_type transcript_id transcript_type gene_id gene_name gene_type chrom strand start_codon coordinate
Lx2B2_8387999_8388064_66 novel Lx2B2 assumed_protein_coding Lx2B2 Lx2B2:TE assumed_protein_coding chr1 + AAG 8387999-8388064
Lx2B2_8388080_8388172_93 novel Lx2B2 assumed_protein_coding Lx2B2 Lx2B2:TE assumed_protein_coding chr1 + GTG 8388080-8388172
Have you tried learning the cutoff first?
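In the two index rows shown above, the ORF_ID appears to follow the pattern transcript_start_end_length, with the same start and end repeated in the coordinate column. The following is a minimal sketch for sanity-checking that on your own index, not ribotricer's own parser: `parse_orf_id` is a hypothetical helper, and the single-interval assumption (length equals end minus start plus one) is inferred only from these two rows.

```python
def parse_orf_id(orf_id):
    """Split an ORF_ID of the assumed form <transcript>_<start>_<end>_<length>."""
    # Split from the right so transcript IDs containing underscores stay intact.
    transcript_id, start, end, length = orf_id.rsplit("_", 3)
    return transcript_id, int(start), int(end), int(length)


# The two ORF_IDs quoted in the index excerpt above.
for orf_id in ("Lx2B2_8387999_8388064_66", "Lx2B2_8388080_8388172_93"):
    transcript_id, start, end, length = parse_orf_id(orf_id)
    # For a single-interval ORF, the inclusive span should match the stated length.
    print(orf_id, "consistent:", end - start + 1 == length)
```

An ORF that spans several intervals would not satisfy the span check, so a mismatch there is not necessarily an error.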
My index looks the same as the one you generated; I don't think there's a problem with the index. I'm providing a few more details, in case it's helpful. This is what the bam_summary.txt looks like:
summary:
total_reads: 196983294
unique_mapped: 42718270
qcfail: 0
duplicate: 0
secondary: 137879498
unmapped:0
multi:16385526
length dist:
12: 7243
13: 13751
14: 18484
15: 158495
16: 3939702
17: 955860
18: 278908
19: 326656
20: 299653
21: 223939
22: 152389
23: 151489
24: 159219
25: 267425
26: 718551
27: 1896587
28: 2635753
29: 3829454
30: 10824078
31: 3964525
32: 2213332
33: 2675884
34: 1493565
35: 1063974
36: 618872
37: 1505331
38: 814373
39: 595494
40: 320907
41: 162713
42: 157413
43: 107373
44: 72974
45: 37687
46: 17676
47: 9595
48: 5303
49: 2835
50: 1379
What's super unusual is the ribotricer_protocol, which checks only 4 reads. For the other (gene) annotation, it checks far more than 4.
In total 4 reads checked:
Number of reads explained by "++, --": 2 (0.5000)
Number of reads explained by "+-, -+": 2 (0.5000)
I have RNA-seq data available to me for these samples, and I'll try learning the cutoff. Do you think the cutoff should be different for TE reads compared to gene?
Thank you so, so much for your help so far!! It is very appreciated.
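For readers unfamiliar with the protocol file quoted above, here is a rough sketch of what the two categories count, assuming the labels follow the usual convention of pairing the read strand with the strand of the overlapping annotated feature ("++, --" means they agree, "+-, -+" means they disagree). The function name and the toy input are hypothetical, and this is not ribotricer's actual code.

```python
from collections import Counter


def tally_strand_agreement(pairs):
    """Tally (read_strand, feature_strand) pairs into the two reported categories."""
    counts = Counter()
    for read_strand, feature_strand in pairs:
        if read_strand == feature_strand:
            counts["++, --"] += 1  # read maps to the same strand as the feature
        else:
            counts["+-, -+"] += 1  # read maps to the opposite strand
    total = sum(counts.values())
    # Report (count, fraction) per category, mirroring the protocol file layout.
    return {
        key: (counts[key], counts[key] / total if total else 0.0)
        for key in ("++, --", "+-, -+")
    }


# Toy input with four reads, chosen only to mirror the shape of the 4-read report.
toy_pairs = [("+", "+"), ("-", "-"), ("+", "-"), ("-", "+")]
result = tally_strand_agreement(toy_pairs)
print("In total", sum(c for c, _ in result.values()), "reads checked:")
for key, (n, frac) in result.items():
    print(f'Number of reads explained by "{key}": {n} ({frac:.4f})')
```

With so few reads in the tally, the fractions carry very little information, which is why the number of reads checked matters as much as the split itself.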
Hi @singhbhavya, the read distribution shows a high enrichment of 29-31 nt reads, which is great. The protocol inference seems a bit concerning (could be a bug). If you point me to the bam file (or a subset of it), I will be happy to take a deeper look. Thanks!
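The enrichment mentioned above can be checked directly from the "length dist" block of bam_summary.txt. The snippet below is a small sketch that copies the counts posted earlier in this thread and reports the most common read length and the share of reads in a chosen window; `window_fraction` is a hypothetical helper, not part of ribotricer.

```python
# Read-length counts copied from the bam_summary.txt posted above.
length_dist = {
    12: 7243, 13: 13751, 14: 18484, 15: 158495, 16: 3939702, 17: 955860,
    18: 278908, 19: 326656, 20: 299653, 21: 223939, 22: 152389, 23: 151489,
    24: 159219, 25: 267425, 26: 718551, 27: 1896587, 28: 2635753,
    29: 3829454, 30: 10824078, 31: 3964525, 32: 2213332, 33: 2675884,
    34: 1493565, 35: 1063974, 36: 618872, 37: 1505331, 38: 814373,
    39: 595494, 40: 320907, 41: 162713, 42: 157413, 43: 107373, 44: 72974,
    45: 37687, 46: 17676, 47: 9595, 48: 5303, 49: 2835, 50: 1379,
}


def window_fraction(dist, low, high):
    """Fraction of all counted reads whose length lies in [low, high]."""
    total = sum(dist.values())
    in_window = sum(n for length, n in dist.items() if low <= length <= high)
    return in_window / total


mode_length = max(length_dist, key=length_dist.get)
frac = window_fraction(length_dist, 29, 31)
print(f"most common read length: {mode_length} nt")
print(f"fraction of reads at 29-31 nt: {frac:.3f}")
```

The same helper can be pointed at other windows, for example the 16 nt peak, to see how much of the library falls outside the expected footprint sizes.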
Hi there, Thank you so much for your help! Let me know where I can email you a BAM file. The BAM was aligned to the mm39 genome with the GTF I sent above. I can send you my ribotricer index, the BAM file, and all the commands I used to generate the genome index and do the alignments.
Hi there, Apologies for the second comment, but I've noticed another thing regarding the protocol inference. In infer_protocol.py, I noticed this line:
for read in bam.fetch(until_eof=True):
This means that the program is reading the entire file & all reads in the file to infer the protocol, right? But in my case, I see that when I use the gene GTF, it only reads 20,000 reads:
(riboorf) [singhb22@li03c02 ribotricer]$ head *A/*_ribotricer_protocol.txt
==> Aza1-A/Aza1-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
Number of reads explained by "++, --": 10002 (0.5000)
Number of reads explained by "+-, -+": 10003 (0.5000)
==> Aza2-A/Aza2-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
Number of reads explained by "++, --": 10012 (0.5005)
Number of reads explained by "+-, -+": 9993 (0.4995)
==> Aza3-A/Aza3-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
Number of reads explained by "++, --": 10035 (0.5016)
Number of reads explained by "+-, -+": 9970 (0.4984)
==> UT1-A/UT1-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
Number of reads explained by "++, --": 10006 (0.5002)
Number of reads explained by "+-, -+": 9999 (0.4998)
==> UT2-A/UT2-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
Number of reads explained by "++, --": 10005 (0.5001)
Number of reads explained by "+-, -+": 10000 (0.4999)
==> UT3-A/UT3-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
Number of reads explained by "++, --": 10007 (0.5002)
Number of reads explained by "+-, -+": 9998 (0.4998)
And when I used the TE GTF/annotation, it reads 4 reads:
(riboorf) [singhb22@li03c02 ribotricer]$ head *_TE/*_ribotricer_protocol.txt
==> Aza1-A_TE/Aza1-A_ribotricer_protocol.txt <==
In total 4 reads checked:
Number of reads explained by "++, --": 2 (0.5000)
Number of reads explained by "+-, -+": 2 (0.5000)
==> Aza2-A_TE/Aza2-A_ribotricer_protocol.txt <==
In total 4 reads checked:
Number of reads explained by "++, --": 2 (0.5000)
Number of reads explained by "+-, -+": 2 (0.5000)
==> Aza3-A_TE/Aza3-A_ribotricer_protocol.txt <==
In total 4 reads checked:
Number of reads explained by "++, --": 2 (0.5000)
Number of reads explained by "+-, -+": 2 (0.5000)
==> UT1-A_TE/UT1-A_ribotricer_protocol.txt <==
In total 4 reads checked:
Number of reads explained by "++, --": 2 (0.5000)
Number of reads explained by "+-, -+": 2 (0.5000)
==> UT2-A_TE/UT2-A_ribotricer_protocol.txt <==
In total 4 reads checked:
Number of reads explained by "++, --": 2 (0.5000)
Number of reads explained by "+-, -+": 2 (0.5000)
==> UT3-A_TE/UT3-A_ribotricer_protocol.txt <==
In total 4 reads checked:
Number of reads explained by "++, --": 2 (0.5000)
Number of reads explained by "+-, -+": 2 (0.5000)
It reads the first 20000 reads by default, so the output for the gene annotation looks correct.
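On the question of whether a loop over `bam.fetch(until_eof=True)` must read the whole file: an iterator is consumed lazily, so a loop can stop early once it has checked enough reads. The sketch below illustrates that pattern with plain Python iterators standing in for the BAM stream. `check_reads`, `is_informative`, and the toy streams are all hypothetical, and which reads ribotricer actually counts is an assumption here, not something confirmed in this thread.

```python
from itertools import count


def check_reads(reads, is_informative, max_reads=20000):
    """Consume reads lazily; stop once max_reads informative reads were checked."""
    checked = 0   # reads that passed the filter and were counted
    consumed = 0  # reads pulled from the stream, counted or not
    for read in reads:
        consumed += 1
        if not is_informative(read):
            continue
        checked += 1
        if checked >= max_reads:
            break  # early exit: the rest of the stream is never read
    return checked, consumed


# Case 1: an endless stream where every second read is informative.
# The loop still terminates because of the cap.
checked_a, consumed_a = check_reads(count(), lambda r: r % 2 == 0)

# Case 2: a short stream where only a handful of reads pass the filter.
# The cap is never reached, so the count stays small and the stream is exhausted.
checked_b, consumed_b = check_reads(range(10), lambda r: r < 4)

print(checked_a, consumed_a)
print(checked_b, consumed_b)
```

If the real filter depends on overlap with annotated features, a very low count such as 4 could mean that few reads pass it with the TE annotation. That is one possibility worth checking rather than a diagnosis.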
If you can send me a compressed archive of all the necessary files at saketc@iitb.ac.in, I can take a look.
Closing this as I haven't heard back.
Hi there! I have a question about how RiboTricer handles unannotated sequences.
Purpose: I am trying to run RiboTricer after generating BAM files using SQUIRE (which does both gene and transposable element quantification), and I want to know which TE reads in my ribo-seq data are actively translating.
Problem: When I run ribotricer prepare-orfs, all the candidate ORFs correspond with gene names, which confuses me. I don't see many unannotated regions that don't have names.
Question: Does RiboTricer predict ALL candidate ORFs (including repeat elements)? If so, how does it name them? As long as all the predicted ORFs have a start and end site, that's all I need, since I have a TE annotation. I have a .txt repeatmasker file with specific TE sequences; can I somehow use that within RiboTricer?