Unannotated ORFs - Githubissues

singhbhavya commented 1 month ago

Hi there! I have a question about how RiboTricer handles unannotated sequences.

Purpose: I am trying to run RiboTricer after generating BAM files using SQUIRE (which does both gene and transposable element quantification), and I want to know which TE reads in my ribo-seq data are actively translating.

Problem: When I run ribotricer prepare-orfs, all the candidate ORFs correspond with gene names, which confuses me. I don't see many unannotated regions that don't have names.

Question: Does RiboTricer predict ALL candidate ORFs (including repeat elements)? If so, how does it name them? As long as all the predicted ORFs have a start end end site, that's all I need, since I have a TE annotation. I have a .txt repeatmasker file with specific TE sequences; can I somehow use that within RiboTricer?

singhbhavya commented 1 month ago

Hi there! I re-ran RiboTricer with a TE annotation (gtf). All the samples ran up until this point. I'm having trouble making sense of this error message; please let me know if you could help! The "index" file warning is not an issue, because the same samples ran with the same warning when the gtf was different. It seems the program fails after "WARNING: no periodic read length found... using cutoff 0.418", but no other warnings or errors are reported. In addition, all the reported files and graphs (except for read_length_dist) are empty. The metagene_profile tables are all also exclusively populated with zeroes.

(ribotricer_env) $ ribotricer detect-orfs --bam results/squire/Aza3-A/Aza3-A.bam --ribotricer_index results/ribotricer/ribotricer_candidate_orfs_TE.tsv --prefix results/ribotricer/Aza3-A_TE/Aza3-A_ribotricer --phase_score_cutoff 0.418
May 02 11:28:16 ..... started ribotricer detect-orfs
May 02 11:28:16 ... started parsing ribotricer index file
May 02 11:28:16 ... started inferring experimental design
[W::hts_idx_load3] The index file is older than the data file: results/squire/Aza3-A/Aza3-A.bam.bai
May 02 11:34:20 ... started reading bam file
[W::hts_idx_load3] The index file is older than the data file: results/squire/Aza3-A/Aza3-A.bam.bai
  0%|                                                                                                                                                                                                         | 0/196983294 [00:00<?, ?reads/s][W::hts_idx_load3] The index file is older than the data file: results/squire/Aza3-A/Aza3-A.bam.bai
May 02 11:44:47 ... started plotting read length distribution                                                                                                                                                                                  
May 02 11:44:48 ... started calculating metagene profiles. This may take a long time...

May 02 11:44:48 ... started plotting metagene profiles
May 02 11:44:48 ... started inferring P-site offsets
WARNING: no periodic read length found... using cutoff 0.418

saketkc commented 1 month ago

Hi @singhbhavya,

Problem: When I run ribotricer prepare-orfs, all the candidate ORFs correspond with gene names, which confuses me. I don't see many unannotated regions that don't have names.

How are your TE elements annotated in the gtf?

Question: Does RiboTricer predict ALL candidate ORFs (including repeat elements)? If so, how does it name them? As long as all the predicted ORFs have a start end end site, that's all I need, since I

Yes, but it is dependent on how you generate the index (and thereby what was your input gtf). It would be helpful to know what your annotation (gtf) looks like.

singhbhavya commented 1 month ago

Hi Sir! Please see here - https://labshare.cshl.edu/shares/mhammelllab/www-data/TEtranscripts/TE_GTF/mm39_rmsk_TE.gtf.gz

saketkc commented 1 month ago

hi @singhbhavya,

Wehn I generate the index using the gtf you shared, it looks okay to me:

ORF_ID  ORF_type        transcript_id   transcript_type gene_id gene_name       gene_type       chrom   strand  start_codon     coordinate
Lx2B2_8387999_8388064_66        novel   Lx2B2   assumed_protein_coding  Lx2B2   Lx2B2:TE        assumed_protein_coding  chr1    +       AAG     838799
9-8388064
Lx2B2_8388080_8388172_93        novel   Lx2B2   assumed_protein_coding  Lx2B2   Lx2B2:TE        assumed_protein_coding  chr1    +       GTG     838808
0-8388172

Have you tried learning the cutoff first?

singhbhavya commented 1 month ago

My index looks the same as the one you generated; I don't think there's a problem with the index. I'm providing a few more details, in case it's helpful. This is what the bam_summary.txt looks like .

summary:
        total_reads: 196983294
        unique_mapped: 42718270
        qcfail: 0
        duplicate: 0
        secondary: 137879498
        unmapped:0
        multi:16385526

length dist:
        12: 7243
        13: 13751
        14: 18484
        15: 158495
        16: 3939702
        17: 955860
        18: 278908
        19: 326656
        20: 299653
        21: 223939
        22: 152389
        23: 151489
        24: 159219
        25: 267425
        26: 718551
        27: 1896587
        28: 2635753
        29: 3829454
        30: 10824078
        31: 3964525
        32: 2213332
        33: 2675884
        34: 1493565
        35: 1063974
        36: 618872
        37: 1505331
        38: 814373
        39: 595494
        40: 320907
        41: 162713
        42: 157413
        43: 107373
        44: 72974
        45: 37687
        46: 17676
        47: 9595
        48: 5303
        49: 2835
        50: 1379

What's super unusual is the ribotricer_protocol, which checks only 4 reads. For the other (gene) annotation, it checks far more than 4.

In total 4 reads checked:
        Number of reads explained by "++, --": 2 (0.5000)
        Number of reads explained by "+-, -+": 2 (0.5000)

I have RNA-seq data available to me for these samples, and I'll try learning the cutoff. Do you think the cutoff should be different for TE reads compared to gene?

Thank you so, so much for your help so far!! It is very appreciated.

saketkc commented 1 month ago

Hi @singhbhavya, the read distribution shows a high enrichment of 29-31nt reads which is great. The protocol inference seems a bit concerning (could be a bug). If you point me to the bam file (or a subset of it), I will be happy to take a deeper look. thanks!

singhbhavya commented 1 month ago

Hi there, Thank you so much for your help! Let me know where I can email you a BAM file. The BAM was aligned to the mm39 genome with the GTF I sent above. I can send you my ribotricer index, the BAM file, and all the commands I used to generate the genome index and do the alignments.

singhbhavya commented 1 month ago

Hi there, Apologies for the second comment, but I've noticed another thing regarding the protocol inference. In infer_protocol.py, I noticed this line:

for read in bam.fetch(until_eof=True):

This means that the program is reading the entire file & all reads in the file to infer the protocol, right? But in my case, I see that when i use the gene GTF, it only reads 20,000 reads:

(riboorf) [singhb22@li03c02 ribotricer]$ head *A/*_ribotricer_protocol.txt
==> Aza1-A/Aza1-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
    Number of reads explained by "++, --": 10002 (0.5000)
    Number of reads explained by "+-, -+": 10003 (0.5000)

==> Aza2-A/Aza2-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
    Number of reads explained by "++, --": 10012 (0.5005)
    Number of reads explained by "+-, -+": 9993 (0.4995)

==> Aza3-A/Aza3-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
    Number of reads explained by "++, --": 10035 (0.5016)
    Number of reads explained by "+-, -+": 9970 (0.4984)

==> UT1-A/UT1-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
    Number of reads explained by "++, --": 10006 (0.5002)
    Number of reads explained by "+-, -+": 9999 (0.4998)

==> UT2-A/UT2-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
    Number of reads explained by "++, --": 10005 (0.5001)
    Number of reads explained by "+-, -+": 10000 (0.4999)

==> UT3-A/UT3-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
    Number of reads explained by "++, --": 10007 (0.5002)
    Number of reads explained by "+-, -+": 9998 (0.4998)

And when I used the TE GTF/annotation, it reads 4 reads:

(riboorf) [singhb22@li03c02 ribotricer]$ head *_TE/*_ribotricer_protocol.txt
==> Aza1-A_TE/Aza1-A_ribotricer_protocol.txt <==
In total 4 reads checked:
    Number of reads explained by "++, --": 2 (0.5000)
    Number of reads explained by "+-, -+": 2 (0.5000)

==> Aza2-A_TE/Aza2-A_ribotricer_protocol.txt <==
In total 4 reads checked:
    Number of reads explained by "++, --": 2 (0.5000)
    Number of reads explained by "+-, -+": 2 (0.5000)

==> Aza3-A_TE/Aza3-A_ribotricer_protocol.txt <==
In total 4 reads checked:
    Number of reads explained by "++, --": 2 (0.5000)
    Number of reads explained by "+-, -+": 2 (0.5000)

==> UT1-A_TE/UT1-A_ribotricer_protocol.txt <==
In total 4 reads checked:
    Number of reads explained by "++, --": 2 (0.5000)
    Number of reads explained by "+-, -+": 2 (0.5000)

==> UT2-A_TE/UT2-A_ribotricer_protocol.txt <==
In total 4 reads checked:
    Number of reads explained by "++, --": 2 (0.5000)
    Number of reads explained by "+-, -+": 2 (0.5000)

==> UT3-A_TE/UT3-A_ribotricer_protocol.txt <==
In total 4 reads checked:
    Number of reads explained by "++, --": 2 (0.5000)
    Number of reads explained by "+-, -+": 2 (0.5000)

saketkc commented 1 month ago

Hi there, Apologies for the second comment, but I've noticed another thing regarding the protocol inference. In infer_protocol.py, I noticed this line:

for read in bam.fetch(until_eof=True):

This means that the program is reading the entire file & all reads in the file to infer the protocol, right? But in my case, I see that when i use the gene GTF, it only reads 20,000 reads:

(riboorf) [singhb22@li03c02 ribotricer]$ head *A/*_ribotricer_protocol.txt
==> Aza1-A/Aza1-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
  Number of reads explained by "++, --": 10002 (0.5000)
  Number of reads explained by "+-, -+": 10003 (0.5000)

==> Aza2-A/Aza2-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
  Number of reads explained by "++, --": 10012 (0.5005)
  Number of reads explained by "+-, -+": 9993 (0.4995)

==> Aza3-A/Aza3-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
  Number of reads explained by "++, --": 10035 (0.5016)
  Number of reads explained by "+-, -+": 9970 (0.4984)

==> UT1-A/UT1-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
  Number of reads explained by "++, --": 10006 (0.5002)
  Number of reads explained by "+-, -+": 9999 (0.4998)

==> UT2-A/UT2-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
  Number of reads explained by "++, --": 10005 (0.5001)
  Number of reads explained by "+-, -+": 10000 (0.4999)

==> UT3-A/UT3-A_ribotricer_protocol.txt <==
In total 20005 reads checked:
  Number of reads explained by "++, --": 10007 (0.5002)
  Number of reads explained by "+-, -+": 9998 (0.4998)

It reads first 20000 reads by default, so that looks correct.

And when I used the TE GTF/annotation, it reads 4 reads:

(riboorf) [singhb22@li03c02 ribotricer]$ head *_TE/*_ribotricer_protocol.txt
==> Aza1-A_TE/Aza1-A_ribotricer_protocol.txt <==
In total 4 reads checked:
  Number of reads explained by "++, --": 2 (0.5000)
  Number of reads explained by "+-, -+": 2 (0.5000)

==> Aza2-A_TE/Aza2-A_ribotricer_protocol.txt <==
In total 4 reads checked:
  Number of reads explained by "++, --": 2 (0.5000)
  Number of reads explained by "+-, -+": 2 (0.5000)

==> Aza3-A_TE/Aza3-A_ribotricer_protocol.txt <==
In total 4 reads checked:
  Number of reads explained by "++, --": 2 (0.5000)
  Number of reads explained by "+-, -+": 2 (0.5000)

==> UT1-A_TE/UT1-A_ribotricer_protocol.txt <==
In total 4 reads checked:
  Number of reads explained by "++, --": 2 (0.5000)
  Number of reads explained by "+-, -+": 2 (0.5000)

==> UT2-A_TE/UT2-A_ribotricer_protocol.txt <==
In total 4 reads checked:
  Number of reads explained by "++, --": 2 (0.5000)
  Number of reads explained by "+-, -+": 2 (0.5000)

==> UT3-A_TE/UT3-A_ribotricer_protocol.txt <==
In total 4 reads checked:
  Number of reads explained by "++, --": 2 (0.5000)
  Number of reads explained by "+-, -+": 2 (0.5000)

If you can send me a compressed archive of all the necessary files at saketc@iitb.ac.in, I can take a look.

saketkc commented 1 week ago

Closing this as I haven't heard back.

smithlabcode / ribotricer

Unannotated ORFs #161