p-site offsets and metagene profiles

p-levy commented 2 years ago

Hi Saket,

Thanks for developing this great tool! I am surprised to get these results for the psite_offsets.txtoutput file:

relative lag to base: 33
    lag of 28: 0
    lag of 22: 0
    lag of 33: 0
    lag of 27: 0
    lag of 24: 0
    lag of 32: 0
    lag of 34: 0
    lag of 31: 0
    lag of 30: -1
    lag of 25: 0
    lag of 26: 0
    lag of 29: 0
    lag of 35: 0
    lag of 36: 0
    lag of 23: 0
    lag of 21: 0
    lag of 20: 0

How should I interpret that?

And I find my metagene profiles a bit strange. Especially around the stop codons. See this example:

While using RibORF with the same ribo-seq bam file I get this type of profiles showing a p-site offset of 13 nt for both start and stop codon:

Why do you think this is the case and should I then trust the results of my translating_ORFs.tsv output?

Thanks! Pierre

saketkc commented 2 years ago

Thanks for developing this great tool! I am surprised to get these results for the psite_offsets.txtoutput file:
relative lag to base: 33
  lag of 28: 0
  lag of 22: 0
  lag of 33: 0
  lag of 27: 0
  lag of 24: 0
  lag of 32: 0
  lag of 34: 0
  lag of 31: 0
  lag of 30: -1
  lag of 25: 0
  lag of 26: 0
  lag of 29: 0
  lag of 35: 0
  lag of 36: 0
  lag of 23: 0
  lag of 21: 0
  lag of 20: 0
How should I interpret that?

The output indicates relative offsets between different read lengths with 33nt reads treated as the base. Ribotricer tries to choose relative offsets for merging profiles of different read lengths so to maximize the cross-correlation between them, You can override this behavior by providing custom offsets using the --psite_offsets option (https://github.com/smithlabcode/ribotricer/blob/af3bc3c48621f2aa599037edfb817201e6cfd1f7/ribotricer/cli.py#L149).

And I find my metagene profiles a bit strange. Especially around the stop codons. See this example:

While using RibORF with the same ribo-seq bam file I get this type of profiles showing a p-site offset of 13 nt for both start and stop codon:

Why do you think this is the case and should I then trust the results of my translating_ORFs.tsv output?

I am not quite sure why there is bleedthrough over the stop codon - but can likely be arising from annotation issues in the GTF. Can you provide what annotation is this based on (Genome build and GTF version). In our benchmarking, we find our strategy to have both higher sensitivity and specificity, over multiple species.

p-levy commented 2 years ago

Hi Saket,

This was generated from human data.

Genome used: GRCh38.primary_assembly.genome.fa (gencode) GTF: gencode.v39.annotation.gtf

Here's how I ran ribotricer:

1) Prepare ORFs

ribotricer prepare-orfs \
    --gtf /plevy/ref/gencode.v39.annotation.gtf \
    --fasta /plevy/ref/GRCh38.primary_assembly.genome.fa \
    --prefix gencode \
    --min_orf_length 24 \
    --start_codons ATG,TTG,CTG,GTG \
    --longest

2) Detect ORFs

ribotricer detect-orfs \
    --bam sample.bam \
    --ribotricer_index gencode_candidate_orfs.tsv \
    --prefix gencode

Thanks! Pierre

saketkc commented 2 years ago

Thanks, looks correct to me. Actually, it is hard to compare against ribORFs profile without extending ours a few bases beyond the stop codon which is currently 0 (https://github.com/smithlabcode/ribotricer/blob/af3bc3c48621f2aa599037edfb817201e6cfd1f7/ribotricer/metagene.py#L123). I will try to expose this parameter to the cli. But the bleed through in the above visualization does not affect the periodicity calculation so the results remain interpretable.

polklin commented 1 year ago

Hello @saketkc,

thanks for this very nice tool !

I add a comment in this thread because I do find some strange results in the metagene profile on my side too. I added an offset of 20 nucleotides on the 3' end as you suggested, so that I can see a P-site offset on the stop codon too.

I typically observe this kind of plot on my data:

For the start it looks nice, I have a peak ~12 nucleotides before the start codon. But it seems the Frame 3 (in blue) is majoritarian. I would have expected the Frame 1 (orange) instead ?

For the stop codon plot, the Frame 1 (orange) seems nice. But the Frame 2 (green) is way higher and constant on the all plot.

I installed ribotricer via conda (https://anaconda.org/bioconda/ribotricer), release 1.3.2. I am working with GENCODE (hg19) human release. https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/GRCh37.p13.genome.fa.gz

I prepare the ORFS with:

ribotricer prepare-orfs --gtf gencode.v19.annotation.gtf \
                        --fasta GRCh37.p13.genome.fa \
                        --prefix out \
                        --min_orf_length 30 \
                        --start_codons ATG,CTG,GTG

ribotricer detect-orfs \
             --bam input.bam\
             --prefix 'out_' \
             --ribotricer_index candidate_orfs.tsv \
             --phase_score_cutoff 0.440 \
             --report_all

Thanks, Paul

polklin commented 1 year ago

I have also tried with the test data you're providing: https://www.dropbox.com/s/xr0xsdluuni2b95/ribotricer_test_data_tair10.zip

I detected the translating ORF with the command:

ribotricer detect-orfs --bam bams_unique/SRX219170.bam \
                                      --ribotricer_index index/ribotricer_v44_annotation_longest_candidate_orfs.tsv \
                                      --prefix SRX219170_generated

If I understand correctly the code, the read length with the highest number of reads is taken as the reference (p-site shifts will be computed relatively to this reference).

In this test example, it is the 29nt long that is majoritarian. The metagene profile looks like this:

The triplet periodicity seems good, but there is a bias towards the Frame 2 (in green). I would have expected Frame 1 (orange to be majoritarian) ?

When looking at the psite_offset.txt generated file it seems all other read length are shifted towards Frame 2.

    lag of 30: 1
    lag of 28: -1
    lag of 29: 0
    lag of 26: -3
    lag of 31: 1
    lag of 27: -2
    lag of 25: -4
    lag of 24: -5
    lag of 32: 2

Perhaps that is my understanding that is not good enough .

Shouldn't we expect that only read length with a good periodicity + majority of reads on Frame 1 to be kept / considered as the reference for p-site alignment ?

smithlabcode / ribotricer

p-site offsets and metagene profiles #122