smithlabcode / ribotricer

A tool for accurately detecting actively translating ORFs from Ribo-seq data
http://doi.org/djv4
GNU General Public License v3.0

Queries regarding Ribo-Seq analysis #91

Closed HiteshKore closed 2 years ago

HiteshKore commented 2 years ago

Hi @saketkc, I have a few questions about Ribo-Seq analysis:

1) Since most ribosome footprints are 28-30 nt long, multi-mapping reads are a major concern. Which of the following alignment approaches is more suitable for typical Ribo-seq data analysis?
i) Genome-based: align the rRNA- and tRNA-depleted library to the genome with a suitable aligner (e.g., STAR), then keep only the uniquely mapped reads.
ii) Transcript-based: align the rRNA- and tRNA-depleted library to a pre-built transcriptome index with a suitable aligner (e.g., STAR). If only uniquely mapped reads are kept, most reads that align to exons shared between transcript isoforms will be filtered out.
Please advise.

2) How does ribotricer assign reads that map to exons shared by two different transcript isoforms?

3) How does ribotricer deal with reads that map at exon junctions?

Thank you in advance for your valuable suggestions.

Kind regards, Hitesh

saketkc commented 2 years ago

Since most ribosome footprints are 28-30 nt long, multi-mapping reads are a major concern. Which of the following alignment approaches is more suitable for typical Ribo-seq data analysis? i) Genome-based: align the rRNA- and tRNA-depleted library to the genome with a suitable aligner (e.g., STAR), then keep only the uniquely mapped reads. ii) Transcript-based: align the rRNA- and tRNA-depleted library to a pre-built transcriptome index with a suitable aligner (e.g., STAR). If only uniquely mapped reads are kept, most reads that align to exons shared between transcript isoforms will be filtered out. Please advise.

I have always aligned to an annotated transcriptome, and that is what I would advise. It is not clear what the advantage of (i) would be. You can align to an rRNA and tRNA index first and then map only the unmapped reads to the mRNA transcriptome.

How does ribotricer assign reads that map to exons shared by two different transcript isoforms?

This is a limitation of ribotricer: we provide a score per ORF per isoform, but it is really hard to pinpoint which isoform is actually being translated. One proxy I have used in the past is to pick the isoform with the higher expression (in terms of the number of RPFs detected per unit length) as the one most likely being translated.
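A minimal Python sketch of that proxy, assuming you already have an RPF count and an ORF length for each isoform. The function names, isoform IDs, and numbers are hypothetical and are not part of ribotricer.

```python
from typing import Dict, Tuple


def rpf_density(rpf_count: int, orf_length_nt: int) -> float:
    """RPFs per nucleotide of ORF (hypothetical helper, not a ribotricer function)."""
    if orf_length_nt <= 0:
        raise ValueError("ORF length must be positive")
    return rpf_count / orf_length_nt


def pick_densest_isoform(orfs: Dict[str, Tuple[int, int]]) -> str:
    """Return the isoform ID with the highest RPF density.

    `orfs` maps an isoform ID to (rpf_count, orf_length_nt).
    """
    return max(orfs, key=lambda iso: rpf_density(*orfs[iso]))


# Made-up counts for two isoforms that share exons.
example = {"isoform_A": (900, 1500), "isoform_B": (950, 2400)}
best = pick_densest_isoform(example)
print(best)
```

Note that the isoform with more raw reads is not necessarily the one selected, because the counts are divided by ORF length first. This remains a heuristic; it does not prove which isoform is translated.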

How does ribotricer deal with reads that map at exon junctions?

They are treated like any other reads - I am not sure they need to be treated any differently, unless I am missing something in my understanding of the question?

HiteshKore commented 2 years ago

Thank you, @saketkc, for the detailed answers. I really appreciate it.

Am I correct in saying that ribotricer assigns reads to the ORFs of both transcripts when they map to an exon shared by the transcript isoforms?

Would it be a good strategy to keep only uniquely mapped reads with no mismatches at the transcriptome-based alignment step itself? That way, I would not have to depend on ribotricer to handle multi-mapped reads.

I also want to calculate translational efficiency (TE) and the ribosome release score (RRS) using matched RNA-Seq data. Would it be a fair strategy to consider only uniquely mapped reads for both RNA-Seq and Ribo-Seq when calculating TE and RRS?

Thanks, Hitesh

HiteshKore commented 2 years ago

Hi @saketkc,

I was referring to the supplementary PDF provided with the ribotricer paper, which lists the parameters used for aligning the SRA data with the STAR aligner (heading 2: Obtaining and pre-processing data). Did you use a transcriptome index for the alignment?

Please refer to the text below from the supplementary file:

All the Ribo-seq and RNA-seq data were mapped using STAR [11] by allowing at most two mismatches (--outFilterMismatchNmax 2) and forcing end-to-end (--alignEndsType EndToEnd) read alignment. Only uniquely mapping reads were retained (--outFilterMultimapNmax 1). For human and mouse, we relied on the GENCODE [16] GTF for annotation.
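For reference, the three filters quoted above could be assembled into a STAR call roughly as follows. This is only a sketch: the index directory, GTF, FASTQ, and output prefix are placeholder names, and the remaining flags (`--runThreadN`, `--genomeDir`, `--sjdbGTFfile`, `--readFilesIn`, `--outSAMtype`, `--outFileNamePrefix`) are standard STAR options assumed here, not settings taken from the supplementary file.

```python
import shlex
from typing import List


def build_star_command(genome_dir: str, gtf: str, fastq: str,
                       out_prefix: str, threads: int = 4) -> List[str]:
    """Build a STAR argument list; nothing is executed here."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,          # genome index (placeholder path)
        "--sjdbGTFfile", gtf,               # GTF used as guide annotation
        "--readFilesIn", fastq,
        "--outFilterMismatchNmax", "2",     # at most two mismatches
        "--alignEndsType", "EndToEnd",      # no soft-clipping of read ends
        "--outFilterMultimapNmax", "1",     # keep uniquely mapping reads only
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
    ]


cmd = build_star_command("star_index/", "annotation.gtf", "sample.fastq", "sample_")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once STAR is installed and the placeholder paths are replaced.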

I believe that keeping only unique reads will filter out most of the reads aligned to exons shared by different transcripts. In that case, reads aligned to transcripts with a common start codon will also be filtered out. Could you please explain this? (Please see the attached UCSC snapshot of an FGFR gene for reference.)

Thank you,

Kind regards, Hitesh

[Attached image: Common_start_site]

saketkc commented 2 years ago

We aligned to the genome with a GTF as the guide annotation. (In my previous comment, where I wrote that I aligned to an annotated transcriptome, this is what I meant: in one terminology, mapping to the genome without a GTF is called mapping to the genome, and mapping with a GTF is called mapping to the annotated transcriptome.) So the BAM is aligned to the genome, and you can use the GTF to obtain the per-isoform, codon-wise read distribution. Also, the start codon for FGFR1 should be around position 38,47,000, right?

Hope that helps!

saketkc commented 2 years ago

Thank you, @saketkc, for the detailed answers. I really appreciate it.

Am I correct in saying that ribotricer assigns reads to the ORFs of both transcripts when they map to an exon shared by the transcript isoforms?

Since the mapping coordinates are on the genome, if you have a GTF, you can always get the read distribution over an isoform. Without any secondary data, it is hard to say which of the isoforms is being translated, but a good proxy is the read density (higher density => higher likelihood of translation).
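To illustrate why a genome-aligned read is not lost when an exon is shared, here is a small Python sketch that projects one genomic position onto each isoform's spliced coordinates. It is not ribotricer's implementation; the function name, exon coordinates, and isoform IDs are made up, and exons use 1-based inclusive coordinates as in a GTF.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # 1-based inclusive (start, end), as in a GTF


def genome_to_transcript(pos: int, exons: List[Exon], strand: str) -> Optional[int]:
    """Return the 0-based offset of genomic `pos` within the spliced transcript.

    Returns None when `pos` does not fall inside any exon of the isoform.
    """
    # Walk exons in transcription order: ascending for "+", descending for "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0
    for start, end in ordered:
        if start <= pos <= end:
            return offset + (pos - start if strand == "+" else end - pos)
        offset += end - start + 1
    return None


shared_exon = (100, 199)
isoform_a = [shared_exon, (300, 399)]
isoform_b = [shared_exon, (500, 549)]
read_pos = 150  # a uniquely mapped read on the genome, inside the shared exon

offsets = {
    "isoform_A": genome_to_transcript(read_pos, isoform_a, "+"),
    "isoform_B": genome_to_transcript(read_pos, isoform_b, "+"),
}
print(offsets)
```

The read maps to a single genomic location, so it passes a unique-mapping filter, yet it still contributes to the profile of every isoform whose exons contain that location.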

Would it be a good strategy to keep only uniquely mapped reads with no mismatches at the transcriptome-based alignment step itself? That way, I would not have to depend on ribotricer to handle multi-mapped reads.

I am not sure I understand the point about multi-mapped reads. We do not use multi-mapped reads for calculating the score.

I also want to calculate translational efficiency (TE) and the ribosome release score (RRS) using matched RNA-Seq data. Would it be a fair strategy to consider only uniquely mapped reads for both RNA-Seq and Ribo-Seq when calculating TE and RRS?

Yes, that is what we do. For example see this paper
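As a rough illustration, TE is commonly computed as the ratio of normalized Ribo-seq abundance to normalized RNA-seq abundance over the same region. The sketch below assumes both values are already length- and depth-normalized (for example, TPM from uniquely mapped reads); the function name and the optional pseudocount are assumptions of this example, not something prescribed in this thread. RRS is not sketched here.

```python
def translational_efficiency(rpf_tpm: float, rna_tpm: float,
                             pseudocount: float = 0.0) -> float:
    """Ratio of normalized Ribo-seq to RNA-seq abundance for one ORF or gene.

    Hypothetical helper: the optional pseudocount avoids division by zero
    for features with no RNA-seq signal.
    """
    denominator = rna_tpm + pseudocount
    if denominator <= 0:
        raise ValueError("RNA-seq abundance (plus pseudocount) must be positive")
    return (rpf_tpm + pseudocount) / denominator


# Made-up values: twice as much footprint signal as mRNA signal.
te = translational_efficiency(rpf_tpm=30.0, rna_tpm=15.0)
print(te)
```

Whatever normalization is used, it should be applied identically to the Ribo-seq and RNA-seq libraries so that the ratio is comparable across features.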

HiteshKore commented 2 years ago

We aligned to the genome with a GTF as the guide annotation. (In my previous comment, where I wrote that I aligned to an annotated transcriptome, this is what I meant: in one terminology, mapping to the genome without a GTF is called mapping to the genome, and mapping with a GTF is called mapping to the annotated transcriptome.) So the BAM is aligned to the genome, and you can use the GTF to obtain the per-isoform, codon-wise read distribution. Hope that helps!

Thanks, @saketkc, for clarifying all my doubts about the genome and transcriptome references. Now I am confident about my analysis.

Also, the start codon for FGFR1 should be around position 38,47,000, right?

That's correct. Sorry for sharing the wrong snapshot.

Thank you, @saketkc, for the detailed answers. I really appreciate it.

Am I correct in saying that ribotricer assigns reads to the ORFs of both transcripts when they map to an exon shared by the transcript isoforms?

Since the mapping coordinates are on the genome, if you have a GTF, you can always get the read distribution over an isoform. Without any secondary data, it is hard to say which of the isoforms is being translated, but a good proxy is the read density (higher density => higher likelihood of translation).

Would it be a good strategy to keep only uniquely mapped reads with no mismatches at the transcriptome-based alignment step itself? That way, I would not have to depend on ribotricer to handle multi-mapped reads.

I am not sure I understand the point about multi-mapped reads. We do not use multi-mapped reads for calculating the score.

I was talking in the context of an index built from transcript sequences. But I now have clarity on what you mean by transcriptome-based alignment. Your comprehensive response clarified all my doubts. Thanks for that.

I also want to calculate translational efficiency (TE) and the ribosome release score (RRS) using matched RNA-Seq data. Would it be a fair strategy to consider only uniquely mapped reads for both RNA-Seq and Ribo-Seq when calculating TE and RRS?

Yes, that is what we do. For example see this paper

Thanks for sharing this article. It's really helpful!

HiteshKore commented 2 years ago

Hi Saket, I want to generate TPM values from the raw RPF counts output by ribotricer. I am using the following link to convert raw counts into TPM:

https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

Steps followed:

  1. Divided the read counts by the length of each transcript in kilobases.
  2. Summed up all the RPK values in a sample and this number was divided by 1,000,000.
  3. Divided the RPK values by the “per million” scaling factor.
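The three steps above can be sketched in Python as follows. This is a minimal example with made-up counts; the function and feature names are hypothetical, and the length passed in can be either the ORF length or the transcript length, so the sketch does not settle which one is appropriate.

```python
from typing import Dict


def counts_to_tpm(counts: Dict[str, float],
                  lengths_nt: Dict[str, float]) -> Dict[str, float]:
    """Convert raw counts to TPM given per-feature lengths in nucleotides."""
    # Step 1: reads per kilobase (RPK) = count / (length in kb).
    rpk = {f: counts[f] / (lengths_nt[f] / 1000.0) for f in counts}
    # Step 2: "per million" scaling factor = sum of all RPK values / 1e6.
    scale = sum(rpk.values()) / 1_000_000.0
    if scale == 0:
        return {f: 0.0 for f in counts}
    # Step 3: divide each RPK value by the scaling factor.
    return {f: value / scale for f, value in rpk.items()}


raw = {"orf1": 100, "orf2": 300}
lengths = {"orf1": 1000, "orf2": 3000}
tpm = counts_to_tpm(raw, lengths)
print(tpm)
```

Because the scaling factor is computed after length normalization, the TPM values in a sample always sum to one million, whichever length definition is used.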

I was wondering whether I should divide the raw RPF count by ORF length instead of transcript length in step 1. What would be the right approach? Are there any standard packages for converting raw counts into TPM values?

Any help would be greatly appreciated.