Questions regarding training and applying offsets.

ericmalekos commented 3 months ago

Hi, I have a few questions about best practices here.

1) I have 5 replicates that I merged into one BAM. The merged BAM has ~300M alignments, including secondary alignments. Should I train on the entire BAM, or on some percentage or number of reads? Should I discard secondary alignments first?

2) Based on ribotish quality, the 28 nt reads have good periodicity (~85% frame1), while the 27 nt reads have poor periodicity (~50% frame1, ~50% frame2). Should I apply a uniform offset to the 28 nt reads and learned offsets to the 27 nt reads, or learned offsets to all reads? My worry is that, since the 28 nt reads are already good, variable offsets might dampen that signal.

3) If I want to use this with Ribo-TISH, should I train on the transcriptome alignment or the genome alignment, or does it not matter?

Thank you!

mt1022 commented 3 months ago

Hi, thanks for your interest.

1) I suggest using all of the data for training. Down-sample only if the training step takes too long to finish.
2) The quality of your data seems exceptional. If most reads in your library are 28 nt, it is fine to discard the other read lengths and use a uniform offset for the 28 nt reads. Regarding your concern: in our experience, if the periodicity of a read length is already very high, the proportion of in-frame reads stays similar after training.
3) Training is always performed with the transcriptome alignment, while prediction can be performed with either the transcriptome or the genome alignment.
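
The reply above does not settle whether secondary alignments should be dropped. For readers who decide to drop them, or who need to down-sample because training is too slow, here is a minimal Python sketch that works on SAM text lines (for example, lines piped out of a SAM/BAM viewer). It is not part of psite. The function name `keep_primary` and its parameters are hypothetical, and it assumes single-end reads, since per-record sampling would split read pairs.

```python
import random

# SAM FLAG bit 0x100 marks a secondary alignment (per the SAM specification).
SECONDARY = 0x100


def keep_primary(sam_lines, fraction=1.0, seed=0):
    """Yield header lines and non-secondary records from SAM text lines.

    Hypothetical helper for illustration only.
    fraction: probability of keeping each non-secondary record (1.0 keeps all).
    seed: fixes the random stream so the down-sampling is reproducible.
    """
    rng = random.Random(seed)
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are always passed through unchanged.
            yield line
            continue
        # The FLAG is the second tab-separated field of a SAM record.
        flag = int(line.split("\t", 2)[1])
        if flag & SECONDARY:
            continue
        if fraction >= 1.0 or rng.random() < fraction:
            yield line


# Toy records: one header, one primary, one secondary, one reverse-strand primary.
example = [
    "@HD\tVN:1.6",
    "r1\t0\ttx1\t100\t60\t28M\t*\t0\t0\t*\t*",
    "r2\t256\ttx2\t200\t0\t28M\t*\t0\t0\t*\t*",
    "r3\t16\ttx1\t300\t60\t27M\t*\t0\t0\t*\t*",
]
kept = list(keep_primary(example))
```

In this toy input only `r2` carries the secondary flag, so `kept` holds the header plus `r1` and `r3`. Passing a `fraction` below 1.0 additionally thins the remaining records at random.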

ericmalekos commented 3 months ago

Thank you for the help!

The reads are split evenly between 28 nt reads with strong periodicity and 27 nt reads with only ~50% periodicity.
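
For anyone who wants to reproduce this kind of per-length comparison outside of a QC tool, the idea can be sketched in a few lines of Python. This is an illustration only, not how ribotish or psite compute periodicity. The function name `frame_fractions` and the input layout (read length plus 5' end position relative to the first nucleotide of the CDS) are assumptions, and frames are labelled 0, 1, 2 here rather than with the frame1/frame2 labels used in the thread.

```python
from collections import Counter, defaultdict


def frame_fractions(reads):
    """Return {read_length: [fraction in frame 0, frame 1, frame 2]}.

    Hypothetical helper for illustration only.
    reads: iterable of (read_length, five_prime_pos) tuples, where
    five_prime_pos is the read's 5' end coordinate relative to the first
    nucleotide of the annotated CDS (an assumed input layout).
    """
    counts = defaultdict(Counter)
    for length, pos in reads:
        # Python's % returns a non-negative result for a positive modulus,
        # so reads starting upstream of the CDS (negative pos) are handled.
        counts[length][pos % 3] += 1

    fractions = {}
    for length, frame_counts in counts.items():
        total = sum(frame_counts.values())
        fractions[length] = [frame_counts[f] / total for f in range(3)]
    return fractions


# Toy data mimicking the situation described in this thread.
toy_reads = (
    [(28, 0)] * 85 + [(28, 1)] * 10 + [(28, 2)] * 5
    + [(27, 0)] * 50 + [(27, 1)] * 50
)
fractions = frame_fractions(toy_reads)
```

With the toy data, the 28 nt reads concentrate in a single frame while the 27 nt reads split evenly between two. That pattern is what motivates treating the two read lengths differently.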

gxelab / psite

Questions regarding training and applying offsets. #3