jszavits / NEAR

Non-Equilibrium Analysis of Ribosome profiling data
GNU General Public License v3.0

Unsure of how to generate 'genelist.dat' #3

Closed zslastman closed 7 months ago

zslastman commented 3 years ago

Hi guys - I'm trying to run your software on some Ribo-seq data in developing mouse neocortex. When I generate my own files I end up with an error like so:

INFO: Reading A-site_His_1.dat.
ERROR: Unable to normalise Ribo-seq. Gene skipped.

My genelist.dat file looks like so:

His_1 773.3779527559054

I notice that I can make the error go away if I edit this value down to a much smaller number.

This makes me unsure what the value in the second column of genelist.dat should actually be. If I'm reading the README correctly it ought to be the average ribosome density for that gene - i.e. the average value of the A_site file. However, the value in your example file is 0.02501, while the average value of your A_site count file is 189.75.

How exactly should I calculate the value in genelist.dat?

jszavits commented 3 years ago

Hi, thanks for getting in touch. The explanation is a bit subtle, so please bear with me.

The second column is the total ribosome density per mRNA (let's call it "rho"), which is defined as the total number of ribosomes on a transcript divided by the number of codons of that transcript (let's call it "L"). Given that a ribosome covers about 10 codons (but you can use a different number, as this is a parameter), the density "rho" cannot be larger than 1/10 = 0.1, which would correspond to a transcript that is completely filled with ribosomes.

So, if the program encounters a number in the second column that is larger than 1/(ribosome length in codons), it will skip that gene, as that is not a physically allowed absolute ribosome density. In contrast, if the number is less than 0.1, it is used to normalise the A-site reads, so that we get the absolute ribosome density at each codon, as follows.
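For illustration only (this is not NEAR's own code), here is a minimal Python sketch of that check on a genelist.dat-style file, assuming a ribosome length of 10 codons:

```python
# Minimal sketch (not NEAR's code): flag genelist.dat entries whose density
# exceeds the physical maximum 1/(ribosome length in codons).
RIBOSOME_LENGTH_CODONS = 10  # adjustable parameter, as noted above

def check_genelist(path="genelist.dat"):
    max_density = 1.0 / RIBOSOME_LENGTH_CODONS
    with open(path) as fh:
        for line in fh:
            if not line.strip():
                continue
            gene, rho = line.split()
            rho = float(rho)
            verdict = "ok" if rho <= max_density else "skipped (rho > 1/ribosome length)"
            print(f"{gene}\trho={rho}\t{verdict}")

check_genelist()  # e.g. a value like 773.377... would be flagged as skipped
```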

From the Ribo-seq data one gets the number of A-site reads for each codon of the transcript, say "A_i", where "i" is the i-th codon of the transcript. This number is first normalised by the average number of reads for that transcript, which is the sum of "A_i" over all codons "i" divided by the codon number "L". That number is then multiplied by the total absolute density "rho", resulting in the absolute ribosome density for each codon, "rho_i". This "rho_i" is then used as an input for the mechanistic model of translation.
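In other words (a sketch with made-up numbers, not the program's actual implementation):

```python
# Sketch of the normalisation described above: rho_i = rho * A_i / (sum_j A_j / L),
# where A_i are the per-codon A-site read counts and rho comes from genelist.dat.
def per_codon_density(a_site_counts, rho):
    L = len(a_site_counts)                  # number of codons
    mean_reads = sum(a_site_counts) / L     # average A-site reads per codon
    return [rho * a / mean_reads for a in a_site_counts]

# Toy example with rho = 0.02501 (the value from the example genelist.dat)
print(per_codon_density([12, 30, 7, 150, 22], rho=0.02501))
```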

The issue is how to get the absolute ribosome density "rho". You cannot get it solely from Ribo-seq data, because Ribo-seq gives you only the relative density "rho_i" divided by "rho" (which is the same as A_i divided by the average read count, i.e. the sum of A_i over all codons divided by L). In the original paper we used polysome profiling data for "rho", which was available for yeast. An alternative would be to use Ribo-seq data in combination with RNA-seq data, as a proxy for the amount of RNA, but we have not tried that. This issue of normalisation is something we are working on.

Does that help?

zslastman commented 3 years ago

Aha yes! That's very helpful, and makes perfect sense.

I have RNA-seq data available; however, what I don't have is absolute measurements of mRNA concentration from spike-ins or similar.

It would be easy to use the mRNA-seq and Ribo-seq measurements to come up with a ceiling on the absolute ribosome density per mRNA (since, as you say, 0.1 is the maximum physics will permit), which I guess would let the algorithm converge, but I'm wondering what effect it will have if all of my densities are scaled incorrectly.
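Something like this is what I have in mind (made-up numbers, not a validated method; the true scaling factor is exactly what's unknown):

```python
# Rough sketch of the "ceiling" idea (made-up numbers): rescale relative
# densities (Ribo-seq counts / RNA-seq abundance) so that the most densely
# loaded gene sits at the physical maximum of 0.1 ribosomes per codon.
MAX_DENSITY = 0.1  # 1 / (ribosome length in codons)

def scale_to_ceiling(relative_density):
    """relative_density: dict gene -> Ribo-seq/RNA-seq ratio (arbitrary units)."""
    scale = MAX_DENSITY / max(relative_density.values())
    return {gene: r * scale for gene, r in relative_density.items()}

print(scale_to_ceiling({"His_1": 3.2, "GeneA": 1.1, "GeneB": 0.4}))
# every value is now <= 0.1, but only correct up to the unknown constant
```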

Even if the initiation/elongation rates were all off by some constant, the rank ordering would be interesting. I have multiple samples I'd like to compare, and the mRNA:ribosome ratio certainly differs between them. If I could compare between samples, that would be even better. I observe a peak of ribosome occupancy around the start codon at the end of my time course for some genes, and I'm trying to figure out whether it's due to slower elongation in the gene body making the start peak relatively lower.

jszavits commented 3 years ago

It all depends on the amount of ribosome traffic. If you have low ribosome traffic, then the algorithm will predict elongation rates that are inversely proportional to the read count, as if there were no queuing. If you have increased ribosome traffic in which some ribosomes are queuing, then the inferred elongation rates will in general differ from the inverse read counts. So if your estimate for the density is larger than it really is, the algorithm will predict queuing; if your estimate is lower, it will predict less queuing. I know of one paper on yeast in which the absolute ribosome density estimated from Ribo-seq/RNA-seq was compared to polysome profiling. If I remember correctly, the two datasets correlated very well, which gives me hope that this type of normalisation is not too bad. The paper is: https://doi.org/10.1371/journal.pgen.1007166
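To illustrate only the low-traffic limit (this is just the proportionality, not the actual inference algorithm):

```python
# Illustration of the low-traffic limit: with negligible queuing, the inferred
# elongation rate at codon i is proportional to 1/rho_i, up to a common factor
# set by the (unknown) ribosome flux through the transcript.
def low_traffic_rates(rho_i, flux=1.0):
    return [flux / r for r in rho_i]  # k_i ~ flux / rho_i when queuing is negligible

print(low_traffic_rates([0.002, 0.010, 0.001]))  # denser codons = slower rates
```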

To answer your last paragraph: one thing you could try is to run the algorithm over a range of absolute densities. The theoretical range is from 0 to 1/(ribosome length in codons), so you could try getting results for, say, 5%, 10%, 15%, 20%, 25% and 30% of the maximum density (if you set the absolute density too high, say above 30% of the maximum, the program will not be as reliable; this is explained in the paper). I'm not sure how many genes you have, but if it's thousands of genes that could take a while.
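For example, something along these lines (the file names and gene list are just placeholders) would generate one genelist.dat-style input per candidate density, and you would then run the program once per file:

```python
# Sketch of the density scan: write one genelist.dat-style file per candidate
# density (5%-30% of the maximum). File names and the gene list are placeholders.
MAX_DENSITY = 0.1            # 1 / (ribosome length in codons)
genes = ["His_1"]            # placeholder list of genes to scan

for frac in (0.05, 0.10, 0.15, 0.20, 0.25, 0.30):
    rho = frac * MAX_DENSITY
    with open(f"genelist_{int(frac * 100)}pct.dat", "w") as fh:
        for gene in genes:
            fh.write(f"{gene} {rho}\n")
```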

zslastman commented 3 years ago

Thanks again for the detailed and helpful answer. That paper has a lot of useful information for us - we don't have access to the exact same data they did, but we might be able to do something similar with what we have.

I have thousands of genes, indeed. But I guess I could try a range of densities for a subset of genes (in different quartiles of Ribo-seq density, say) and then use these to scale the full set, given that Ribo-seq/mRNA should give me the density up to a scaling constant.
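Roughly this kind of thing for picking the subset (made-up numbers, just to sketch the idea):

```python
# Rough sketch (made-up numbers): pick one gene per quartile of relative
# Ribo-seq density to use for the density scan, before scaling the full set.
import numpy as np

rng = np.random.default_rng(0)
relative_density = dict(zip(
    (f"gene_{i}" for i in range(1000)),   # placeholder gene names
    rng.gamma(2.0, 1.0, 1000),            # placeholder Ribo-seq/mRNA ratios
))
genes = np.array(list(relative_density))
values = np.array(list(relative_density.values()))
quartile = np.digitize(values, np.quantile(values, [0.25, 0.5, 0.75]))
subset = [genes[quartile == q][0] for q in range(4)]  # one gene per quartile
print(subset)
```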