jmbreda / Sanity

Filtering of Poisson noise on a single-cell RNA-seq UMI count matrix
GNU General Public License v3.0

interpretation of log transcription quotients #17

Closed maximelepetit closed 2 years ago

maximelepetit commented 2 years ago

How can I interpret the log transcription quotients? Since they take negative values, they cannot be interpreted as counts, can they? Most tools, such as clustering algorithms, CellPhoneDB, or SingleR, do not work with negative values and mostly expect log-transformed counts.

jmbreda commented 2 years ago

Hi maximelepetit,

Thanks for your question.

We realized that the log transcription quotient (LTQ) can be a bit confusing. We just updated the README to try to explain it better:

The LTQ x_gc of gene g in cell c corresponds to the estimated logarithm of the fraction of mRNAs in cell c that belong to gene g. The LTQs are thus normalized such that Σ_g exp(x_gc) = 1 for each cell c. To get an estimate of the mRNA count for gene g in cell c, one would thus need to multiply exp(x_gc) by the estimated total number of mRNAs M in the cell.

So what I would typically do is compute M as the median total mRNA count per cell and add log(M) to every LTQ. This corresponds to multiplying the fraction of mRNAs that belong to each gene in a cell (the transcription quotient) by M. If we denote the transcription quotient by f_gc, we have:

log-transformed count = log(f_gc * M) = log(f_gc) + log(M) = LTQ + log(M)
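
As a minimal sketch of the shift above, assuming Python with NumPy and an LTQ matrix laid out as genes x cells: the function name `ltq_to_log_counts`, the toy fractions, and the value `M = 6000.0` are made up for illustration and are not part of Sanity.

```python
import numpy as np

def ltq_to_log_counts(ltq, M):
    """Return LTQ + log(M), i.e. log(f_gc * M), for every gene and cell."""
    return np.asarray(ltq, dtype=float) + np.log(M)

# Toy transcription quotients f_gc (3 genes x 2 cells); each column sums to 1.
fractions = np.array([[0.5, 0.2],
                      [0.3, 0.7],
                      [0.2, 0.1]])
ltq = np.log(fractions)  # stand-in for Sanity's LTQ output (all values negative)

M = 6000.0  # hypothetical estimate of the total mRNA count per cell
log_counts = ltq_to_log_counts(ltq, M)

# After the shift, exponentiating and summing over genes gives M in every cell.
col_sums = np.exp(log_counts).sum(axis=0)
```

With a realistic M, the shifted values are positive for all but the most lowly expressed genes, because a value stays negative only when f_gc * M is below 1.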

I hope this helps!

Jeremie

maximelepetit commented 2 years ago

Thanks for the answer. That helps me a lot and clarifies my ideas.

I'm also a little bit confused:

Does the estimated total number of mRNAs M in the cell correspond to Σ_g exp(x_gc)? So if Σ_g exp(x_gc) = 1 and we multiply exp(x_gc) by the estimated total number of mRNAs M in the cell, is the estimated mRNA count for gene g in cell c equal to exp(x_gc)?

I'm a little bit confused about which matrix I need in order to compute the median total mRNA per cell. So, to compute the log-transformed counts, do I need to calculate the median of each column of the transcription quotient (exp(LTQ)) matrix, take the logarithm of this median, and add it to each value in the corresponding column of the LTQ matrix?

Maxime

maximelepetit commented 2 years ago

Sorry, I was wrong when I said log-transformed counts in my first comment. I meant log-normalized counts, which are obtained by normalizing the feature expression measurements for each cell by the total expression, multiplying this by a scale factor (10,000 by default), and log-transforming the result.

jmbreda commented 2 years ago

Hi Maxime,

The LTQ x_gc represents, in each cell, the log of the fraction of transcripts that are "allocated" to each gene, so these fractions sum to 1: Σ_g exp(x_gc) = 1. If you want an estimated normalized count rather than an estimated fraction, you can multiply exp(x_gc) by M, where M is the estimated total count per cell. Then Σ_g exp(x_gc) M = M, so each cell's normalized counts sum to M. To get the log-normalized count, take the log of that: log-normalized count = log(exp(x_gc) M) = x_gc + log(M) = LTQ + log(M), where LTQ is the output of Sanity.

Now, to get the estimated total count per cell, I proposed taking the median of the total count per cell. You can compute that on the raw UMI matrix that was given as input when running Sanity: if n_gc is the UMI count for gene g in cell c, then M = median_c(Σ_g n_gc), i.e. sum over genes within each cell, then take the median over cells. But of course you can use another scale factor, such as M = 10,000, which I think is in the typical range of the total count per cell in a scRNA-seq experiment.
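
A minimal sketch of that computation, assuming Python with NumPy and a raw UMI matrix with genes as rows and cells as columns. The function name `median_total_count` and the toy counts are made up for illustration.

```python
import numpy as np

def median_total_count(umi):
    """M = median over cells of the per-cell total, sum_g n_gc.

    Assumes umi is laid out as genes x cells, so the sum over genes
    is a sum down each column (axis=0).
    """
    totals_per_cell = np.asarray(umi).sum(axis=0)
    return float(np.median(totals_per_cell))

# Toy raw UMI counts n_gc: 4 genes x 4 cells.
umi = np.array([[3, 0, 1, 0],
                [0, 2, 0, 0],
                [5, 1, 4, 2],
                [0, 0, 0, 1]])

M = median_total_count(umi)  # per-cell totals are 8, 3, 5, 3
log_M = np.log(M)            # the constant to add to every LTQ
```

If the matrix is stored as cells x genes instead, the sum would need `axis=1`.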

Does this clarify the confusion?

Best, Jeremie

maximelepetit commented 2 years ago

Hi, yes, this clarifies the confusion very much. Thanks a lot!

Last question: to get the estimated total count per cell, you proposed taking the median of the total count per cell, but in my case the median value is 0. Can I estimate the total count per cell with another method, such as the mean of the total count per cell?

Best,

Maxime

jmbreda commented 2 years ago

Hi,

Yes, you can also use the mean total count per cell. As far as I remember, the distribution of the total count per cell is very asymmetric, which is why the median might be more appropriate.

Anyway, it sounds very strange to me to obtain a median total count per cell of 0. It would mean that at least half of your cells do not have a single count. Do you have that many "empty" cells that somehow do not have any counts? I think typical pre-processing of scRNA-seq reads filters out cells with low total counts. Another reason I could think of: maybe you are calculating the median count per gene and per cell (median(n_gc)) instead of the median total count per cell (median(sum_g n_gc)), because in a scRNA-seq experiment the majority of all UMI counts n_gc are certainly 0.
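
To illustrate the difference between the two medians, here is a small sketch, assuming Python with NumPy and a genes x cells layout. The toy matrix is made up and much smaller than real data, but it is mostly zeros like a typical UMI matrix.

```python
import numpy as np

# Toy raw UMI counts n_gc: 5 genes x 3 cells, mostly zeros.
umi = np.array([[0, 0, 1],
                [0, 2, 0],
                [4, 0, 0],
                [0, 0, 0],
                [0, 1, 0]])

# Median over all individual entries n_gc: dominated by the zeros.
per_entry_median = float(np.median(umi))

# Median over cells of the per-cell totals sum_g n_gc: the intended M.
totals_per_cell = umi.sum(axis=0)
median_total = float(np.median(totals_per_cell))

# The mean of the per-cell totals is the alternative scale mentioned above.
mean_total = float(totals_per_cell.mean())
```

Here `per_entry_median` is 0 because 11 of the 15 entries are zero, while `median_total` is 3, the median of the per-cell totals 4, 3, and 1.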

maximelepetit commented 2 years ago

Hi,

The problem was that I computed the median count per gene and per cell (median(n_gc)) instead of the median total count per cell (median(sum_g n_gc)). I now get a median total count per cell of 6012, which is better than 0.

Thank you for your answers. I think they will clarify Sanity's output for future users.

Best,

Maxime

jmbreda commented 2 years ago

Hi Maxime,

Great! Happy that it works. And yes, I think it will help other users.

Best, Jeremie