tanaylab / metacells

Metacells - Single-cell RNA Sequencing Analysis
MIT License
86 stars 8 forks

Meaning of aggregated values in metacells #63

Open annaKett opened 8 months ago

annaKett commented 8 months ago

Hello, thank you for the support you provided recently! I have another question regarding the values we get in the aggregated metacells (after collect_metacells()). I compute the metacells from untransformed count data and expected to obtain 'artificial' aggregated count data for each metacell, which I could then log-transform for further analyses. Can I indeed treat the values as counts? The total_count distribution looks quite different from what I would expect for count data, and the values are normalized to lie between 0 and 1, which again confuses me. Is there any way to obtain values for the metacells that can be used as counts again? Maybe if I set metacell_geo_mean=False? Thank you lots & kind regards Anna

orenbenkiki commented 8 months ago

The output of the algorithm is a (hopefully robust) estimate of the fraction of each gene's UMIs out of the total UMIs. This factors out the depth of sampling (since we have no good idea of how many UMIs the cells actually have). It also allows for greater precision in describing the relative expression levels of the genes.

There's also the total UMIs (the sum of the gene UMIs over all the cells grouped into the metacell), but this is not a good estimate in and of itself (due to noise and the differences in total UMIs between the cells).

If you have to, you can convert the fractions to UMI counts by picking an arbitrary total number of UMIs and multiplying by it. However, this will give you fractional UMIs, which will (rightfully) cause indigestion to many algorithms. Simply rounding the values to the nearest integer isn't the best approach (e.g. weak genes would be rounded to zero). A better approach would be to use stochastic rounding: keep the integer part, and round up with a probability equal to the remaining less-than-one fraction.
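
For readers who want to try this, here is a minimal sketch of the stochastic rounding described above, using plain NumPy. The function name `fractions_to_counts`, the example fractions, and the total of 10,000 UMIs are all illustrative assumptions and are not part of the metacells package.

```python
import numpy as np

def fractions_to_counts(fractions, total_umis, rng):
    """Scale gene fractions by an arbitrary total, then round stochastically.

    Hypothetical helper for illustration; not a metacells API.
    """
    scaled = np.asarray(fractions, dtype=np.float64) * total_umis
    floor = np.floor(scaled)
    remainder = scaled - floor  # the less-than-one fractional part
    # Round up with probability equal to the fractional part.
    round_up = rng.random(scaled.shape) < remainder
    return (floor + round_up).astype(np.int64)

rng = np.random.default_rng(0)
# One metacell, four genes; the last gene is weak (0.4 expected UMIs).
fractions = np.array([[0.5, 0.3, 0.19996, 0.00004]])
counts = fractions_to_counts(fractions, total_umis=10_000, rng=rng)
```

With nearest-integer rounding the weak gene would always become 0; here it becomes 1 in roughly 40% of draws, so its expected value is preserved. Note that the rounded counts of a metacell will generally not sum to exactly the chosen total.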

annaKett commented 8 months ago

Ok, thank you! The authors of the data we're using performed SCRAN normalization to normalize for differences in total UMI counts per cell: they first performed total count normalization, dividing each count by its cell’s total count and multiplying by 10,000. They then applied a natural-log transformation with a pseudocount of 1. I revert the log transformation and the pseudocount before applying metacells. Since we have normalized for the number of UMIs, I would expect that we are better off in terms of interpreting the values we get from metacells - do you agree? Could we perhaps convert the fractions by multiplying them by the aggregate normalized UMI counts of the cells included in the metacell? Thank you lots and best regards Anna
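
The reversal described above (undoing the natural log and the pseudocount of 1) can be sketched as follows. This is an illustration under the stated assumptions only: the helper name `revert_log1p` and the toy matrix are hypothetical, and the input is assumed to be ln(1 + normalized value).

```python
import numpy as np

def revert_log1p(log_normalized):
    """Undo ln(1 + x): returns the total-count-normalized values x.

    Hypothetical helper for illustration.
    """
    return np.expm1(np.asarray(log_normalized, dtype=np.float64))

# Toy data (cells x genes) pushed through the normalization described above.
raw = np.array([[3, 0, 7], [1, 2, 1]], dtype=np.float64)
scaled = raw / raw.sum(axis=1, keepdims=True) * 10_000
log_normalized = np.log1p(scaled)

reverted = revert_log1p(log_normalized)
```

The reverted values sum to 10,000 in every cell, so they are still depth-normalized and generally not integers.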

orenbenkiki commented 8 months ago

Well... actually it would be better if you used UMI counts reflecting the total number of UMIs in the original cells, because sampling depth impacts the algorithm (as it should). If you have the sampling depth (total UMIs) of each cell, you could create new UMI counts that add up to that for each cell (instead of using a fixed value).

Also, the metacell algorithm expects the UMI counts to be integers (I know, it insists the data type be float32 - sorry about that - but the actual values must be integers). See my previous comment about rounding.
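
A rough sketch of that suggestion, assuming the per-cell total UMIs are available (for example from the dataset's metadata): rescale each cell from the fixed 10,000 back to its own depth, round to integers, and store the result as float32. The function name `restore_umi_counts` and the `scale` parameter are hypothetical and not part of metacells.

```python
import numpy as np

def restore_umi_counts(normalized, cell_totals, scale=10_000):
    """Rescale per-cell normalized values to each cell's original depth.

    normalized: cells x genes, each row summing to `scale`.
    cell_totals: original total UMIs of each cell (assumed to be known).
    Hypothetical helper for illustration.
    """
    normalized = np.asarray(normalized, dtype=np.float64)
    cell_totals = np.asarray(cell_totals, dtype=np.float64)
    estimated = normalized * cell_totals[:, None] / scale
    # If the input really derives from integer counts, the estimates are
    # integers up to floating-point error, so nearest rounding suffices.
    counts = np.rint(estimated)
    # Integer values stored as float32, as described above.
    return counts.astype(np.float32)

raw = np.array([[3, 0, 7], [1, 2, 1]], dtype=np.float64)
totals = raw.sum(axis=1)
normalized = raw / totals[:, None] * 10_000

restored = restore_umi_counts(normalized, totals)
```

If the estimates turn out to be far from integers, the normalization was probably not a plain total-count scaling, and stochastic rounding as mentioned earlier would be the safer choice.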