warrenmcg / sleuth-CN

Improve Differential Analysis using Compositional Data Analysis
GNU General Public License v3.0

Comparing gene expression quantified by different schemes after compositional normalization #3

Open ahy1221 opened 5 years ago

ahy1221 commented 5 years ago

I am not sure I understand the preprint correctly. It seems that sleuth-ALR introduces a new normalization method by treating counts as compositional data. If so, I was wondering: if I have two datasets, one quantified by read counts (coverage-based) and the other quantified by UMI counts, is it possible to do a direct mean expression comparison for a gene after performing compositional normalization?

warrenmcg commented 5 years ago

Hi @ahy1221,

The short answer to your question is yes, you can do direct comparisons of any two RNA-Seq datasets after both have been normalized using compositional normalization, provided you used the same feature(s) to normalize both.

However, the interpretation of this comparison depends on answering two additional questions: (a) Were these two datasets created independently (the second with a protocol that used UMIs, and the first without), or are they the same dataset analyzed with and without UMIs? (b) If they are two independent datasets, are there any differences in the biological part of the two experiments, or is this a technical comparison of the same biological experiment using two different protocols?

As you probably know, the whole reason for using UMIs is to minimize counting PCR duplicates, but there is a risk of "over"-deduplicating read counts because of UMI collisions or because of how exactly the protocol was carried out.

If both datasets are from identical biological experiments, then this direct comparison will give you an estimate of the differences due to the protocol or to the analysis step alone, depending on how the datasets were generated.

If the two datasets come from different biological experiments, the comparison is much harder to interpret, because differences may be due to technical differences (UMIs versus none, library prep protocols, etc.), to differences in the experiments themselves (model system, experimental manipulation, etc.), or to both. In this situation, it will be difficult or impossible to distinguish differences due to one factor from differences due to the other.
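To make the mechanics of such a comparison concrete, here is a minimal sketch using the alr_transformation function I describe in a later comment (matA, matB, and ref_gene are placeholder names; both matrices are assumed to share the same features in the same row order):

## Placeholder sketch: normalize two count matrices against the SAME reference feature
ref_gene <- "GAPDH"  # assumed reference feature; it must behave consistently in both datasets
lrA <- sleuthALR::alr_transformation(mat = matA, denom_name = ref_gene, remove_zeros = TRUE, delta = 0.5)
lrB <- sleuthALR::alr_transformation(mat = matB, denom_name = ref_gene, remove_zeros = TRUE, delta = 0.5)
## Per-gene mean logratio expression in each dataset, now on a comparable scale
mean_diff <- rowMeans(lrA) - rowMeans(lrB)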

ahy1221 commented 5 years ago

Thank you so much for the detailed comments. Is there a way to extract the normalized expression matrix using sleuth-ALR? Or is there a normalization function in this package that can be applied to any counts matrix generated by a different counting scheme (such as featureCounts directly at the gene level)?

warrenmcg commented 5 years ago

Yes to both questions:

Question 1: How do I extract the normalized expression matrix from my sleuth-ALR object?

In sleuth, there is a distinction between the normalization step and the transformation step. In sleuth-ALR, what you can extract depends on the lr_method argument. If lr_method is "both", then you can get normalized-but-not-transformed data as well as normalized-and-transformed data. The normalization step is the ratio of each feature to the set of features you chose for the denominator using denom_name. The transformation step takes the log of these ratios, producing logratios (hence the name "Additive LogRatio", ALR). If lr_method is "transform", then only the latter is available.
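To make those two steps concrete, here is a toy sketch for a single sample (the numbers are made up):

## Toy example of the two steps for one sample (made-up numbers)
x <- c(geneA = 100, geneB = 250, denom = 50)
ratios <- x / x["denom"]   # normalization: ratio of each feature to the denominator
logratios <- log(ratios)   # transformation: the additive logratio (ALR)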

The normalized-but-not-transformed data can be accessed using the sleuth_to_matrix function in sleuth:

## NOTE: these are only meaningful if `lr_method == "both"`
## Assume your sleuth object is named "so"
norm_counts <- sleuth::sleuth_to_matrix(so, "obs_norm", "est_counts")
norm_tpms <- sleuth::sleuth_to_matrix(so, "obs_norm", "tpm")
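As a quick sanity check (assuming denom holds the row name of your denominator feature), the normalized values for the denominator itself should all be 1, since each is the ratio of the feature to itself:

## Sanity check: the denominator's normalized values should all equal 1
stopifnot(all(abs(norm_counts[denom, ] - 1) < 1e-8))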

The normalized-and-transformed data is found in the bs_summary list in the sleuth object:

nAndT_counts <- so$bs_summary$obs_counts
nAndT_tpms <- so$bs_summary$obs_tpm

Question 2: Can I apply the normalization function to count matrices generated by other counting schemes?

There is a legitimate question about which counting scheme has better performance. Putting that question aside, however, you can certainly transform any count matrix using this approach! To do so, use the X_transformation functions to apply any of the available compositional normalization approaches: alr_transformation, clr_transformation, and iqlr_transformation. Check out the documentation for each to get a sense of what you need to specify. I may need to tweak the settings to provide better defaults, but for now you should set remove_zeros = TRUE and delta = 0.5 (based on my testing, this seems to be the optimal value for counts; use 0.01 for TPMs). Also set denom_name if you are going to use alr_transformation.

Here's an example of using alr_transformation (assume counts is your counts matrix, and denom is the feature you want to use for the normalization):

norm_counts <- sleuthALR::alr_transformation(mat = counts, denom_name = denom, remove_zeros = TRUE, delta = 0.5)

Please NOTE: this function assumes that there are more features than there are samples, and will give unexpected results if that is not true.
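In case the function does not guard against that for you, a simple check before the call doesn't hurt; and if you want CLR instead, the call is similar, minus the denominator (I am assuming here that clr_transformation accepts the same zero-handling arguments, so double-check its documentation):

## Guard against the "fewer features than samples" case noted above
stopifnot(nrow(counts) > ncol(counts))
## Assumed to mirror alr_transformation's zero-handling arguments -- verify with the docs
clr_counts <- sleuthALR::clr_transformation(mat = counts, remove_zeros = TRUE, delta = 0.5)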