What do you recommend for single cell RNAseq data: counts, normalized counts in log scale, other?

neurorestore / Augur

Cell type prioritization in single-cell data

MIT License


What do you recommend for single cell RNAseq data: counts, normalized counts in log scale, other? #21

Open apeleraux opened 2 years ago

apeleraux commented 2 years ago

You indicated in your publication that your method is relatively robust to various preprocessing and normalization steps. However, when I tested it on a single-cell RNA-seq dataset using either counts or normalized, log-transformed counts as the input matrix, I found quite different cell type prioritization results. What would you generally recommend using?

skinnider commented 2 years ago

We almost exclusively run Augur on raw counts. The exception is for very acute perturbations (e.g. mice walking on a treadmill for 15 min prior to sample collection), where we found that RNA velocity estimates provide more information than raw counts.

apeleraux commented 2 years ago

Thanks for your fast answer. I understand the need for RNA velocity estimates in certain cases. We are mostly interested in longer time frames, so raw counts would be our choice. In that case, does Augur include normalization by total counts per cell, or some similar normalization optimized for single-cell RNA-seq data? Intuitively, it seems to me that classification between two conditions should perform better on normalized data, and that Augur may therefore work better with normalized data. But of course I may be wrong! Have you investigated this question, or do you know of relevant papers on this topic?

skinnider commented 2 years ago

It's important to consider that 'better classification' is not really the goal of Augur - instead we are trying to identify cell types that are showing a transcriptional response to a perturbation, and so what's really of interest are the relative differences in classification accuracy between cell types. In our initial experiments, we saw minimal changes in the relative rankings of individual cell types when normalizing gene expression (e.g. with log-TP10K). However, we did find that there was less separation between cell types when running Augur on normalized gene expression values and so we generally run Augur on untransformed counts. In terms of understanding why Augur is so robust to running on untransformed counts, Extended Data Fig. 10 in the Nat. Biotechnol. paper might be useful in thinking about the kinds of scenarios that would be required for sequencing depth to be a confounding factor in the analysis.
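For readers unfamiliar with the normalization mentioned above, here is a minimal sketch of what log-TP10K typically does. It is written in Python with NumPy purely for illustration and is not Augur's own code. It assumes a genes x cells matrix and that log-TP10K means log(1 + 10,000 x count / total counts per cell); the function name `log_tp10k` and the toy matrix are hypothetical.

```python
import numpy as np


def log_tp10k(counts: np.ndarray) -> np.ndarray:
    """Log-transform counts scaled to 10,000 transcripts per cell.

    Assumes `counts` is a genes x cells matrix of raw counts
    (this orientation is an assumption made for the example).
    """
    counts = np.asarray(counts, dtype=float)
    # Total counts per cell (column sums), kept 2-D for broadcasting.
    depth = counts.sum(axis=0, keepdims=True)
    # Avoid division by zero for empty cells; their values stay at zero.
    depth[depth == 0] = 1.0
    # Scale each cell to 10,000 total counts, then apply log(1 + x).
    return np.log1p(counts / depth * 1e4)


# Toy example: two cells with identical expression proportions,
# but the second cell is sequenced ten times more deeply.
raw = np.array([
    [10, 100],
    [30, 300],
    [60, 600],
])
normalized = log_tp10k(raw)
```

In this toy case the two columns of `normalized` are identical, which shows how the transformation removes differences in sequencing depth that are still present in the raw counts.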

apeleraux commented 1 year ago

Thanks a lot for your answer. When I spoke of better classification, I meant higher accuracy in classifying unperturbed versus perturbed cells, so I believe we are on the same page. Thanks for pointing me to Extended Data Fig. 10; I will take a closer look at it.

kaizen89 commented 1 year ago

@skinnider Looking at the Augur code, it seems that when the input is a Seurat object, the default slot used is "data", which holds normalized data rather than the raw counts you recommend. Might it be worth changing the default behaviour?