Closed: hyjforesight closed this issue 2 years ago
Hi @hyjforesight, that comparison really relates to the CytoTRACE score (or pseudotime), so you should visualize that, rather than the arrows, which come further down the pipeline and are affected by further processing steps. So please visualize that score on the UMAP, and ideally, if you have a coarse understanding of different stages of cell-state transitions in this data, use these to show violin plots of the CytoTRACE score, aggregated by these stages (only if that's something you know).
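The per-stage aggregation suggested above can be sketched without any plotting library. This is a minimal illustration, not CellRank code: the function name `score_by_stage`, the toy scores, and the stage labels are all hypothetical. In a real analysis the same grouping would typically be drawn with `sc.pl.violin(adata, keys=..., groupby=...)`, using whichever `adata.obs` columns hold the score and the stage annotation.

```python
from collections import defaultdict
from statistics import median


def score_by_stage(scores, stages):
    """Group per-cell scores by stage label and return the median per stage."""
    grouped = defaultdict(list)
    for score, stage in zip(scores, stages):
        grouped[stage].append(score)
    return {stage: median(values) for stage, values in grouped.items()}


# Hypothetical toy data: six cells annotated with three coarse stages.
scores = [0.9, 0.8, 0.5, 0.6, 0.1, 0.2]
stages = ["early", "early", "mid", "mid", "late", "late"]

medians = score_by_stage(scores, stages)

# If the score tracks developmental potential, the medians should
# decrease from early to late stages.
is_monotonic = medians["early"] > medians["mid"] > medians["late"]
```

If the known stage order and the score medians disagree, that is a hint that the CytoTRACE assumption may not hold for the dataset, independent of how the arrows look on the embedding.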
Any updates here, @hyjforesight?
Hello @Marius1311, sorry, I was working on some wet-lab experiments. Please see the results below. They show that different preprocessing of the data affects the directions and pseudotime values (the pseudotime maps look similar, but people prefer the stream plot), which may mislead the conclusions we draw from CytoTRACE. It would be great if you could share your opinion on which type of processed data best reflects the underlying biology: no HVG slicing, HVG slicing, or HVG slicing + regression + scaling? Thanks! Best, YJ
import scanpy as sc

# Strategy 1: no HVG slicing
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000)  # only annotate HVGs, no slicing

# Strategy 2: HVG slicing
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000)
adata = adata[:, adata.var.highly_variable].copy()  # HVG slicing; copy to avoid working on a view

# Strategy 3: HVG slicing + regression + scaling
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000)
adata = adata[:, adata.var.highly_variable].copy()  # HVG slicing; copy to avoid working on a view
sc.pp.regress_out(adata, keys=['pct_counts_mt', 'pct_counts_rpl', 'pct_counts_rps'], n_jobs=16)  # regress out mitochondrial and ribosomal count percentages
sc.pp.scale(adata, max_value=10)  # scale each gene to unit variance, clip at 10
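One possible reason the three strategies disagree can be shown on toy data. The sketch below assumes, as in the original CytoTRACE formulation, that the score builds on the number of genes detected (value > 0) per cell. The matrix, the chosen HVG indices, and all function names are hypothetical, and only the centering part of scaling is mimicked.

```python
def genes_expressed_per_cell(matrix):
    """Count, for each cell (row), how many genes (columns) have a value > 0."""
    return [sum(1 for value in row if value > 0) for row in matrix]


def slice_genes(matrix, keep):
    """Keep only the gene columns listed in `keep` (mimics HVG slicing)."""
    return [[row[j] for j in keep] for row in matrix]


def center_genes(matrix):
    """Subtract each gene's mean across cells (the centering part of scaling)."""
    n_cells = len(matrix)
    n_genes = len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_cells for j in range(n_genes)]
    return [[row[j] - means[j] for j in range(n_genes)] for row in matrix]


# Hypothetical log-normalized matrix: 3 cells x 4 genes.
X = [
    [1.0, 2.0, 0.0, 3.0],
    [0.0, 2.0, 0.0, 0.0],
    [1.0, 2.0, 1.0, 3.0],
]
hvg = [0, 3]  # assumed HVG column indices, for illustration only

counts_all = genes_expressed_per_cell(X)                       # all genes
counts_sliced = genes_expressed_per_cell(slice_genes(X, hvg))  # after slicing
counts_centered = genes_expressed_per_cell(center_genes(X))    # after centering
```

With all genes, the first and third cells differ in their gene counts; after slicing they tie, and after centering the count measures "genes above their mean" rather than "genes detected". Under the stated assumption, a smoother-looking map from sliced and scaled data does not by itself show that the result is more accurate.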
Hi @hyjforesight, does the numbering (0-5) reflect what you believe to be the biological order of clusters in this data?
Re what type of processed data: please see the CytoTRACE tutorial and the original paper.
Hello @Marius1311, sorry for the late response.
Hi @hyjforesight, does the numbering (0-5) reflect what you believe to be the biological order of clusters in this data?
That's the tricky part: we're analyzing cancer data, whose clusters cannot be annotated with existing knowledge. So it's important to decide which type of data to use, because the directions and pseudotime will affect our conclusions. For example, if we use the raw data without HVG slicing (as your official tutorial does), cluster 0 appears to contain both differentiated and undifferentiated cells.
However, if we use HVG-sliced and scaled data, cluster 0 appears to contain only differentiated cells.
We have no idea which conclusion is right: using raw data as in the tutorial makes the directions quite messy, while using HVG-sliced and scaled data makes them smooth, just like the smooth stream the tutorial shows for zebrafish embryogenesis. What is your opinion?
Re what type of processed data: please see the CytoTRACE tutorial and the original paper.
Sorry, haven't read yet. Gonna read this amazing paper soon.
Thanks! Best, YJ
Hi YJ, please let me know whether your question persists once you've read the original publication. thanks.
hello @Marius1311
Sorry for the late response. I enjoyed reading your amazing paper. It updated my understanding of CellRank.
In the paper, RNA velocity-based analysis is used as the example to illustrate CellRank. Just as in the scVelo package, the raw data is loaded, normalized, and restricted to HVGs before running CellRank. I have no doubts about this procedure; we did the same.
But for the CytoTRACE kernel, because no example was shown in the paper, and the "CellRank beyond RNA velocity" tutorial only did log-transformation before passing the data into the kernel, we're still a little confused: sc.pp.filter_genes(adata, min_cells=10), scv.pp.normalize_per_cell(adata), and sc.pp.log1p(adata) before ctk = CytoTRACEKernel(adata). Not sure whether my interpretation is right.
Hi @hyjforesight, sorry for my delayed response. Here's our reasoning for why we use certain transformations in the CytoTRACEKernel:

1. [...] (sc.pp.filter_genes(adata, min_cells=10), scv.pp.normalize_per_cell(adata) and sc.pp.log1p(adata)).
2. We compute the correlation for all genes we selected in the previous step.
3. [...] moments function, which does kNN imputation. Under the hood, scv.pp.moments(adata) computes a PCA representation and a kNN graph in this representation. Note that by default, only genes annotated as "highly variable" are considered for PCA computation in scanpy (which scVelo uses under the hood). Thus, when we run sc.pp.highly_variable_genes(adata) in our tutorial, we annotate highly variable genes, but we don't filter. While in steps 1 and 2 we use all genes (subject to weak filtering), we use only highly variable genes for kNN imputation in step 3.
4. [...] CytoTRACEKernel to direct kNN edges in the direction of increasing CytoTRACE pseudotime.

That's our recommended best practice. We noticed that CytoTRACE works well in a number of developmental scenarios, especially in early development. Of course, you're free to experiment with preprocessing to find what works best for you; however, please consider the possibility that the CytoTRACE assumption might simply be violated in your biological setting. If that's the case, it's probably better to use a different method, rather than trying to make CytoTRACE work in a setting it's not designed for.
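The scoring logic behind the steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not CellRank's implementation: it assumes the score is the average imputed expression of the genes most correlated with the per-cell gene count, rescaled to [0, 1], with pseudotime taken as one minus the score. The function names, the `n_top_genes` parameter, and the toy matrix are hypothetical, and the kNN imputation step is skipped by passing the same matrix as `imputed`.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no correlation signal
    return cov / sqrt(var_x * var_y)


def cytotrace_like_score(expr, imputed, n_top_genes):
    """Return (score, pseudotime) per cell from a cells x genes matrix."""
    n_genes = len(expr[0])
    # Number of genes detected in each cell, using all (weakly filtered) genes.
    gene_counts = [sum(1 for v in row if v > 0) for row in expr]
    # Correlate every gene's expression with that count across cells.
    corrs = [pearson([row[j] for row in expr], gene_counts) for j in range(n_genes)]
    # Keep the genes with the highest correlation.
    top = sorted(range(n_genes), key=lambda j: corrs[j], reverse=True)[:n_top_genes]
    # Average the imputed expression of those genes, then rescale to [0, 1].
    raw = [sum(row[j] for j in top) / len(top) for row in imputed]
    lo, hi = min(raw), max(raw)
    score = [(r - lo) / (hi - lo) if hi > lo else 0.0 for r in raw]
    pseudotime = [1.0 - s for s in score]
    return score, pseudotime


# Hypothetical log-normalized matrix: 4 cells x 4 genes.
X = [
    [2.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [3.0, 1.0, 1.0, 1.0],
]

# Real pipelines would pass kNN-smoothed values as `imputed`; here X is reused.
score, pseudotime = cytotrace_like_score(X, X, n_top_genes=2)
```

In this toy example the cell with the most detected genes gets the highest score and the lowest pseudotime. Because the first two steps depend on which genes are present and on which values count as "detected", gene slicing or scaling before them changes the input to the score itself, not just its smoothness.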
Thank you @Marius1311 for such a detailed explanation! Appreciate it!
Hello CellRank, in the "CellRank beyond RNA velocity" tutorial, you input the raw data, do a brief preprocessing, and then initialize the CytoTRACE kernel. I processed my own data with this strategy. Strategy 1, the same as the tutorial: no HVG slicing, no regressing out, and no scaling, but the CytoTRACE map is very noisy.
Strategy 2, do HVG slicing, but no regress out, and no scaling. The CytoTRACE map looks better than Strategy 1, but still noisy.
Strategy 3, do HVG slicing, regress out, and scaling. The CytoTRACE map looks very smooth.
I'm wondering what causes the noise of the directions (maybe gene expression?) and whether it is reasonable to input the HVG sliced, regressed, and scaled data for CytoTRACE analysis? Thanks! Best, YJ