Is it reasonable to input regressed and scaled data into CytoTRACE kernel?

hyjforesight commented 2 years ago

... Hello CellRank, In the CellRank beyond RNA velocity tutorial, you input the raw data and do a brief prepocessing. And then initiate the CytoTRACE kernel.

# input raw data
adata = cr.datasets.zebrafish()
adata
AnnData object with n_obs × n_vars = 2434 × 23974
    obs: 'Stage', 'gt_terminal_states', 'lineages'
    uns: 'Stage_colors', 'gt_terminal_states_colors', 'lineages_colors'
    obsm: 'X_force_directed'

# brief preprocessing
sc.pp.filter_genes(adata, min_cells=10)
scv.pp.normalize_per_cell(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)    # only calculate HVG, no HVG slicing

adata.layers["spliced"] = adata.X
adata.layers["unspliced"] = adata.X
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

I proceed with my own data with this strategy. Strategy 1, the same as the tutorial, no HVG slicing, no regress out, and no scaling, but the CytoTRACE map is very noisy.

adata = sc.read('C:/Users/Park_Lab/Documents/LM7_LM12_raw.h5ad')

sc.pp.filter_cells(adata, min_genes=0)
sc.pp.filter_genes(adata, min_cells=20)
adata.var['mt'] = adata.var_names.str.startswith('MT-')
adata.var['rpl'] = adata.var_names.str.startswith('RPL')
adata.var['rps'] = adata.var_names.str.startswith('RPS')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt','rpl','rps'], percent_top=None, log1p=False, inplace=True)
sc.pl.violin(adata, keys=['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_rpl','pct_counts_rps'], jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_rpl')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_rps')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')
adata = adata[adata.obs.n_genes_by_counts < 10000, :]
adata = adata[adata.obs.pct_counts_mt < 100, :]
adata = adata[adata.obs.pct_counts_rpl < 100, :]
adata = adata[adata.obs.pct_counts_rps < 100, :]

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000)    # only select HVG, didn't do HVG slicing

adata.layers["spliced"] = adata.X
adata.layers["unspliced"] = adata.X
scv.pp.moments(adata, n_pcs=30, n_neighbors=15)

Strategy 2, do HVG slicing, but no regress out, and no scaling. The CytoTRACE map looks better than Strategy 1, but still noisy.

adata = sc.read('C:/Users/Park_Lab/Documents/LM7_LM12_raw.h5ad')

sc.pp.filter_cells(adata, min_genes=0)
sc.pp.filter_genes(adata, min_cells=20)
adata.var['mt'] = adata.var_names.str.startswith('MT-')
adata.var['rpl'] = adata.var_names.str.startswith('RPL')
adata.var['rps'] = adata.var_names.str.startswith('RPS')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt','rpl','rps'], percent_top=None, log1p=False, inplace=True)
sc.pl.violin(adata, keys=['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_rpl','pct_counts_rps'], jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_rpl')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_rps')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')
adata = adata[adata.obs.n_genes_by_counts < 10000, :]
adata = adata[adata.obs.pct_counts_mt < 100, :]
adata = adata[adata.obs.pct_counts_rpl < 100, :]
adata = adata[adata.obs.pct_counts_rps < 100, :]

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000)
adata = adata[:, adata.var.highly_variable]    # HVG slicing

adata.layers["spliced"] = adata.X
adata.layers["unspliced"] = adata.X
scv.pp.moments(adata, n_pcs=30, n_neighbors=15)

Strategy 3, do HVG slicing, regress out, and scaling. The CytoTRACE map looks very smooth.

adata = sc.read('C:/Users/Park_Lab/Documents/LM7_LM12_raw.h5ad')

sc.pp.filter_cells(adata, min_genes=0)
sc.pp.filter_genes(adata, min_cells=20)
adata.var['mt'] = adata.var_names.str.startswith('MT-')
adata.var['rpl'] = adata.var_names.str.startswith('RPL')
adata.var['rps'] = adata.var_names.str.startswith('RPS')
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt','rpl','rps'], percent_top=None, log1p=False, inplace=True)
sc.pl.violin(adata, keys=['n_genes_by_counts', 'total_counts', 'pct_counts_mt','pct_counts_rpl','pct_counts_rps'], jitter=0.4, multi_panel=True)
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_rpl')
sc.pl.scatter(adata, x='total_counts', y='pct_counts_rps')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')
adata = adata[adata.obs.n_genes_by_counts < 10000, :]
adata = adata[adata.obs.pct_counts_mt < 100, :]
adata = adata[adata.obs.pct_counts_rpl < 100, :]
adata = adata[adata.obs.pct_counts_rps < 100, :]

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000)
adata = adata[:, adata.var.highly_variable]    # HVG slicing
sc.pp.regress_out(adata, keys=['pct_counts_mt','pct_counts_rpl','pct_counts_rps'], n_jobs=16)    # regress out mitochrondrial genes
sc.pp.scale(adata, max_value=10)    # scale data

adata.layers["spliced"] = adata.X
adata.layers["unspliced"] = adata.X
scv.pp.moments(adata, n_pcs=30, n_neighbors=15)

I'm wondering what causes the noise of the directions (maybe gene expression?) and whether it is reasonable to input the HVG sliced, regressed, and scaled data for CytoTRACE analysis? Thanks! Best, YJ

Marius1311 commented 2 years ago

Hi @hyjforesight, that comparison really relates to the CytoTRACE score (or pseudotime), so you should visualize that, rather than the arrows, which come further down the pipeline and are affected by further processing steps. So please visualize that score on the UMAP, and ideally, if you have a coarse understanding of different stages of cell-state transitions in this data, use these to show violin plots of the CytoTRACE score, aggregated by these stages (only if that's something you know).

Marius1311 commented 2 years ago

Any updated here @hyjforesight ?

hyjforesight commented 2 years ago

Hello @Marius1311 Sorry, I was working on some wet experiments. Please see the below results. It shows that different processing of data will affect the directions and pseudotime values (though pseudotime map looks similar, people prefer the stream map), which may misdirect the conclusions we made based on cytoTRACE. It will be great if you could share some opinions about which type of processed data will be the best to reflect the reality of the biology truth, no HVG sliced, HVG sliced or HVG sliced+regressed+scaled data? Thanks! Best, YJ

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000)    # only select HVG, didn't do HVG slicing

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000)
adata = adata[:, adata.var.highly_variable]    # HVG slicing

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=5000)
adata = adata[:, adata.var.highly_variable]    # HVG slicing
sc.pp.regress_out(adata, keys=['pct_counts_mt','pct_counts_rpl','pct_counts_rps'], n_jobs=16)    # regress out mitochrondrial genes
sc.pp.scale(adata, max_value=10)    # scale data

Marius1311 commented 2 years ago

Hi @hyjforesight, does the numbering (0-5) reflect what you believe to be the biological order of clusters in this data?

Marius1311 commented 2 years ago

Re what type of processed data: see CytoTRCE tutorial and the original paper please.

hyjforesight commented 2 years ago

hello @Marius1311 Thanks for the late response.

Hi @hyjforesight, does the numbering (0-5) reflect what you believe to be the biological order of clusters in this data?

That's the tricky part because we're analyzing the cancer data, of which the clusters cannot be annotated with existing knowledge. So it would be important to decide which type of data should be used because the directions and pseudotime will affect our conclusions. For example, if we use the raw data with no HVG sliced (like your official tutorial did), we can tell that cluster 0 has both differentiated and undifferentiated cells.

However, if we use HVG sliced&scaled data, it will tell that cluster 0 only has the differentiated cells.

We've no idea which conclusion is right because using raw data as the tutorial did makes the directions quite messed, while using HVG sliced&scaled data makes directions smooth, just like the smooth stream the tutorial shows in the zebrafish embryogenesis. How about your opinions?

Re what type of processed data: see CytoTRCE tutorial and the original paper please.

Sorry, haven't read yet. Gonna read this amazing paper soon.

Thanks! Best, YJ

Marius1311 commented 2 years ago

Hi YJ, please let me know whether your question persists once you've read the original publication. thanks.

hyjforesight commented 2 years ago

hello @Marius1311 Sorry for the late response. I enjoyed reading your amazing paper. It updated my understanding of CellRank. In the paper, RNA velocity-based analysis was demonstrated as an example for elucidating CellRank. Just the same as the scVelo package does, raw data was input, normalized, HVG selected, and then proceeded with CellRank. I have no doubt about these procedures. We did the same. But for the CytoTRACE kernel, because no example was shown in the paper, and the CellRank beyond RNA velocity tutorial only did log-transformation before proceeding data into kernel, we're still a little confused:

Log-transformed data without selecting HVG and scaling makes the stream of CytoTRACE noisy for our own data (maybe too many cells or too many genes?). That's why it makes me think about using HVG selected or scaled data as an input for the cytoTRACE kernel of CellRANK.
By checking the CytoTRACE original package (https://cytotrace.stanford.edu/#e), I found that it requires non-normalized data. No need to do further HVG selection and scaling. The package will do log-normalization internally. I'm wondering whether the cytoTRACE kernel of CellRANK did log-normalization internally? It looks not, because the tutorial did sc.pp.filter_genes(adata, min_cells=10), scv.pp.normalize_per_cell(adata), and sc.pp.log1p(adata) before ctk = CytoTRACEKernel(adata). Not sure whether my interpretation is right.
CellRANK corrects the noise of RNA velocity by combining with the concept of gene expression similarity and softly allocating cells into macrostates. This reminds me the first introduction of pesudotime by Monocle. As we know, when making the trajectory and calculating pseudtoime, Monocle uses the log-transformed, HVG selected and scaled date. Does it give us a hint that we could also use log-transformed, HVG selected and scaled date for cytoTRACE kernel of CellRANK? Thanks! Best, YJ

Marius1311 commented 2 years ago

Hi @hyjforesight, sorry for my delayed response. Here's our reasoning for why we use certain transformations in the CytoTRACEKernel:

CytoTRACE counts the number of genes expressed per cell, that's the raw signal. For this to work, we have to consider all genes for which we have reliable expression quantification, not just highly expressed genes. Accordingly, we do some basic gene filtering in our tutorial, sc.pp.filter_genes(adata, min_cells=10)
CytoTRACE next correlates the number of genes expressed per cell with gene expression. We do this using normalized, log-transformed counts, as in the original publication (scv.pp.normalize_per_cell(adata) and sc.pp.log1p(adata)). We compute the correlation for all genes we selected in the previous step.
CytoTRACE then selects the top 200 genes which had the highest correlation values in step 2 and aggregates their imputed gene expression (e.g. mean aggregation). For the imputation, we use scVelo's moments function which does kNN imputation. Under the hood, scv.pp.moments(adata) computes a PCA representation and a kNN graph in this representation. Note that by default, only genes annotated as "highly variable" are considered for PCA computation in scanpy (which scVelo uses under the hood). Thus, when we run sc.pp.highly_variable_genes(adata) in our tutorial, we annotate highly variable genes, but we don't filter. While in step 1 and 2, we use all genes (subject to weak filtering), we use only highly variable genes for kNN imputation in step 3.
We continue to use the same kNN graph in the CytoTRACEKernel to direct kNN edges into the direction of increasing CytoTRACE pseudotime.

That's our recommended best practice. We noticed that CytoTRACE works well in a number of developmental scenarios, especially in early development. Of course, you're free to experiment with preprocessing to find what works best for you; however, please consider the possibility that the CytoTRACE assumption might simply be violated in your biological setting. If that's the case, it's probably better to use a different method, rather than trying to make CytoTRACE work in a setting it's not designed for.

Links

CytoTRACE tutorial: https://cellrank.readthedocs.io/en/stable/beyond_rna_velocity.html
CytoTRACE paper: https://www.science.org/doi/10.1126/science.aax0249

hyjforesight commented 2 years ago

Thank you @Marius1311 for such a detailed explanation! Appreciate it!

theislab / cellrank

Is it reasonable to input regressed and scaled data into CytoTRACE kernel? #826