scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

why scale is not in spatial #2963

Closed: asmlgkj closed this issue 5 months ago

asmlgkj commented 5 months ago

What happened?

Thanks a lot, I am new to scanpy.

In the spatial tutorial at https://scanpy-tutorials.readthedocs.io/en/latest/spatial/basic-analysis.html, why is the scale step not included, when it is needed in traditional scRNA-seq analysis?

import scanpy as sc

# `adata` is the AnnData object loaded earlier in the tutorial
sc.pp.normalize_total(adata, inplace=True)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=2000)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(
    adata, key_added="clusters", flavor="igraph", directed=False, n_iterations=2
)

### Versions

<details>

anndata 0.10.3
scanpy 1.9.8

</details>

asmlgkj commented 5 months ago

I found https://github.com/scverse/scanpy/issues/2164, which discusses this, but it does not seem to reach a conclusion. Can anyone give me some help? Seurat seems to use scaling all the time.

ivirshup commented 5 months ago

My impression has been that the densifying scale transform didn't show performance improvements in a number of benchmarks. Skipping it is also the workflow used in sc-best-practices.

@Zethson do you have a good citation for this?
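
To illustrate what "densifying" refers to above, here is a minimal NumPy sketch. It assumes a small made-up count matrix (the values and variable names are hypothetical) and uses a hand-written per-gene z-score rather than a call to `sc.pp.scale`. Subtracting each gene's mean turns its zeros into nonzero values, so a mostly-zero matrix can no longer be stored sparsely.

```python
import numpy as np

# Hypothetical toy matrix: 4 cells (rows) x 3 genes (columns), mostly zeros.
X = np.array(
    [
        [0.0, 2.0, 0.0],
        [0.0, 0.0, 1.0],
        [3.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
    ]
)

# Per-gene z-score: subtract the gene mean, divide by the gene standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)
Z = (X - mean) / std

# Fraction of entries that would have to be stored explicitly.
frac_nonzero_before = np.count_nonzero(X) / X.size
frac_nonzero_after = np.count_nonzero(Z) / Z.size

print(f"nonzero fraction before scaling: {frac_nonzero_before:.2f}")
print(f"nonzero fraction after scaling:  {frac_nonzero_after:.2f}")
```

For reference, `sc.pp.scale` has a `zero_center` parameter; with `zero_center=False` the mean is not subtracted, so zeros stay zero, at the price of no longer producing true z-scores.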

Zethson commented 5 months ago

IIRC, it's discussed in more detail in Malte's paper:

In the same way that cellular count data can be normalized to make them comparable between cells, gene counts can be scaled to improve comparisons between genes. Gene normalization constitutes scaling gene counts to have zero mean and unit variance (z scores). This scaling has the effect that all genes are weighted equally for downstream analysis. There is currently no consensus on whether or not to perform normalization over genes. While the popular Seurat tutorials (Butler et al, 2018) generally apply gene scaling, the authors of the Slingshot method opt against scaling over genes in their tutorial (Street et al, 2018). The preference between the two choices revolves around whether all genes should be weighted equally for downstream analysis, or whether the magnitude of expression of a gene is an informative proxy for the importance of the gene. In order to retain as much biological information as possible from the data, we opt to refrain from scaling over genes in this tutorial.

https://www.embopress.org/doi/full/10.15252/msb.20188746

Since there has been no new development on this topic, we cited Malte and also opted not to scale. This is also discussed by Malte himself in the issue that was cited above.

I cannot make confident statements about spatial data itself.
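
The trade-off in the quoted passage (equal weighting of genes versus keeping expression magnitude) can be made concrete with a small NumPy sketch. Everything here is an illustrative assumption: the two simulated genes, their means and spreads, and the helper names `zscore` and `variance_share` are hypothetical and not part of scanpy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log-normalized expression: 200 cells x 2 genes.
# gene_a varies a lot between cells, gene_b only a little.
gene_a = rng.normal(loc=5.0, scale=2.0, size=200)
gene_b = rng.normal(loc=1.0, scale=0.2, size=200)
X = np.column_stack([gene_a, gene_b])


def zscore(X):
    """Scale each gene (column) to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def variance_share(X):
    """Fraction of the total variance contributed by each gene."""
    var = X.var(axis=0, ddof=1)
    return var / var.sum()


share_unscaled = variance_share(X)
share_scaled = variance_share(zscore(X))

print("unscaled:", np.round(share_unscaled, 3))
print("scaled:  ", np.round(share_scaled, 3))
```

Without scaling, almost all of the variance comes from `gene_a`, so a variance-driven method such as PCA is dominated by it. After z-scoring, both genes contribute equally, which is the "all genes are weighted equally" effect the passage describes.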

asmlgkj commented 5 months ago

Thank you very much for your authoritative answer! You mentioned that in some benchmarks, performing the densifying scale transform didn't show significant performance improvements. I also noticed that sc-best-practices adopts a similar workflow.

However, I have a further question: if the step of adding this densifying scale transform is included, would it negatively impact the overall performance? For example, would it reduce the training or inference speed? Or would the impact be negligible?

Thank you again for taking the time to answer my questions! Your opinions are very insightful and helpful to me. I look forward to your further guidance!
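
One cost of densifying that can be estimated without a benchmark is memory. The following is a back-of-the-envelope sketch only: the dataset dimensions, the 10% nonzero fraction, and the float32/int32 storage sizes are assumptions chosen for illustration, not figures from this thread, and it says nothing about runtime.

```python
# Hypothetical dataset size, chosen only for illustration.
n_cells = 100_000
n_genes = 2_000       # e.g. after selecting highly variable genes
nonzero_percent = 10  # assumed share of nonzero entries before scaling

bytes_per_value = 4  # float32
bytes_per_index = 4  # int32 column index / row pointer

# Dense storage: every entry is stored.
dense_bytes = n_cells * n_genes * bytes_per_value

# CSR storage: one value and one column index per nonzero entry,
# plus one row pointer per row (and one extra).
nnz = n_cells * n_genes * nonzero_percent // 100
sparse_bytes = nnz * (bytes_per_value + bytes_per_index) + (n_cells + 1) * bytes_per_index

print(f"dense:  {dense_bytes / 1e6:.0f} MB")
print(f"sparse: {sparse_bytes / 1e6:.0f} MB")
```

Under these assumptions the dense matrix needs about five times the memory of the sparse one; the ratio depends directly on how sparse the data are.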

asmlgkj commented 5 months ago

Thanks a lot. So is there a conclusion or recommendation on whether or not to scale spatial data? @ivirshup @Zethson

Zethson commented 5 months ago

I would CC @AnnaChristina @giovp

asmlgkj commented 5 months ago

What does CC mean? Thanks a lot.

Zethson commented 5 months ago

@asmlgkj sorry, I just wanted to bring two experts into this discussion :)

To "CC" someone in an email means to send them a copy of the email while indicating that they are not the primary recipient. The term "CC" stands for "Carbon Copy," originating from the paper correspondence era where a carbon paper was used to make a copy of a letter. In the context of emails, adding a recipient's email address in the CC field sends them a copy of the email as an FYI (For Your Information).

asmlgkj commented 5 months ago

You are really kind and great

giovp commented 5 months ago

I think what's in the best practices book should probably be followed in spatial as well, at least for array-based technologies like Visium. For Xenium and other image-based transcriptomics, this standard processing also seems to work, but it is probably not optimal.

ivirshup commented 5 months ago

I think this question has been answered so will close the issue. Let me know if not!

asmlgkj commented 5 months ago

It seems that there is no final decision on whether or not to scale spatial data. @ivirshup thanks a lot