Inconsistency between assay data used and what is advised in SingleR?

denvercal1234GitHub commented 10 months ago

Hi there,

Q1: In the tutorial, logcounts were used, but the SingleR documentation seems to advise against using transformed data and to prefer raw counts. Would you mind clarifying whether there is a reason you use logcounts here?

Q2: If my sce object holds flow data, should I simply set Matrix::Matrix(sparse = FALSE)?

Thank you.

# Get cell type reference data
blueprint <- celldex::BlueprintEncodeData()

# Infer cell identities
cell_type_df <-
    assays(pbmc_small_UMAP)$logcounts %>%
    Matrix::Matrix(sparse = TRUE) %>%
    SingleR::SingleR(
        ref = blueprint,
        labels = blueprint$label.main,
        method = "single"
    ) %>%
    as.data.frame() %>%
    as_tibble(rownames="cell") %>%
    select(cell, first.labels)

From SingleR (https://bioconductor.org/books/release/SingleRBook/classic-mode.html): "For the test data, the assay data need not be log-transformed or even (scale) normalized. This is because SingleR() computes Spearman correlations within each cell, which is unaffected by monotonic transformations like cell-specific scaling or log-transformation. It is perfectly satisfactory to provide the raw counts for the test dataset to SingleR(), which is the reason for setting assay.type.test=1 in our previous SingleR() call for the Grun dataset."

denvercal1234GitHub commented 10 months ago

The test data does not need to be log-transformed, but it is fine if it is: "Technically speaking, the test dataset does not need log-expression values but we compute them anyway for convenience."

stemangiola commented 10 months ago

Could you point me to the tutorial you are referring to?

stemangiola commented 10 months ago

@susansjy22 is this the case for HPCell?

denvercal1234GitHub commented 10 months ago

"Could you point to which tutorial you are referring to?"

This guide: https://stemangiola.github.io/tidySingleCellExperiment/articles/introduction.html

stemangiola commented 10 months ago

@william-hutchison could you please help with this?

william-hutchison commented 10 months ago

Regarding point 1, I think the tutorial material is okay as is. The test data does not have to be log-transformed, but it does not have to be raw either:

"For the test data, the assay data need not be log-transformed or even (scale) normalized. This is because SingleR() computes Spearman correlations within each cell, which is unaffected by monotonic transformations like cell-specific scaling or log-transformation." https://bioconductor.org/books/release/SingleRBook/classic-mode.html
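To make the quoted point concrete, here is a minimal base-R sketch with made-up counts (the matrix, the library-size scaling, and the pseudocount of 1 are illustrative assumptions, not taken from the tutorial or from SingleR internals). It checks that the rank of each gene within a cell is the same before and after cell-specific scaling plus log-transformation:

```r
# Toy count matrix: 6 genes (rows) x 3 cells (columns). Values are made up.
counts <- matrix(
    c(0, 3, 10, 1, 7, 2,
      5, 0, 0, 12, 4, 1,
      2, 2, 8, 0, 30, 6),
    nrow = 6,
    dimnames = list(paste0("gene", 1:6), paste0("cell", 1:3))
)

# Cell-specific scaling: divide each cell by a library-size-based factor.
size_factors <- colSums(counts) / mean(colSums(counts))
normalized <- sweep(counts, 2, size_factors, "/")

# Log-transformation with a pseudocount of 1.
logged <- log2(normalized + 1)

# Rank the genes within each cell, before and after the transformations.
ranks_raw <- apply(counts, 2, rank)
ranks_logged <- apply(logged, 2, rank)

# TRUE when every within-cell rank (including ties) is preserved.
ranks_unchanged <- all(ranks_raw == ranks_logged)
print(ranks_unchanged)
```

Because both steps are strictly increasing functions applied within a cell, ties stay ties and the ordering of genes does not change, which is all a rank-based statistic looks at.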

I assume this is why you closed the issue, @denvercal1234GitHub? Please let us know if you have any further concerns though.

I could add a note on this topic to the tutorial, although given that both raw and log-transformed test data work, this information may be unnecessary for the user.

stemangiola commented 10 months ago

The thing I would like to understand is: is providing log-transformed data an error?

Either SingleR has a way to detect whether the data is logged, or we have a problem, because applying a statistic designed for non-logged data to logged data (or the other way around) is almost never a good idea.

susansjy22 commented 10 months ago

"@susansjy22 is this the case for HPCell?"

For HPCell, logNormCounts() is used to transform the test data prior to annotation with SingleR(). This should be fine, as @william-hutchison has stated. SingleR() uses Spearman correlation, which relies on the rank order of values rather than their actual magnitudes, so monotonic transformations such as log-transformation or cell-specific scaling do not affect the analysis.
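As a rough illustration of why the choice of statistic matters here, the base-R sketch below compares Spearman and Pearson correlation on raw and transformed values. The vectors and the scaling factor of 2.5 are invented for the example, and this is not how SingleR computes its scores internally:

```r
# Made-up expression values for one test cell and one reference profile
# across the same eight genes.
test_counts <- c(0, 1, 3, 8, 20, 55, 150, 400)
reference <- c(0.2, 1.5, 1.1, 3.0, 4.2, 3.9, 6.5, 7.8)

# Scale by an arbitrary factor, then log-transform with a pseudocount of 1.
test_logged <- log2(test_counts / 2.5 + 1)

# Rank-based correlation, before and after the transformation.
spearman_raw <- cor(test_counts, reference, method = "spearman")
spearman_logged <- cor(test_logged, reference, method = "spearman")

# Magnitude-based correlation, before and after the transformation.
pearson_raw <- cor(test_counts, reference, method = "pearson")
pearson_logged <- cor(test_logged, reference, method = "pearson")

spearman_same <- isTRUE(all.equal(spearman_raw, spearman_logged))
pearson_same <- isTRUE(all.equal(pearson_raw, pearson_logged))

print(c(spearman_same = spearman_same, pearson_same = pearson_same))
```

The Spearman value is the same for both inputs, while the Pearson value changes. A magnitude-based statistic would indeed be sensitive to whether the data is logged; a rank-based one is not.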

william-hutchison commented 10 months ago

Yes, I can confirm. I just tested SingleR with both logcounts and counts, and the output was identical.
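A comparison along these lines could be sketched as follows. This is not necessarily the test that was run above: it assumes the SingleR and scuttle packages are installed and uses the simulated data from SingleR's own documentation examples (`.mockRefData()` and `.mockTestData()`) rather than the tutorial's `pbmc_small` object and the Blueprint reference.

```r
library(SingleR)

# Simulated reference and test data used in SingleR's documentation examples.
set.seed(100)
ref <- .mockRefData()
test <- .mockTestData(ref)

# Log-normalize the reference, as SingleR expects by default.
ref <- scuttle::logNormCounts(ref)

# Add a "logcounts" assay to the test data so that both assays are available.
test <- scuttle::logNormCounts(test)

# Annotate the same cells twice, once per test assay.
pred_counts <- SingleR(test, ref, labels = ref$label,
                       assay.type.test = "counts")
pred_logcounts <- SingleR(test, ref, labels = ref$label,
                          assay.type.test = "logcounts")

# Compare the assigned labels cell by cell.
labels_match <- identical(pred_counts$labels, pred_logcounts$labels)
print(labels_match)
```

Only the test assay is switched between the two calls; the reference stays log-normalized in both.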

stemangiola commented 10 months ago

Very well explained! OK, in this case, @susansjy22, let's omit the transformation before running SingleR; @william-hutchison, we can omit it in the tidySCE documentation as well.

Please link the pull requests from the respective repositories to this issue (under the Development menu on the right of the pull request).

stemangiola / tidySingleCellExperiment

Inconsistency between assay data used and what is advised in SingleR? #93