scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

ingest confidence thresholding #3160

Open sakatash opened 1 month ago

sakatash commented 1 month ago

What kind of feature would you like to request?

Additional function parameters / changed functionality / changed defaults?

Please describe your wishes

First, thank you for the amazing tool you guys developed!

I am currently trying to use the ingest function to map cluster identities from a single-cell dataset to a spatial dataset. The single-cell dataset covers only a subset of cell types, while the spatial dataset contains all of the cells. The single-cell dataset has many genes, while the spatial dataset has only a few. I extract the genes common to both datasets and use those to run the ingest function, which works quite well!

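import scanpy as sc

# Restrict both datasets to the genes they share
common_genes = sc_data.var_names.intersection(spatial_data.var_names)
sc_data = sc_data[:, common_genes].copy()  # copy so later steps don't operate on a view
spatial_data = spatial_data[:, common_genes].copy()

# Reference: PCA, neighbor graph, and UMAP
sc.tl.pca(sc_data, svd_solver='arpack')
sc.pp.neighbors(sc_data, use_rep='X_pca')
sc.tl.umap(sc_data)

# Query: PCA, then transfer labels from the reference
sc.tl.pca(spatial_data, svd_solver='arpack')
sc.tl.ingest(spatial_data, sc_data, obs='unified_clusters')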

The issue I am running into is that the ingest function forces an identity onto cells even though the confidence of that identity is probably very low.

I am getting around that, in part, by subsetting the spatial dataset, but it would be terrific if I could use a confidence parameter to specify which cells would get an identity at all.

I was playing around with the ingest scripts and was thinking of something like this:

    def map_labels(self, labels: str, method: str, confidence_threshold: float = 0.5):

        if method == 'knn':
            self.neighbors()
            cat_array: pd.Series = self._adata_ref.obs[labels].astype("category")

            confident_labels = []
            for inds in self._indices:
                mode_label = cat_array.iloc[inds].mode()[0]
                mode_count = (cat_array.iloc[inds] == mode_label).sum()

                confidence = mode_count / len(inds)

                if confidence >= confidence_threshold:
                    confident_labels.append(mode_label)
                else:
                    confident_labels.append('unassigned')

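            # 'unassigned' must be listed as a category, otherwise pandas turns it into NaN
            categories = [*cat_array.cat.categories, 'unassigned']
            self._adata_new.obs[labels] = pd.Categorical(values=confident_labels, categories=categories)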
        else:
            raise NotImplementedError("Ingest supports knn labeling for now.")
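
A side note on the 'unassigned' placeholder: as far as I know, `pd.Categorical` converts any value that is not among the given `categories` to NaN, so whether the placeholder survives depends on whether it is listed as a category. The toy example below illustrates this; the category names are made up and are not from my data.

```python
import pandas as pd

# Toy reference categories (hypothetical names, for illustration only).
categories = ["B cell", "T cell"]
votes = ["B cell", "unassigned", "T cell"]

# A value missing from `categories` ends up as NaN.
dropped = pd.Categorical(values=votes, categories=categories)
print(dropped.isna().tolist())

# Listing the placeholder as a category keeps it as a real label.
kept = pd.Categorical(values=votes, categories=[*categories, "unassigned"])
print(kept.tolist())
```

Either outcome may be fine (NaN is a reasonable way to mark "no label"), but it is worth choosing one deliberately.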

Perhaps I am misunderstanding the tool, or am unaware of another tool that exists for this purpose. I would love input and help.

sakatash commented 1 month ago

Update:

I managed to get confidence thresholding working with this type of logic:

    def _knn_classify(self, labels):
        # ensure it's categorical
        cat_array: pd.Series = self._adata_ref.obs[labels].astype("category")
        values = []
        confidences = []

        for inds in self._indices:
            mode_value = cat_array.iloc[inds].mode()[0]
            mode_count = (cat_array.iloc[inds] == mode_value).sum()
            confidence = mode_count / len(inds)
            values.append(mode_value)
            confidences.append(confidence)

        # Create a DataFrame for better readability
        classification_df = pd.DataFrame({
            "Mode Values": values,
            "Confidences": confidences
        })
        print(classification_df)

        return pd.Categorical(values=values, categories=cat_array.cat.categories), np.array(confidences)

    def map_labels(self, labels, method, confidence_threshold: float = 0.5):
        """\
        Map labels of `adata` to `adata_new`.

        This function infers `labels` for `adata_new.obs`
        from existing labels in `adata.obs`.
        `method` can be only 'knn'.
        """
        if method == "knn":
            classified_labels, confidences = self._knn_classify(labels)
            mask = confidences >= confidence_threshold

            filtered_labels = [
                label if mask[idx] else np.nan 
                for idx, label in enumerate(classified_labels)
            ]

            classified_labels = pd.Categorical(
                filtered_labels,
                categories=classified_labels.categories
            )

            self._adata_new.obs[labels] = classified_labels
            self._adata_new.obs[labels + '_confidence'] = confidences
        else:
            raise NotImplementedError("Ingest supports knn labeling for now.")

I would love to get input on whether or not this makes sense.
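
For anyone who wants to try the voting logic outside of scanpy, here is a minimal standalone sketch of the same idea: a majority vote over each cell's neighbors, with the winning vote fraction used as the confidence. The function name `knn_vote` and the toy data are hypothetical, and the `indices` array only stands in for what the methods above read from `self._indices`. This is not scanpy's implementation.

```python
import numpy as np
import pandas as pd


def knn_vote(indices, ref_labels, confidence_threshold=0.5):
    """Majority vote over each row of `indices` with a vote-fraction cutoff.

    `indices` is an (n_new, k) integer array of reference-cell positions
    (an assumption standing in for the neighbor indices used above).
    """
    cats = pd.Categorical(ref_labels)
    codes = cats.codes[np.asarray(indices)]  # (n_new, k) label codes
    n_cats = len(cats.categories)
    # Count the votes each category receives for every query cell.
    counts = np.stack([(codes == c).sum(axis=1) for c in range(n_cats)], axis=1)
    winner = counts.argmax(axis=1)  # ties go to the first category in order
    confidence = counts.max(axis=1) / codes.shape[1]
    # Code -1 is how pandas encodes a missing (NaN) categorical value.
    winner = np.where(confidence >= confidence_threshold, winner, -1)
    labels = pd.Categorical.from_codes(winner, categories=cats.categories)
    return labels, confidence


# Toy data: 7 reference cells, 3 query cells with k=4 neighbors each.
ref = ["A", "A", "A", "B", "B", "B", "C"]
idx = np.array([[0, 1, 2, 3], [0, 3, 4, 6], [3, 4, 5, 0]])
labels, conf = knn_vote(idx, ref, confidence_threshold=0.6)
print(list(labels), conf)
```

With a threshold of 0.6, the second query cell (two of four neighbors agree) is left unlabeled, while the other two keep their majority label. Note that this fraction only measures agreement among the neighbors that were found; it says nothing about how far away those neighbors are, which may matter for cell types that are absent from the reference.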

flying-sheep commented 1 month ago

Hi! I don’t know if it makes sense statistically, but having a metric like this would be nice.

@Koncopd could you please take a look?