cell_id or cell_type for dataset.data_params.covariate_keys

theislab / chemCPA

Code for "Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution", NeurIPS 2022.

https://arxiv.org/abs/2204.13545

MIT License


cell_id or cell_type for dataset.data_params.covariate_keys #129

Closed (bhomass closed 1 year ago)

bhomass commented 1 year ago

A few .yaml files declare dataset.data_params.covariate_keys: cell_id, while most declare dataset.data_params.covariate_keys: cell_type.

However, the code in data.py

for i in range(len(self)):
    drug = indx(self.drugs_names, i)
    cov = indx(self.covariate_names["cell_type"], i)

hard-codes the cell_type key.

Should I assume all instances of "cell_id" need to be converted to "cell_type"?

It turns out the two columns hold the same values: adata.obs['cell_type'] = adata.obs['cell_id']

MxMstrmn commented 1 year ago

Hi @bhomass,

Can you give more detail on the dataset you are referring to? I assume the Sciplex data? I agree that this bit of the code needs some refactoring. To give a bit more context: chemCPA is designed to operate with any number of covariates but requires the cell_type one, which should always be present.
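Until that refactoring happens, a small guard could surface the mismatch at config-load time instead of deep inside data.py. The sketch below is not part of chemCPA: `check_covariate_keys` and `REQUIRED_COVARIATE` are hypothetical names, and it assumes the YAML value may arrive either as a single string or as a list of strings.

```python
# Hypothetical guard, not part of the chemCPA code base.
REQUIRED_COVARIATE = "cell_type"


def check_covariate_keys(covariate_keys):
    """Return the keys as a list, or raise if 'cell_type' is missing."""
    # Assumption: the YAML value may be a single string or a list of strings.
    if isinstance(covariate_keys, str):
        keys = [covariate_keys]
    else:
        keys = list(covariate_keys)
    if REQUIRED_COVARIATE not in keys:
        raise ValueError(
            f"covariate_keys={keys!r} lacks the required "
            f"{REQUIRED_COVARIATE!r} covariate"
        )
    return keys
```

Calling `check_covariate_keys("cell_id")` raises a `ValueError`, which would flag the affected .yaml files before any dataset is built.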

bhomass commented 1 year ago

Any yaml file with covariate_keys set to cell_id needs to be changed to cell_type. There are many such yaml files throughout the repo. This is independent of the dataset. Like you said, the data.py code is fixed to look for cell_type.

You can see cov = indx(self.covariate_names["cell_type"], i), but self.covariate_names is built from self.covariate_keys:

        self.covariate_names = {}
        for cov in self.covariate_keys:
            self.covariate_names[cov] = indx(dataset.covariate_names[cov], indices)
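To make the mismatch concrete, here is a minimal, self-contained sketch that mimics the two snippets above with plain dictionaries. It is an illustration under assumptions, not the repo's code: `indx` is assumed to be simple indexing, `build_covariate_names` is a hypothetical stand-in for the loop, and the cell line names are toy values.

```python
# Hypothetical stand-in for the repo's helper; assumed to be plain indexing.
def indx(values, i):
    return values[i] if values is not None else None


def build_covariate_names(dataset_covariates, covariate_keys, indices):
    """Mirror the loop from the issue: one entry per configured key."""
    return {
        key: [dataset_covariates[key][i] for i in indices]
        for key in covariate_keys
    }


# Toy data: both columns carry the same values, as reported in the thread.
dataset_covariates = {
    "cell_id": ["A549", "K562", "MCF7"],
    "cell_type": ["A549", "K562", "MCF7"],
}
indices = [0, 2]

# Config with covariate_keys: cell_id, so no "cell_type" entry is created.
names = build_covariate_names(dataset_covariates, ["cell_id"], indices)
try:
    indx(names["cell_type"], 0)
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Config with covariate_keys: cell_type, so the hard-coded lookup works.
names_ok = build_covariate_names(dataset_covariates, ["cell_type"], indices)
first_cov = indx(names_ok["cell_type"], 0)
```

With `cell_id` as the only key, the hard-coded `"cell_type"` lookup raises a `KeyError` even though the underlying values are identical, which is why renaming the key in the yaml files resolves it.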