Batch correction incorrectly assigns cell labels

Hrovatin commented 9 months ago

I tried to integrate some of my own data and then to reproduce the example from https://scgen.readthedocs.io/en/stable/tutorials/scgen_batch_removal.html , but it seems that the latent data and the obs are not joined correctly, so cell latent embeddings end up paired with the wrong labels.

This is the result from the tutorial, with a clear mismatch between cell type clusters and cell labels

I think the reason could be in https://github.com/theislab/scgen/blob/06084773e56cad0dec340138441dee47a39af752/scgen/_scgen.py#L315C16-L315C16 , since that line doesn't check that the indices match, but I haven't tested this, so the cause may be different.

scGEN version: 2.1.1

P.S. The tutorial also has other mistakes, like cell_type->celltype, and use_rep is missing in the neighbours computation for the latent representation.

Hrovatin commented 9 months ago

Indeed, changing the above line to corrected.obsm["latent"] = all_corrected_data[corrected.obs_names,:].X fixes the issue.
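To illustrate the suspected failure mode outside scGen, here is a minimal sketch in plain NumPy. The cell names, the toy latent values, and the helper `align_rows` are all hypothetical and not part of scGen or AnnData. It only shows why assigning rows by position can mislabel cells when the two objects are ordered differently, and how reordering by name avoids that.

```python
import numpy as np

# Hypothetical toy data (not from scGen): the first latent value of each
# row encodes which cell it belongs to (cell_a -> 1, cell_b -> 2, ...).
obs_names = np.array(["cell_a", "cell_b", "cell_c", "cell_d"])      # row order of `corrected`
latent_names = np.array(["cell_c", "cell_a", "cell_d", "cell_b"])   # row order of the latent data
latent = np.array([[3.0, 3.0], [1.0, 1.0], [4.0, 4.0], [2.0, 2.0]])


def align_rows(values, value_names, target_names):
    """Return the rows of `values` reordered to follow `target_names`."""
    position = {name: i for i, name in enumerate(value_names)}
    if len(position) != len(value_names):
        raise ValueError("value_names contains duplicates")
    missing = [name for name in target_names if name not in position]
    if missing:
        raise KeyError(f"names without a latent row: {missing}")
    order = [position[name] for name in target_names]
    return values[order]


# Positional assignment: row i of `latent` is paired with obs_names[i],
# which is wrong whenever the two orders differ.
misaligned = latent

# Name-based assignment: each obs name receives its own latent row.
aligned = align_rows(latent, latent_names, obs_names)

for name, row in zip(obs_names, aligned):
    print(name, row)
```

In the fix above, indexing `all_corrected_data` by `corrected.obs_names` plays the same role as `align_rows` here.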

I also needed to add .detach() to self.module.generative(torch.Tensor(all_corrected_data.X))["px"].cpu().detach().numpy()

M0hammadL commented 9 months ago

Hi Karin

Thanks for pointing this out. Could you kindly open a PR with that fix? We can merge it then.

M0hammadL commented 9 months ago

Btw, you can do the same thing with cpa, see here:

https://cpa-tools.readthedocs.io/en/latest/tutorials/Batch_correction_in_expression_space.html

Hrovatin commented 9 months ago

I opened the PR at https://github.com/theislab/scgen/pull/87 and would just merge it despite black failing, as I didn't introduce any major formatting changes apart from the 4 lines mentioned above.

Batch correction incorrectly assigns cell labels #86