scverse / scvi-tools

Deep probabilistic analysis of single-cell and spatial omics data
http://scvi-tools.org/
BSD 3-Clause "New" or "Revised" License

Fix custom dataloader registry #2907

Open canergen opened 3 months ago

canergen commented 3 months ago

Custom dataloaders currently don't support advanced capabilities such as scArches or cell-type prediction in scANVI. To enable these, we have to create a registry, without calling setup_anndata, that contains the same elements (see below). https://github.com/chanzuckerberg/cellxgene-census/blob/222efddf2ce82f93f76329aa353962c1dc2400ac/api/python/notebooks/experimental/pytorch_loader_scvi.ipynb is the first working example. Currently, they use the following code to save the model:

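# Collect (name, value) pairs for the model's attributes.
user_attributes = model._get_user_attributes()
# Keep only the attributes whose name ends with an underscore.
user_attributes = {name: value for name, value in user_attributes if name.endswith("_")}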

user_attributes.update(
    {
        "n_batch": datamodule.n_batch,
        "n_extra_categorical_covs": 0,
        "n_extra_continuous_covs": 0,
        "n_labels": 1,
        "n_vars": datamodule.n_vars,
    }
)

We want to create a new function that fills out the registry and passes it to the model when it is constructed, i.e. at: model = scvi.model.SCVI(n_layers=n_layers, n_latent=n_latent, gene_likelihood="nb", encode_covariates=False). You can see all necessary entries and their structure at: scvi.adata_manager.get_state_registry(scvi.REGISTRY_KEYS.X_KEY).to_dict(). Once this is fixed, all uses of _module_init_on_train throughout the codebase should be removed, as they will no longer be necessary.
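As a rough sketch only, the save-time snippet above could be folded into one helper. This does not build the full registry (its structure is not reproduced here). The function name build_user_attributes and the stand-in datamodule object are hypothetical and only serve the illustration; the hard-coded counts are copied from the snippet above.

```python
from types import SimpleNamespace


def build_user_attributes(attribute_pairs, datamodule):
    """Hypothetical helper mirroring the snippet above.

    attribute_pairs: iterable of (name, value) tuples, e.g. the result of
        model._get_user_attributes().
    datamodule: any object exposing n_batch and n_vars.
    """
    # Keep only attributes whose name ends with an underscore.
    attrs = {name: value for name, value in attribute_pairs if name.endswith("_")}
    # Add the counts that setup_anndata would normally have recorded.
    attrs.update(
        {
            "n_batch": datamodule.n_batch,
            "n_extra_categorical_covs": 0,
            "n_extra_continuous_covs": 0,
            "n_labels": 1,
            "n_vars": datamodule.n_vars,
        }
    )
    return attrs


# Stand-in inputs so the sketch runs without a trained model (assumptions).
pairs = [("is_trained_", True), ("history_", None), ("module", object())]
datamodule = SimpleNamespace(n_batch=3, n_vars=2000)
attrs = build_user_attributes(pairs, datamodule)
```

With a real model, attribute_pairs would come from model._get_user_attributes() and datamodule would be the custom datamodule used for training.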

gokceneraslan commented 2 months ago

Is there any documentation on what is expected of the custom dataloader's collate function? I can guess at a dict with keys like X, batch and labels just by following the different exceptions I am getting. But for poor souls like us who are not familiar with the codebase, it'd be amazing to have some documentation of which keys the dictionary returned by a collate function needs to contain in order to work.
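For illustration only, a collate function of the shape guessed at above might look like the sketch below. The keys X, batch and labels are taken from this comment and are not a confirmed specification. The function name collate_cells and the per-cell record layout are hypothetical. Plain Python lists are used to keep the sketch dependency-free; a real dataloader would presumably convert the values to tensors.

```python
# Keys assumed from the comment above; not a confirmed specification.
REQUIRED_KEYS = ("X", "batch", "labels")


def collate_cells(records):
    """Combine per-cell records into a single batch dictionary.

    Each record is a dict holding one cell's expression vector ("X") and
    its integer batch and label codes. Missing keys raise a KeyError that
    names the offending record, instead of failing later inside the model.
    """
    for i, record in enumerate(records):
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise KeyError(f"record {i} is missing keys: {missing}")
    # One list per key, with one entry per cell in the batch.
    return {key: [record[key] for record in records] for key in REQUIRED_KEYS}


records = [
    {"X": [0.0, 3.0, 1.0], "batch": 0, "labels": 0},
    {"X": [2.0, 0.0, 5.0], "batch": 1, "labels": 0},
]
batch = collate_cells(records)
```

Checking the keys up front gives one readable error rather than a series of different exceptions raised deeper in the training code.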

canergen commented 2 months ago

Hi, we are still exchanging ideas with lamin and CZI to improve the implementation (and hopefully work towards support across all models; currently scVI works). Overall, the final requirement will be that a registry is created as a dictionary, similar to https://colab.research.google.com/drive/10sXec_TicMKtLA6hMcgfkado-FgoNKxw#scrollTo=e8vZgceklGdH. We use https://github.com/laminlabs/lamindb/issues/1826 as a discussion channel to work together on a better implementation. Happy to connect offline (ideally on the scverse Zulip) to see how we can support your work.