scverse / scvi-tools

Deep probabilistic analysis of single-cell and spatial omics data
http://scvi-tools.org/
BSD 3-Clause "New" or "Revised" License

n_proteins Parameter in MultiVI Class #2952

Closed GoldenCaterpie closed 2 months ago

GoldenCaterpie commented 2 months ago

Hello! I am currently conducting research on single-cell multimodal data. I accessed the code for the paper "MultiVI: deep generative model for the integration of multimodal data" published on Zenodo. In the file Protein_update_3_TESTING, I found the following code:

# ######################################################################################################################
# TRAIN ALL 3 MODALITIES
adata = anndata.read("dogma_all_genes_cells_dig_ctrl_annotated.h5ad.gz")
adata = adata.copy()
scvi.data.setup_anndata(adata, protein_expression_obsm_key='protein_expression')

n_genes = (adata.var.modality == 'Gene Expression').sum()
n_regions = (adata.var.modality == 'Peaks').sum()
n_proteins = adata.obsm['protein_expression'].shape[1]

os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '1'

mvi = scvi.model.MULTIVI(adata, n_genes=n_genes, n_regions=n_regions, n_proteins=n_proteins)
testing(mvi, save_path="trained_models/Test3Mod_DIGCTRL75b_211020", pdf_path="Test3Mod_DIGCTRL75B_")
# ######################################################################################################################

In this code, I noticed that a parameter called 'n_proteins' is set when creating the MULTIVI model.

However, in version 1.1.6 of scvi-tools, when I input similar code attempting to specify the n_proteins parameter, such as:

model = scvi.model.MULTIVI(
    adata_mvi,
    n_genes=(adata_mvi.var["modality"] == "Gene Expression").sum(),
    n_regions=(adata_mvi.var["modality"] == "Peaks").sum(),
    n_proteins=0,
)
model.view_anndata_setup()

It results in an error: TypeError: MULTIVAE.__init__() got an unexpected keyword argument 'n_proteins'.

Upon inspecting the _multivi.py source file, I indeed found that the class does not have an n_proteins parameter:

"""Integration of multi-modal and single-modality data :cite:p:`AshuachGabitto21`.

    MultiVI is used to integrate multiomic datasets with single-modality (expression
    or accessibility) datasets.

    Parameters
    ----------
    adata
        AnnData object that has been registered via :meth:`~scvi.model.MULTIVI.setup_anndata`.
    n_genes
        The number of gene expression features (genes).
    n_regions
        The number of accessibility features (genomic regions).
    modality_weights
        Weighting scheme across modalities. One of the following:
        * ``"equal"``: Equal weight in each modality
        * ``"universal"``: Learn weights across modalities w_m.
        * ``"cell"``: Learn weights across modalities and cells. w_{m,c}
    modality_penalty
        Training Penalty across modalities. One of the following:
        * ``"Jeffreys"``: Jeffreys penalty to align modalities
        * ``"MMD"``: MMD penalty to align modalities
        * ``"None"``: No penalty
    n_hidden
        Number of nodes per hidden layer. If `None`, defaults to square root
        of number of regions.
    n_latent
        Dimensionality of the latent space. If `None`, defaults to square root
        of `n_hidden`.
    n_layers_encoder
        Number of hidden layers used for encoder NNs.
    n_layers_decoder
        Number of hidden layers used for decoder NNs.
    dropout_rate
        Dropout rate for neural networks.
    model_depth
        Model sequencing depth / library size.
    region_factors
        Include region-specific factors in the model.
    gene_dispersion
        One of the following
        * ``'gene'`` - genes_dispersion parameter of NB is constant per gene across cells
        * ``'gene-batch'`` - genes_dispersion can differ between different batches
        * ``'gene-label'`` - genes_dispersion can differ between different labels
    protein_dispersion
        One of the following
        * ``'protein'`` - protein_dispersion parameter is constant per protein across cells
        * ``'protein-batch'`` - protein_dispersion can differ between different batches NOT TESTED
        * ``'protein-label'`` - protein_dispersion can differ between different labels NOT TESTED
    latent_distribution
        One of
        * ``'normal'`` - Normal distribution
        * ``'ln'`` - Logistic normal distribution (Normal(0, I) transformed by softmax)
    deeply_inject_covariates
        Whether to deeply inject covariates into all layers of the decoder. If False,
        covariates will only be included in the input layer.
    fully_paired
        allows the simplification of the model if the data is fully paired. Currently ignored.
    **model_kwargs
        Keyword args for :class:`~scvi.module.MULTIVAE`
    ...

What’s going on here? Was this parameter removed in a new version of scvi-tools, or is this a BUG? Looking forward to your reply! Thx!

Versions:

scvi-tools 1.1.6

canergen commented 2 months ago

Yes, we changed the MultiVI code after the initial release. The scvi-tools version needed for reproducibility should be specified in the Zenodo archive. @marianogabitto, can you otherwise suggest the correct versions?

GoldenCaterpie commented 2 months ago

Thx. Does this mean that the current version of MultiVI is designed specifically for multiome datasets and no longer supports protein data? If that's not the case, could you plz point me to a corresponding tutorial? 🧐

canergen commented 2 months ago

Hi, you have to add protein_expression_obsm_key to setup_anndata when setting up the model to use the protein data. MultiVI can handle RNA+protein+ATAC and any combination of these. There is no tutorial beyond RNA+ATAC.
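
Since there is no tutorial for the three-modality case, here is a minimal sketch of what the setup could look like, not an official example. It assumes the protein counts live in adata.obsm["protein_expression"] and that adata.var["modality"] uses the labels "Gene Expression" and "Peaks", as in the snippets earlier in this thread. The helper names modality_counts and build_multivi are hypothetical, and the code has not been checked against a specific scvi-tools release.

```python
from collections import Counter


def modality_counts(modality_labels):
    """Count RNA and ATAC features from per-feature modality labels.

    Assumes the labels "Gene Expression" and "Peaks" used earlier in this thread.
    """
    counts = Counter(modality_labels)
    return counts["Gene Expression"], counts["Peaks"]


def build_multivi(adata, protein_key="protein_expression"):
    """Register protein data at setup time, then build the model.

    Hypothetical helper: note that no n_proteins argument is passed
    to the constructor.
    """
    # Imported lazily so modality_counts can be used without scvi-tools installed.
    import scvi

    # Protein counts are declared here, via the obsm key, not in the constructor.
    scvi.model.MULTIVI.setup_anndata(
        adata,
        protein_expression_obsm_key=protein_key,
    )
    n_genes, n_regions = modality_counts(adata.var["modality"])
    return scvi.model.MULTIVI(adata, n_genes=n_genes, n_regions=n_regions)
```

With an AnnData object prepared as in the original post, usage would be model = build_multivi(adata_mvi), followed by model.view_anndata_setup() to confirm that the protein field was registered.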