jaydu1 / scVAEIT

Variational autoencoder for single-cell integration and transfer learning.
MIT License

Integration of CITE-seq and scMultiome #10

Open hoseok-lee opened 1 month ago

hoseok-lee commented 1 month ago

Hi,

I'm currently looking to use scVAEIT to integrate two different datasets, CITE-seq (containing scRNA-seq and protein abundance) and scMultiome (containing scRNA-seq and scATAC-seq). They have mutually exclusive cells, but I was hoping to use the scRNA-seq modality present in both datasets to integrate all three modalities (scRNA-seq, protein abundance, chromatin-accessibility).

All of the tutorials showcase integration for cases where there are at least as many datasets as modalities (for example, the trimodality merge contains DOGMA-seq, ASAP-seq, and CITE-seq), so I was wondering how to configure the model when there are more modalities than datasets. For example, I'm not sure how to set up the batches and id_datasets configuration, or the masks matrix.

Any help is appreciated! Thank you in advance :)

jaydu1 commented 1 month ago

Hi, thanks for your interest. I have added a notebook that uses toy data to illustrate how to prepare the input for scVAEIT. It contains utility functions for concatenating multiple datasets with multiple modalities. Let me know if you have further questions.
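For readers without the notebook at hand, here is a rough NumPy-only sketch of how the CITE-seq + scMultiome case could be laid out. It is not taken from the notebook and does not call scVAEIT. The feature counts, the variable names, and the mask coding (-1 for missing, 0 for observed) are assumptions for illustration, so please check the notebook for the convention the package actually expects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical): features per modality and cells per dataset.
n_rna, n_adt, n_atac = 5, 3, 4
n_cite, n_multi = 6, 4

# CITE-seq measures RNA + protein; scMultiome measures RNA + ATAC.
cite_rna = rng.poisson(2.0, size=(n_cite, n_rna))
cite_adt = rng.poisson(5.0, size=(n_cite, n_adt))
multi_rna = rng.poisson(2.0, size=(n_multi, n_rna))
multi_atac = rng.binomial(1, 0.3, size=(n_multi, n_atac))

# Shared column layout: [RNA | protein | ATAC]. The modality a dataset
# does not measure is zero-padded; the mask marks it as missing.
cite = np.hstack([cite_rna, cite_adt, np.zeros((n_cite, n_atac))])
multi = np.hstack([multi_rna, np.zeros((n_multi, n_adt)), multi_atac])
data = np.vstack([cite, multi]).astype(np.float32)

# Assumed mask coding (verify against the notebook).
MISSING, OBSERVED = -1, 0

# One mask row per dataset: 2 datasets, 3 modalities.
masks = np.full((2, n_rna + n_adt + n_atac), OBSERVED, dtype=np.int8)
masks[0, n_rna + n_adt:] = MISSING        # CITE-seq has no ATAC
masks[1, n_rna:n_rna + n_adt] = MISSING   # scMultiome has no protein

# Dataset label for every cell, in the same row order as `data`.
id_dataset = np.repeat([0, 1], [n_cite, n_multi])
```

The RNA columns are observed in both mask rows, and that shared block is what links the two datasets.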

hoseok-lee commented 2 weeks ago

Hi, thank you for the wonderfully informative tutorial! It has helped me set up my model for training. As a follow-up question, what is the exact effect of the three optional arrays id_datasets, batches_cate, and batches_cont? Although they are optional, I have noticed that they alter the model structure to some degree.

jaydu1 commented 2 weeks ago
  1. id_dataset (paired with mask) is mainly there for computational reasons. Storing the mask as a num_samples-by-num_features array can be memory-consuming, so we provide another option: the user can pass mask as a num_datasets-by-num_features array, with id_dataset indicating which dataset each sample comes from. The per-sample mask can then be recovered with mask[id_dataset].
  2. batches_cate and batches_cont are (optional) covariates/batch effects that we want to adjust for in the latent space. For instance, if data are collected from multiple batches, they may exhibit unwanted variation that is unrelated to the biological signal, and batch correction is needed to remove such differences. In our model, this is done by supplying categorical and continuous covariates to batches_cate and batches_cont, respectively. They are used as 'conditions' so that the latent space z is free of batch effects, similar to a conditional VAE (CVAE).
  3. The adjustments with batches_cate and batches_cont are 'soft' in the sense that they do not guarantee exact merging. When combining data with strong batch effects (e.g., cells of the same cell type in two datasets from different labs), the user can also provide an array of conditions to remove these differences more effectively by additionally minimizing the MMD (Maximum Mean Discrepancy) loss.
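To make point 1 concrete, here is a small NumPy sketch of the mask[id_dataset] lookup. The array sizes and the -1/0 mask coding are assumptions for illustration only. The indexing itself is plain NumPy integer-array indexing.

```python
import numpy as np

num_features = 12

# Compact form: one mask row per dataset (here 2 datasets).
# Assumed coding for illustration: 0 = observed, -1 = missing.
mask = np.zeros((2, num_features), dtype=np.int8)
mask[0, 8:] = -1    # dataset 0 lacks the last 4 features
mask[1, 5:8] = -1   # dataset 1 lacks features 5..7

# id_dataset[i] is the dataset that sample i belongs to.
id_dataset = np.array([0, 0, 0, 1, 1], dtype=np.int32)

# Integer-array indexing selects one mask row per sample, giving the
# full num_samples-by-num_features mask on demand.
full_mask = mask[id_dataset]

# The compact form stores num_datasets rows instead of num_samples rows.
print(mask.nbytes, full_mask.nbytes)
```

With many cells and features, only the small per-dataset array and the integer vector id_dataset need to be kept in memory, and the rows for a minibatch can be looked up the same way.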

Happy to explain more if anything is unclear:)