Open hoseok-lee opened 1 month ago
Hi, thanks for your interest. I have added a Notebook that uses toy data examples to illustrate how to prepare data input for scVAEIT. It contains utility functions to concatenate multiple datasets with multiple modalities. Let me know if you have further questions.
Hi, thank you for the wonderfully informative tutorial! It has helped me set up my model for training.
As a follow-up question, what is the exact effect of the three optional arrays id_datasets, batches_cate, and batches_cont? Although they are optional, I have noticed that they alter the model structure to some degree.
id_dataset (paired with mask) is mainly for computational considerations. Storing the mask as a num_samples-by-num_features array can be memory-consuming. Therefore, we provide another option: the user can pass mask as a num_datasets-by-num_features array, with id_dataset indicating which dataset each sample is from. With this, we can recover the mask for the whole dataset with mask[id_dataset].

batches_cate and batches_cont are (optional) covariates/batch effects that we want to adjust for in the latent space. For instance, if data are collected from multiple batches, they may exhibit unwanted variations that are irrelevant to the biological signals; batch correction is needed to remove such differences. In our model, this can be achieved by supplying categorical and continuous covariates to batches_cate and batches_cont, respectively. They are used as 'conditions' to make sure the latent space z is free of batch effects, similar to CVAE.

batches_cate and batches_cont are 'soft' in the sense that they do not guarantee exact merging. When combining data with strong batch effects (e.g., cells from the same cell types in two datasets from different labs), the user can also provide an array of conditions to remove these differences more effectively, by also minimizing the MMD (Maximum Mean Discrepancy) loss.

Happy to explain more if anything is unclear :)
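
The mask[id_dataset] indexing described above can be illustrated with plain NumPy. This is a minimal sketch with made-up toy sizes; the boolean encoding of the mask is an assumption for illustration only, and the values scVAEIT actually expects should be taken from the tutorial notebook.

```python
import numpy as np

# Hypothetical toy sizes (not taken from the thread).
samples_per_dataset = [3, 2]  # dataset 0 has 3 cells, dataset 1 has 2 cells

# One mask row per dataset (num_datasets-by-num_features).
# Assumption for illustration: True marks a feature as missing in that dataset.
mask = np.array([
    [False, False, False, False, True, True],   # dataset 0
    [False, False, True, True, False, False],   # dataset 1
])

# One entry per cell, giving the row of `mask` that applies to that cell.
id_dataset = np.repeat(np.arange(len(samples_per_dataset)), samples_per_dataset)

# Integer-array indexing expands the compact mask to num_samples-by-num_features.
full_mask = mask[id_dataset]

print(id_dataset)
print(full_mask.shape)
```

The compact form stores one row per dataset instead of one row per cell, which is where the memory saving comes from when the number of cells is large.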
Hi,
I'm currently looking to use scVAEIT to integrate two different datasets, CITE-seq (containing scRNA-seq and protein abundance) and scMultiome (containing scRNA-seq and scATAC-seq). They have mutually exclusive cells, but I was hoping to use the scRNA-seq modality present in both datasets to integrate all three modalities (scRNA-seq, protein abundance, chromatin-accessibility).
As all of the tutorials showcase integration for cases where the number of datasets is equal to or greater than the number of modalities (for example, the trimodality merge contains DOGMA-seq, ASAP-seq, and CITE-seq), I was wondering how I could configure the model for the case where there are more modalities than datasets. For example, I'm not sure how to set up the batches and id_datasets configuration, in addition to the masks matrix.

Any help is appreciated! Thank you in advance :)
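
For reference, the two-dataset, three-modality layout described in this question can be sketched with NumPy. This only illustrates the array shapes involved and is not scVAEIT's confirmed input format: the feature counts, the [RNA | protein | ATAC] column order, the boolean "observed" encoding, and the single-column shape of batches_cate are all assumptions, and the tutorial notebook should be consulted for the values the model actually expects.

```python
import numpy as np

# Hypothetical feature and cell counts, chosen only for illustration.
n_rna, n_adt, n_atac = 4, 2, 3
n_cite, n_multiome = 3, 2

# One row per dataset, columns ordered as [RNA | protein | ATAC] (assumed order).
# Assumption: True means the feature was measured in that dataset.
observed = np.array([
    [True] * n_rna + [True] * n_adt + [False] * n_atac,   # CITE-seq: RNA + protein
    [True] * n_rna + [False] * n_adt + [True] * n_atac,   # scMultiome: RNA + ATAC
])

# Dataset index for each cell: CITE-seq cells first, then scMultiome cells.
id_dataset = np.concatenate([
    np.zeros(n_cite, dtype=int),
    np.ones(n_multiome, dtype=int),
])

# Assumption: the dataset label is reused as a single categorical batch covariate.
batches_cate = id_dataset.reshape(-1, 1)

# Per-cell view of which features were measured.
observed_per_cell = observed[id_dataset]
print(observed_per_cell.shape)
```

In this sketch the RNA columns are observed in both rows, which is the shared modality that links the two datasets; the protein and ATAC columns are each observed in only one row.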