Hi,
Thanks for the question.
For the data, you could follow the instructions in the data folder to get the file `dogma_cite_asap.h5`. Due to the restriction on file sizes on GitHub, unfortunately I cannot upload it. If you can provide an email address, I can share it with you through Google Drive.
In the example, basically, the network operates at two levels of blocks: the modality level, specified by `dim_input_arr` and `uni_block_names` (meaning that they have the same length), and the sub-block level, specified by `dim_block`, `dist_block`, `dim_block_embed`, `dim_block_enc`, `dim_block_dec`, and `block_names` (which likewise share the same length). We explain the parameters below:
`dim_input_arr` represents the sizes of the input features. In the example, it is simply $[n_g, n_a, n_p]$.
`dim_block` represents the numbers of sub-connected features across all modalities (assuming that the features have been rearranged accordingly). In the example, it is $[n_g, n_a, n_p^1, n_p^2, \ldots]$.
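For concreteness, here is a minimal sketch of how these two arrays could be built; the sizes and the `chunk_atac` variable are made up for illustration:

```python
import numpy as np

# Hypothetical sizes standing in for a real dataset:
n_g, n_a = 2000, 200                     # number of genes and ADTs
chunk_atac = np.array([300, 250, 280])   # ATAC peaks per chunk (e.g. per chromosome)

# Modality level: one entry per modality, [n_g, n_a, n_p].
dim_input_arr = np.array([n_g, n_a, int(np.sum(chunk_atac))])

# Sub-block level: ATAC split into per-chunk blocks, [n_g, n_a, n_p^1, n_p^2, ...].
dim_block = np.append([n_g, n_a], chunk_atac)
```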
`dist_block`: There are four distributions implemented: 'NB', 'ZINB', 'Bernoulli', and 'Gaussian', for negative binomial, zero-inflated negative binomial, Bernoulli, and Gaussian, respectively. However, only 'NB' and 'Bernoulli' were tested and used to generate the results in the paper: 'Bernoulli' is used for ATAC-seq data, and 'NB' is used for genes and proteins.
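Continuing the sketch above, the matching `dist_block` assigns one distribution per sub-block, as in the example config:

```python
# One distribution name per sub-block, same length as dim_block:
dist_block = ['NB', 'NB'] + ['Bernoulli' for _ in chunk_atac]
```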
`dim_block_embed` represents the embedding dimension of the binary mask for each block. For example, `dim_block_embed = [1, 2, 3, ...]` means the mask will be embedded into a continuous vector of dimension 1 for block 1, and so on.
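In the example config quoted at the end of this thread, it is set by doubling a base embedding size:

```python
# Mask embedding sizes per sub-block: 32 for genes, 16 for ADTs, 2 per ATAC chunk.
dim_block_embed = np.array([16, 8] + [1 for _ in range(len(chunk_atac))]) * 2
```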
`dim_block_enc` represents the structure of the first latent layer of the encoder. Because it uses skip connections, it helps reduce memory and computational complexity. In the example, `dim_block_enc = np.array([256, 128] + [16 for _ in chunk_atac])` means that the genes will be embedded into a vector of dimension 256, the ADTs into a vector of dimension 128, and so on.
For block $i$, we have a sub-network that takes both the features of size `dim_input_arr[i]` and the mask embedding of size `dim_block_embed[i]`, and outputs a vector of size `dim_block_enc[i]`. After that, the embedding vectors from all blocks are concatenated into a single vector as the input to the encoder.
Similarly, `dim_block_dec` represents the structure of the last latent layer of the decoder. For block $i$, we have a sub-network that takes latent features of size `dim_block_dec[i]` and outputs a vector (the predicted features) of size `dim_input_arr[i]`.
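As a rough shape check (a sketch of the bookkeeping only, not the actual scVAEIT implementation), using the variables from the sketch above:

```python
dim_block_enc = np.array([256, 128] + [16 for _ in chunk_atac])
dim_block_dec = np.array([256, 128] + [16 for _ in chunk_atac])

# Encoder side: block i maps (features + mask embedding) -> dim_block_enc[i];
# decoder side mirrors it, mapping dim_block_dec[i] -> predicted features.
for i in range(len(dim_block)):
    print(f"block {i}: {dim_block[i]} features + {dim_block_embed[i]}-dim mask "
          f"-> {dim_block_enc[i]} (enc); {dim_block_dec[i]} -> {dim_block[i]} (dec)")
```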
`dimensions` and `dim_latent` specify the network structure in the middle. For example, `dimensions = [256, 128]` and `dim_latent = 32` mean that we have a network $n_{in}$-256-128-32-128-256-$n_{out}$, where $n_{in}$ is the sum of `dim_block_enc` and $n_{out}$ is the sum of `dim_block_dec`.
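Putting the pieces together, the layer sizes under these settings can be listed explicitly (continuing the sketch variables from above):

```python
dimensions = [256, 128]
dim_latent = 32

n_in = int(np.sum(dim_block_enc))    # sum of dim_block_enc
n_out = int(np.sum(dim_block_dec))   # sum of dim_block_dec

# n_in - 256 - 128 - 32 - 128 - 256 - n_out
layer_sizes = [n_in] + dimensions + [dim_latent] + dimensions[::-1] + [n_out]
print(layer_sizes)  # [432, 256, 128, 32, 128, 256, 432] with the made-up sizes above
```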
Some of the important hyperparameters are:
`beta_unobs` represents the weight for unobserved features.

`p_feat` represents the probability of masking individual features. A larger value of `p_feat` encourages imputation ability, but also requires more training epochs to achieve good performance. Its influence is not large when training for enough epochs, so we recommend fixing `p_feat` at any reasonable value, e.g. 0.2.

`p_modal` represents the probability of masking out one modality. You can just leave it uniform.
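For instance, a sketch of reasonable values based on the discussion above:

```python
import numpy as np

p_feat = 0.2              # mask 20% of individual features
p_modal = np.ones(3) / 3  # mask each of the three modalities with equal probability
```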
In our experiments, the results were not sensitive to the above parameters, so you can just use reasonable values as in the example. Only the following parameter requires some care depending on your data:

`beta_modal` represents the importance of each modality. Run the model on your dataset for a few epochs, and pick `beta_modal` such that the likelihoods (which will be printed during training) of all modalities are roughly of the same order. Notably, the number of peaks is generally very large, so the ATAC likelihood will have a higher value. That is why you can see it has a small weight, 0.01, in the example, where `beta_modal = [0.14, 0.85, 0.01]`.
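A rough way to carry out this procedure (a sketch of the heuristic described above, not part of the scVAEIT API): read off the per-modality likelihood magnitudes after a few epochs and weight each modality inversely to its scale:

```python
import numpy as np

# Hypothetical likelihood magnitudes printed during training (rna, adt, atac):
lik = np.array([7.0, 1.2, 100.0])

# Weight each modality inversely to its likelihood scale, normalized to sum to 1:
beta_modal = (1 / lik) / np.sum(1 / lik)
print(beta_modal.round(2))  # [0.14 0.85 0.01], matching the example
```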
Let me know if you have any questions.
Hello, thank you for this response. I tried to run `preprocess_data.py`, but it didn't look like the file `DOGMA_pbmc.h5` was in the data folder. Or, I would be grateful if I could have the file `dogma_cite_asap.h5` over Google Drive - my email is esb5324@psu.edu.
I have shared the file. Let me know in case you didn't get it.
Close the issue for now.
Thanks very much!
Hello, thank you for your work with mosaic integration. I was attempting to run the demo, but found the `dogma_cite_asap.h5` file is not in the data folder. Could you please upload it?
I also had a question about the configuration. Could you please explain the parameters listed, particularly `dist_block` (what distributions are available for use?), `dim_block_enc`, and `dim_block_dec`?
Additionally, for the betas, `p_feat`, and `p_modal`, are those parameters that I should adjust for my dataset, and if so, how can I select them?
```python
config = {
    'dim_input_arr': dim_input_arr,
    'dimensions': [256],
    'dim_latent': 32,
    'dim_block': np.append([len(gene_names), len(ADT_names)], chunk_atac),
    'dist_block': ['NB', 'NB'] + ['Bernoulli' for _ in chunk_atac],
    'dim_block_enc': np.array([256, 128] + [16 for _ in chunk_atac]),
    'dim_block_dec': np.array([256, 128] + [16 for _ in chunk_atac]),
    'block_names': np.array(['rna', 'adt'] + ['atac' for _ in range(len(chunk_atac))]),
    'uni_block_names': np.array(['rna', 'adt', 'atac']),
    'dim_block_embed': np.array([16, 8] + [1 for _ in range(len(chunk_atac))]) * 2,
}
```
Thank you!