pinellolab / DNA-Diffusion

🧬 Generative modeling of regulatory DNA sequences with diffusion probabilistic models 💨
https://pinellolab.github.io/DNA-Diffusion/

Training a VQ-VAE for DNA-sequences for stable diffusion #16

Closed lucapinello closed 1 year ago

lucapinello commented 2 years ago

Current notebook here: https://github.com/pinellolab/DNA-Diffusion/blob/latent-space-representation/vq_vae_diffusion.ipynb

lucapinello commented 2 years ago

@sg134 can you write here where you are with this and how people can contribute/help you?

mihirneal commented 2 years ago

@lucapinello Would love to work on this; however, I don't understand why we need to work with a VQ-VAE. Shouldn't we prototype directly with DDPMs?

lucapinello commented 2 years ago

The idea is to derive a good embedding for DNA sequences so that we can later explore stable diffusion (that is, diffusion in a latent space). Right now we are diffusing directly on the one-hot encoding of the DNA sequences.
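For readers new to the thread, here is a minimal sketch of what "one-hot encoding of the DNA sequences" means. It is an illustration only: the function names and the `ACGT` channel order are assumptions, not taken from the repository's code.

```python
# Illustrative sketch only. The channel order "ACGT" and both function
# names are assumptions, not the repository's actual implementation.
NUCLEOTIDES = "ACGT"


def one_hot_encode(sequence):
    """Return a 4 x len(sequence) matrix (list of rows), one row per nucleotide.

    Characters outside ACGT (for example "N") produce an all-zero column.
    """
    seq = sequence.upper()
    return [[1.0 if base == nuc else 0.0 for base in seq] for nuc in NUCLEOTIDES]


def one_hot_decode(matrix):
    """Pick the highest-scoring nucleotide at each position.

    Ties (including all-zero columns) resolve to the first channel, "A".
    """
    length = len(matrix[0]) if matrix else 0
    decoded = []
    for pos in range(length):
        column = [row[pos] for row in matrix]
        decoded.append(NUCLEOTIDES[column.index(max(column))])
    return "".join(decoded)


encoded = one_hot_encode("GATTACA")
print(one_hot_decode(encoded))
```

Diffusing directly on this representation means adding noise to a 4 x L matrix per sequence; a learned embedding would replace it with a smaller latent tensor.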

sg134 commented 2 years ago

Hi, just saw this (sorry). As Luca mentioned, we hope to represent the DNA sequences in a smaller latent space and pursue latent diffusion. To that end, if you have other model suggestions for encoding the sequences into a representation (another VAE variant, for example), feel free to suggest and implement them. We don't know yet whether VQ-VAE is the best model for this dataset; I started with it because it was used in the DALL-E paper. These are some of the next steps currently planned for the VQ-VAE:

  1. Modify the architecture to improve the reconstruction accuracy of nucleotides in the dataset
  2. This is the big one that I've been stuck on: "interpreting" the codebook embeddings. Do the codebook entries capture any information about TF binding motifs or key features that differentiate binding patterns across cell types?
  3. Down the line, we'd also probably need to clean up the code and modify it a bit so that it's easy to combine this code with the diffusion code into a unified pipeline.

@lucapinello @LucasSilvaFerreira Is there anything else to include or clarify?
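As a reference point for steps 1 and 2 above, the sketch below shows the two ideas in their simplest form: the nearest-neighbour codebook lookup at the heart of a VQ-VAE, and a per-nucleotide reconstruction accuracy. It is not the notebook's code; the function names, the toy codebook, and the use of squared Euclidean distance are assumptions made for illustration.

```python
# Illustrative sketch only. Names, the toy codebook, and the distance
# metric are assumptions; the notebook's implementation may differ.


def quantize(latents, codebook):
    """Replace each latent vector with its nearest codebook entry.

    Returns (indices, quantized), where indices[i] is the position in the
    codebook of the entry closest to latents[i] (squared Euclidean distance).
    """
    indices = []
    for z in latents:
        distances = [
            sum((zi - ci) ** 2 for zi, ci in zip(z, code)) for code in codebook
        ]
        indices.append(distances.index(min(distances)))
    quantized = [list(codebook[i]) for i in indices]
    return indices, quantized


def reconstruction_accuracy(original, reconstructed):
    """Fraction of positions where the reconstructed nucleotide matches."""
    if len(original) != len(reconstructed):
        raise ValueError("sequences must have the same length")
    if not original:
        return 0.0
    matches = sum(a == b for a, b in zip(original, reconstructed))
    return matches / len(original)


toy_codebook = [[0.0, 0.0], [1.0, 1.0]]      # two hypothetical 2-d codes
toy_latents = [[0.1, -0.2], [0.9, 1.2]]      # two hypothetical encoder outputs
indices, quantized = quantize(toy_latents, toy_codebook)
print(indices, quantized)
print(reconstruction_accuracy("ACGT", "ACGA"))
```

The `indices` returned by the lookup are the discrete codes one would inspect when asking whether particular codebook entries line up with motifs or cell-type-specific features.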

mihirneal commented 2 years ago

gotcha. I'd like to work on this issue. Can you assign it to me?

LucasSilvaFerreira commented 2 years ago

@mihirneal and @sg134 I would recommend that you guys create a subgroup to explore it together. @sg134 already has some code, and it would be nice if he can guide you through it. I think it will be nice to have a (latent) stable diffusion model working on these sequences.

mateibejan1 commented 2 years ago

@sg134 let me know if I can help with the VQVAE

sg134 commented 2 years ago

Hi @mihirneal & @mateibejan1, is it possible that we can allocate a few minutes during the sprint meeting to discuss the VQ-VAE code and next steps for others who are interested as well?

mihirneal commented 2 years ago

Yeah, that’s what I had in mind as well.

mateibejan1 commented 2 years ago

Sure, I'll put together a meeting plan. We'll start with a retrospective on what was done in sprint 1, then talk about current tasks, and finally cover what we'll do next. Does that work for your schedule, @sg134?

noahweber1 commented 1 year ago

@sg134 please contact me when you see this.

noahweber53@gmail.com

thanks

sg134 commented 1 year ago

@noahweber1 messaged you on Discord.

noahweber1 commented 1 year ago

Summary of what we agreed upon and what are the next steps:

  1. I take over for a couple of weeks until Sameer is back, and then we close the task off.
  2. I refactor and clean up the code.
  3. I squeeze out any improvements in accuracy that I can.
  4. I make any needed adjustments to the architecture.
  5. Explainability of the inference, i.e. checking that the latent representations actually make sense when inspected manually.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.