pinellolab / DNA-Diffusion

🧬 Generative modeling of regulatory DNA sequences with diffusion probabilistic models 💨
https://pinellolab.github.io/DNA-Diffusion/

Consider DNA embedding models to transition to use stable diffusion #37

Closed kingstut closed 1 year ago

kingstut commented 1 year ago

Create a CLIP-like embedding model and tokenizer for our use case that can be hooked up to a Stable Diffusion pipeline.

lucapinello commented 1 year ago

@kingstut, please coordinate with @sg134. He is currently exploring a VQ-VAE for sequence representation that could serve as a key component for stable diffusion.

sg134 commented 1 year ago

Hi @kingstut, we recorded our meeting (the link is in the DNA Diffusion Discord) on current progress and next steps for creating DNA embeddings. We'll coordinate tasks for this part of the project during the next sprint/half-sprint meeting. To briefly summarize, here are some of the key takeaways for next steps from our meeting:

kingstut commented 1 year ago

Hi @sg134 and @lucapinello, thanks for that! What I am working on is slightly different, though. Right now our diffusion models take in only the raw_sequence part of the data and are trained to denoise it. As mentioned in the meeting, the VQ-VAE is an additional part of the model that will make training and inference faster by compressing the representation space. At some point, we want to include other columns of the dataset, e.g. the C1 to C16 values, so that the raw_sequence can be generated conditioned on an input C1-C16 query. Think of this as the same setup as stable diffusion, except that the text prompt is the C1-C16 values and the generated image is the raw sequence.
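To make the prompt/image analogy concrete, here is a minimal sketch of how one dataset row could be split into the part the diffusion model denoises and the part that acts as the prompt. The column names raw_sequence and C1 to C16 come from the dataset as described above; everything else (the `one_hot` and `make_example` helpers, the `ConditionedExample` container, the one-hot encoding itself, and treating unknown bases as all-zero rows) is an assumption for illustration, not the project's actual data loader.

```python
from dataclasses import dataclass
from typing import List

NUCLEOTIDES = "ACGT"


def one_hot(seq: str) -> List[List[float]]:
    """Encode a DNA string as a len(seq) x 4 one-hot matrix (columns follow ACGT order)."""
    index = {base: i for i, base in enumerate(NUCLEOTIDES)}
    encoded = []
    for base in seq.upper():
        row = [0.0] * len(NUCLEOTIDES)
        if base in index:  # assumption: unknown bases such as N stay all-zero
            row[index[base]] = 1.0
        encoded.append(row)
    return encoded


@dataclass
class ConditionedExample:
    sequence: List[List[float]]  # the "image": what the diffusion model denoises
    condition: List[float]       # the "prompt": one value per component C1..C16


def make_example(row: dict, n_components: int = 16) -> ConditionedExample:
    """Split a dataset row (hypothetically a dict) into sequence and condition."""
    condition = [float(row[f"C{i}"]) for i in range(1, n_components + 1)]
    return ConditionedExample(one_hot(row["raw_sequence"]), condition)


# Hypothetical row: a short sequence whose only active component is C3.
row = {"raw_sequence": "ACGTN", **{f"C{i}": 0.0 for i in range(1, 17)}}
row["C3"] = 1.0
example = make_example(row)
```

With a layout like this, the unconditional model only ever sees `example.sequence`, while a conditional model would additionally receive `example.condition` (or an embedding of it) at every denoising step.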

Stable diffusion uses CLIP (https://openai.com/blog/clip/), a model that learns the similarity between textual and visual information. We would need a similar embedding model that learns the similarity between the component information and the raw sequence. This model would then be plugged into the diffusion model so that we can generate raw sequences from our queries.
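As a rough illustration of what "learning similarity" could mean here, the sketch below computes a CLIP-style symmetric contrastive loss over a batch of paired embeddings: row i of the sequence embeddings should match row i of the component embeddings and nothing else. It is written in plain Python for readability; a real version would use a tensor library and learned encoders. The function names, the toy embeddings, and the temperature value of 0.07 are all illustrative assumptions, not something taken from this repository.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits: List[Vector]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def clip_style_loss(seq_emb: List[Vector], comp_emb: List[Vector],
                    temperature: float = 0.07) -> float:
    """Symmetric contrastive loss between paired sequence and component embeddings."""
    seq = [normalize(v) for v in seq_emb]
    comp = [normalize(v) for v in comp_emb]
    # logits[i][j]: scaled cosine similarity of sequence i and component vector j
    logits = [[sum(a * b for a, b in zip(s, c)) / temperature for c in comp]
              for s in seq]
    transposed = [list(col) for col in zip(*logits)]
    # Average the sequence->component and component->sequence directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


# Toy embeddings: correctly paired batches should score a lower loss than mismatched ones.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Training the two encoders to minimize a loss of this kind would pull each raw sequence's embedding toward the embedding of its own C1-C16 values, which is the property the diffusion model would rely on when conditioning on a query.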

I can give more information and update on my progress in the next meeting :)

lucapinello commented 1 year ago

Thanks a lot for the explanation. This is exciting and I am looking forward to learning more about your progress on this!

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.