hyn2028 / llm-cxr

Official code for "LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation"
https://arxiv.org/abs/2305.11490
Apache License 2.0

Generalize the image domain, and generate multi-latent images. #2

Closed QLaHPD closed 5 months ago

QLaHPD commented 1 year ago

First of all, congratulations on this work. I believe this network has much higher zero-shot learning potential than the diffusion-based ones.

Now for my question: what would it take to generalize this network so that it generates images from any domain? Would the process be the same as for CXR? And for the network to generate multiple latent spaces (e.g., VQ-VAE-2), would it be simple to align them in the dataset so that it generates both in sequence?

one-june commented 1 year ago

Hello! Thank you for your comment :D I am one of the authors of the paper, and I actually have these same questions as well! As for your first question, I suspect the approach we used here could be applied to other domains such as natural images. We chose CXRs because 1) they seemed to have the most direct use case, and 2) the MIMIC dataset was readily available and includes a free-text report describing each image.

But more importantly, I think your second question is very interesting. Here, the alignment between image and text arises simply from training bidirectionally (i.e., generating images given some input text and generating text given some input image). To be honest, we don't fully understand what mechanisms in the latent space lead to this result. I am definitely curious to investigate, and changes to the latent space (such as using VQ-VAE-2) should give us more clues; I think only further experimentation will give us better answers :)
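For anyone curious what "training bidirectionally" means concretely: each (image, report) pair can be turned into two instruction-tuning examples, one per direction, with the image represented as a sequence of discrete latent codes. This is only a rough sketch of that idea; the prompt wording, the `<img_N>` token format, and the `make_bidirectional_examples` helper are illustrative assumptions, not the repo's actual data pipeline.

```python
# Sketch of bidirectional instruction-pair construction (assumed format,
# not the exact one used in LLM-CXR). Each (image, report) pair yields
# two examples: text -> image tokens, and image tokens -> text.

def make_bidirectional_examples(image_tokens, report):
    """Build one text->image and one image->text training example.

    image_tokens: discrete latent codes for one image (e.g., from a VQ model).
    report: free-text report describing that image.
    """
    # Serialize image codes as special tokens the LLM's vocabulary would include.
    img_str = " ".join(f"<img_{t}>" for t in image_tokens)

    # Direction 1: generate the image token sequence given the report.
    text_to_image = {
        "instruction": "Generate a CXR image for the following report.",
        "input": report,
        "output": img_str,
    }
    # Direction 2: generate the report given the image token sequence.
    image_to_text = {
        "instruction": "Describe the findings in this CXR image.",
        "input": img_str,
        "output": report,
    }
    return [text_to_image, image_to_text]

# One image (made-up codes) and its report produce two training examples.
examples = make_bidirectional_examples([12, 407, 33],
                                       "No acute cardiopulmonary process.")
```

Extending this to multiple latent spaces (your VQ-VAE-2 question) would presumably mean emitting two code sequences per image, in a fixed order, within the same output string.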