hyn2028 / llm-cxr

Official code for "LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation"
https://arxiv.org/abs/2305.11490
Apache License 2.0

Generalize the image domain, and generate multi-latent images. #2

Closed QLaHPD closed 5 months ago

QLaHPD commented 1 year ago

First of all, congratulations on this work. I believe this network has much higher zero-shot learning potential than the diffusion-based ones.

Now for my question: what would it take to generalize this network so that it generates images from any domain? Would the process be the same as for CXR? And for the network to generate multiple latent spaces (e.g., VQ-VAE-2), would it be simple to align them in the dataset so that it generates both in sequence?

one-june commented 1 year ago

Hello! Thank you for your comment :D I am one of the authors of the paper, and I actually have these same questions as well! As for your first question, I suspect the approach we used here could be applied to other domains such as natural images. We chose CXRs because 1) they seemed to have the most direct use case, and 2) the MIMIC dataset was readily available and includes a free-text report describing each image.

But more importantly, I think your second question is very interesting. Here, the alignment between image and text arises simply from training bidirectionally (i.e., generating images given some input text and generating text given some input image). To be honest, we don't fully understand what mechanisms in the latent space lead to this result. I am definitely curious to investigate, and changes to the latent space (such as using VQ-VAE-2) should give us more clues; I think only further experimentation will give us better answers :)
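For anyone curious what "training bidirectionally" means concretely: each (image, report) pair can be turned into two instruction-tuning examples, one per direction, with the image represented as a sequence of discrete latent codes. This is only a rough sketch of that idea; the prompt wording, the `<img_N>` token format, and the `make_bidirectional_examples` helper are illustrative assumptions, not the repo's actual data pipeline.

```python
# Sketch of bidirectional instruction-pair construction (assumed format,
# not the exact one used in LLM-CXR). Each (image, report) pair yields
# two examples: text -> image tokens, and image tokens -> text.

def make_bidirectional_examples(image_tokens, report):
    """Build one text->image and one image->text training example.

    image_tokens: discrete latent codes for one image (e.g., from a VQ model).
    report: free-text report describing that image.
    """
    # Serialize image codes as special tokens the LLM's vocabulary would include.
    img_str = " ".join(f"<img_{t}>" for t in image_tokens)

    # Direction 1: generate the image token sequence given the report.
    text_to_image = {
        "instruction": "Generate a CXR image for the following report.",
        "input": report,
        "output": img_str,
    }
    # Direction 2: generate the report given the image token sequence.
    image_to_text = {
        "instruction": "Describe the findings in this CXR image.",
        "input": img_str,
        "output": report,
    }
    return [text_to_image, image_to_text]

# One image (made-up codes) and its report produce two training examples.
examples = make_bidirectional_examples([12, 407, 33],
                                       "No acute cardiopulmonary process.")
```

Extending this to multiple latent spaces (your VQ-VAE-2 question) would presumably mean emitting two code sequences per image, in a fixed order, within the same output string.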