Transforming a graphical user interface screenshot created by a designer into computer code is a typical task conducted by a developer in order to build customized software, websites and mobile applications. In this paper, we show that Deep Learning techniques can be leveraged to automatically generate code given a graphical user interface screenshot as input. Our model is able to generate code targeting three different platforms (i.e. iOS, Android and web-based technologies) from a single input image with over 77% accuracy.
The task decomposes into three sub-problems:
Vision: a computer vision problem of understanding the given scene and inferring the objects present, along with their identities, positions, and poses.
Language: a language modeling problem of understanding computer code and generating syntactically and semantically correct samples.
Combination: using the solutions to both previous sub-problems, exploiting the latent variables inferred from scene understanding to generate corresponding textual descriptions (i.e. code) of the objects represented by these variables.
They also introduce a Domain Specific Language (DSL) to describe GUIs, which keeps the vocabulary small and makes the modeling tractable.
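For intuition, here is a hypothetical snippet in the spirit of the paper's DSL, together with a naive tokenization into the small vocabulary the language model consumes. The token names and the tokenizer are illustrative assumptions, not the paper's actual grammar.

```python
# Illustrative only: a made-up DSL snippet describing a simple GUI layout.
dsl_sample = """
header { btn-active, btn-inactive }
row {
  single { small-title, text, btn-green }
}
"""

# Naive tokenization: pad punctuation with spaces, then split on whitespace,
# and map each distinct token to an integer id.
tokens = (dsl_sample.replace(",", " , ")
                    .replace("{", " { ")
                    .replace("}", " } ")
                    .split())
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[t] for t in tokens]

print(tokens)     # ['header', '{', 'btn-active', ',', 'btn-inactive', '}', ...]
print(token_ids)  # the integer sequence the language model is trained on
```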
Architecture:
Vision model: a standard AlexNet-like CNN that encodes the input screenshot into a fixed-length feature vector
Language model: the DSL tokens are one-hot encoded over the DSL vocabulary and fed into an LSTM
Combined model: the image feature vector and the language encoding are concatenated and fed into a second LSTM that decodes the next DSL token (see the sketch below).
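A minimal PyTorch sketch of this CNN + dual-LSTM architecture is below. The layer widths, vocabulary size, and class names are illustrative assumptions, not the paper's exact hyperparameters (the paper implements it in Keras and uses a 48-token sliding context window).

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 20    # assumed size of the DSL vocabulary
CONTEXT_LEN = 48   # sliding window of previous tokens

class VisionEncoder(nn.Module):
    """CNN that maps a 256x256 RGB screenshot to a fixed-length vector."""
    def __init__(self, out_dim=1024):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(128 * 32 * 32, out_dim)

    def forward(self, img):                  # img: (B, 3, 256, 256)
        h = self.features(img).flatten(1)    # (B, 128*32*32)
        return torch.relu(self.fc(h))        # (B, 1024)

class Pix2CodeSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = VisionEncoder()
        # Language encoder: one-hot DSL tokens -> LSTM
        self.lang_lstm = nn.LSTM(VOCAB_SIZE, 128, num_layers=2, batch_first=True)
        # Decoder: concatenated image + language features -> next-token logits
        self.decoder = nn.LSTM(1024 + 128, 512, num_layers=2, batch_first=True)
        self.out = nn.Linear(512, VOCAB_SIZE)

    def forward(self, img, tokens):
        # tokens: (B, CONTEXT_LEN) integer ids -> one-hot (B, T, VOCAB_SIZE)
        onehot = nn.functional.one_hot(tokens, VOCAB_SIZE).float()
        lang, _ = self.lang_lstm(onehot)               # (B, T, 128)
        img_feat = self.vision(img)                    # (B, 1024)
        # Repeat the image encoding at every timestep and fuse with language
        img_rep = img_feat.unsqueeze(1).expand(-1, tokens.size(1), -1)
        fused, _ = self.decoder(torch.cat([img_rep, lang], dim=-1))
        return self.out(fused[:, -1])                  # logits for next token

model = Pix2CodeSketch()
logits = model(torch.randn(1, 3, 256, 256),
               torch.zeros(1, CONTEXT_LEN, dtype=torch.long))
```

At inference time, code is generated autoregressively: the predicted token is appended to the context window and the model is queried again until an end token is produced.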
Results:
Clearly not ready for any serious use yet, but the results are promising!
https://arxiv.org/pdf/1705.07962.pdf