After our attempt at forking llm-api, we still need to handle the cases for integrating vision models with the core browse loop.
The problem with llm-api is that the entire library is built around user messages being plain strings; adding image data introduces a sub-type where a message can be either a string or an array of content parts, and that turned into a type mess to untangle.
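Roughly, the shape of the problem looks like the sketch below. These are illustrative types, not llm-api's actual definitions: once the message type becomes a union, every call site that assumed a string has to narrow it first.

```typescript
// Sketch of the type problem (hypothetical names, not llm-api's API).
// The existing code assumes a user message is always a string:
type ContentPart =
  | { type: "text"; text: string }
  | { type: "image_url"; image_url: { url: string } };

// Vision support forces the message into a union of string or parts:
type UserMessage = string | ContentPart[];

// Every helper that used to do plain string work now has to branch:
function messageLength(msg: UserMessage): number {
  if (typeof msg === "string") {
    return msg.length;
  }
  // Array form: count only the text parts.
  return msg
    .filter(
      (part): part is Extract<ContentPart, { type: "text" }> =>
        part.type === "text"
    )
    .reduce((len, part) => len + part.text.length, 0);
}
```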