FR: allow multimodal input / vision / images

It would be simple to detect paths or URLs to images in the prompt text and replace them with the corresponding image input in the API call.
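To make the idea concrete, here is a rough sketch of the path/URL replacement, written in Python for illustration only (the plugin itself is Lua). The regex, the function names `split_prompt`, `image_url_for` and `to_content_parts` are all made up for this example. The `image_url` content-part shape follows the OpenAI chat completions vision format; whether ChatGPT.nvim would build its requests this way is an assumption.

```python
import base64
import mimetypes
import os
import re

# Hypothetical pattern: an http(s) URL or an absolute / home-relative path
# that starts at a word boundary and ends in a common image extension.
IMAGE_REF = re.compile(
    r"(?<!\S)(?:https?://|~?/)\S+?\.(?:png|jpe?g|gif|webp)\b",
    re.IGNORECASE,
)


def split_prompt(prompt):
    """Return (text, refs): the prompt without image references, and the references in order."""
    refs = IMAGE_REF.findall(prompt)
    text = " ".join(IMAGE_REF.sub("", prompt).split())
    return text, refs


def image_url_for(ref):
    """Pass URLs through unchanged; read local files and encode them as a base64 data URL."""
    if ref.lower().startswith(("http://", "https://")):
        return ref
    path = os.path.expanduser(ref)
    mime = mimetypes.guess_type(path)[0] or "application/octet-stream"
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{data}"


def to_content_parts(prompt):
    """Build a list of content parts: one text part, then one image part per reference."""
    text, refs = split_prompt(prompt)
    parts = [{"type": "text", "text": text}]
    for ref in refs:
        parts.append({"type": "image_url", "image_url": {"url": image_url_for(ref)}})
    return parts
```

With this, `to_content_parts("Describe https://example.com/cat.jpg please")` would yield a text part followed by one image part, and a local path would be read from disk and inlined. Paths containing spaces are not handled by this pattern.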

I could then, for example, add a shortcut that saves an image from my clipboard to /tmp and inserts its path into the prompt automatically.
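The clipboard shortcut could look roughly like the sketch below, again in Python purely for illustration. It assumes Linux/X11 with `xclip` installed; other platforms would need a different clipboard tool. The function names `read_clipboard_png` and `paste_image_into_prompt` are hypothetical, and the clipboard reader is passed in as a parameter so it can be swapped out.

```python
import os
import subprocess
import tempfile


def read_clipboard_png():
    """Read PNG bytes from the clipboard. Assumes Linux/X11 with xclip available."""
    result = subprocess.run(
        ["xclip", "-selection", "clipboard", "-t", "image/png", "-o"],
        capture_output=True,
        check=True,  # raise if xclip fails, e.g. when the clipboard holds no image
    )
    return result.stdout


def paste_image_into_prompt(prompt, read_image=read_clipboard_png, directory=None):
    """Save the clipboard image to a temp file and append its path to the prompt.

    Returns (new_prompt, path). With directory=None the system temp dir
    (typically /tmp) is used.
    """
    data = read_image()
    fd, path = tempfile.mkstemp(suffix=".png", dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return f"{prompt} {path}".strip(), path
```

Bound to a key, this would turn a prompt like `What's in this image?` into the same prompt followed by a freshly written path under the temp directory, which the path detection could then pick up.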

See the kind of workflow implemented in ollama:

What's in this image? /Users/jmorgan/Desktop/smile.png
The image features a yellow smiley face, which is likely the central focus of the picture.

Somewhat related to:

Edit: Oh I see that there's already partial support there: https://github.com/jackMort/ChatGPT.nvim/pull/332

It should be:

jackMort / ChatGPT.nvim