karthink / gptel

A simple LLM client for Emacs
GNU General Public License v3.0
1.04k stars 111 forks

Mechanism for working with vision-based models #244

Closed · daedsidog closed this issue 3 months ago

daedsidog commented 3 months ago

A lot of times I need to pass information to ChatGPT that I can't copy, such as a snippet from an old, scanned document or formatted mathematics.

Right now I have a manual process where I query a visual model (through a website) so that it tells me what it sees. E.g., when I paste it a snippet of mathematics, it will give me LaTeX code, which I then pass down to ChatGPT.

Would be very nice to have something like this.

karthink commented 3 months ago

I forget, do you use org-mode or markdown with gptel?


daedsidog commented 3 months ago

I used to use Org mode but I switched to Markdown because I was tired of gptel sometimes doing weird things (like removing underscores). I think it's fixed in the latest version, but I haven't switched back.

Why is that relevant, though?

karthink commented 3 months ago

> I used to use Org mode but I switched to Markdown because I was tired of gptel sometimes doing weird things (like removing underscores). I think it's fixed in the latest version, but I haven't switched back.

It should be fixed now, yeah.

> Why is that relevant, though?

It's easier to support vision models in Org mode. That said, please see the discussion in #231.

daedsidog commented 3 months ago

> It's easier to support vision models in Org mode. That said, please see the discussion in #231.

Interesting.

I honestly think the most power from gptel comes from just the abstraction layer it provides when interacting with various models. I, for one, have completely eliminated the process of manually typing code by implementing context generation. Below is a demonstration of me constructing a context buffer, then using gptel's replace-in-place with it via a keypress. It works exceedingly well. This is kind of its own separate thing from gptel, but I was wondering if the scope of gptel should include this sort of thing.

[demo attachment: contexter]

What I want now is just a way, totally unrelated to Org mode or Markdown, to "send" ChatGPT queries with images (i.e., send the image currently saved on the clipboard) and get the response in place.

karthink commented 3 months ago

> I honestly think the most power from gptel comes from just the abstraction layer it provides when interacting with various models

I see.

> Below is a demonstration

Sorry, I had trouble following your demo. My best guess is that the buffer on the right is sent as the context (or system message), and you're asking it to do something with those functions.

> What I want now is just a way, totally unrelated to Org mode or Markdown, to "send" ChatGPT queries with images (i.e., send the image currently saved on the clipboard) and get the response in place.

I'm not sure gptel is set up to do that -- it's a very buffer-oriented system. At minimum it would need to distinguish between text as text and text that represents a file path, and act on the file instead. A common way to do this would be to define a gptel-send-image command, but I'm not interested in growing gptel's command surface area.

Basically, handling images is not ruled out, but right now I don't know the best way of doing so that conforms to a simple mental model like the chat usage does.
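
For reference, vision-capable chat endpoints such as OpenAI's accept mixed text/image content parts, with images typically inlined as base64 data URLs. A rough Emacs Lisp sketch of building such a payload -- the function name is illustrative and this is not part of gptel; the model name and MIME type are assumptions:

```elisp
(require 'json)

(defun my/vision-payload (prompt image-file)
  "Build an OpenAI-style chat payload pairing PROMPT with IMAGE-FILE.
The image is inlined as a base64 data URL, as vision endpoints
expect.  Illustrative sketch only; not part of gptel."
  (let* ((data (with-temp-buffer
                 (set-buffer-multibyte nil)
                 (insert-file-contents-literally image-file)
                 (base64-encode-string (buffer-string) t)))
         (url (concat "data:image/png;base64," data)))
    (json-encode
     `((model . "gpt-4o")
       (messages
        . [((role . "user")
            (content
             . [((type . "text") (text . ,prompt))
                ((type . "image_url")
                 (image_url . ((url . ,url))))]))])))))
```

A clipboard variant could first fetch the raw image with something like `(gui-get-selection 'CLIPBOARD 'image/png)`, where the window system supports that data type.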

karthink commented 3 months ago

> I was wondering if the scope of gptel should include this sort of thing.

I'm interested to understand what you mean here -- I just had trouble following the demo.

daedsidog commented 3 months ago

> I was wondering if the scope of gptel should include this sort of thing.
>
> I'm interested to understand what you mean here -- I just had trouble following the demo.

My apologies, my explanation was terrible.

The demo showcases a way to mark areas in different buffers and aggregate them into their own dedicated buffer. That buffer can then be copied and handed to gptel as context. This is much easier than manually copy-pasting sections of context into the dedicated chat buffer or an external ChatGPT website, and it has the added bonus of minimizing the context by collapsing code that doesn't contribute to it.

You can manually remove context snippets from the context buffer.

In a nutshell, it's a glorified yanker, but I found it incredibly useful.
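
The "glorified yanker" described above can be sketched in a few lines of Emacs Lisp. The names below are illustrative, not the actual contexter implementation:

```elisp
(defvar my/context-buffer "*llm-context*"
  "Buffer that accumulates context snippets for the LLM.")

(defun my/add-region-to-context (beg end)
  "Append the active region to the shared context buffer.
Each snippet is labeled with its source buffer, so snippets can
be located and removed by hand later.  Illustrative sketch only."
  (interactive "r")
  (let ((snippet (buffer-substring-no-properties beg end))
        (source (buffer-name)))
    (with-current-buffer (get-buffer-create my/context-buffer)
      (goto-char (point-max))
      (insert (format ";; from %s\n%s\n\n" source snippet)))))
```

Binding this to a key and then yanking the whole `*llm-context*` buffer into a chat buffer reproduces the workflow the demo shows.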

> I'm interested to understand what you mean here

I am wondering whether you would be open to this being integrated into gptel, or whether it should remain its own separate package. It's pretty useless outside of gptel, though.

karthink commented 3 months ago

> I am wondering whether you would be open to this being integrated into gptel, or whether it should remain its own separate package. It's pretty useless outside of gptel, though.

I like the idea! I'll have to think about how to integrate it into gptel though. Right now the best idea I have is "Add an option to the transient menu to append a selected region to the system prompt". This won't work well across buffers since each buffer has its own system prompt.

You've developed a more sophisticated UI for this style of usage; it's interesting.
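
That transient-menu idea could be prototyped as a simple command. This sketch assumes gptel keeps the system message in the buffer-local variable `gptel--system-message` (verify against the gptel version in use; the command name is made up):

```elisp
(defun my/gptel-add-region-to-system (beg end)
  "Append the active region to this buffer's system prompt.
Assumes gptel stores the system message in the buffer-local
variable `gptel--system-message'; check the current gptel
source before relying on this.  Illustrative sketch only."
  (interactive "r")
  (setq-local gptel--system-message
              (concat gptel--system-message
                      "\n\n"
                      (buffer-substring-no-properties beg end))))
```

Because the variable is buffer-local, each chat buffer would accumulate its own context, which is exactly the cross-buffer limitation mentioned above.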

karthink commented 3 months ago

Converting to a discussion since there's nothing to fix in gptel right now.