karthink / gptel

A simple LLM client for Emacs
GNU General Public License v3.0
1.36k stars 136 forks source link

Select a PDF file as context #434

Open WeissP opened 2 days ago

WeissP commented 2 days ago

Right now, one can use the GPT-4o API to upload a PDF file and ask questions about it. Here's an example of how to do this with Python.

It would be great if gptel could allow users to select a PDF as context and have the AI explain a selected region in pdf-view-mode, similar to what gptel-quick does.

Is it currently possible? if not, what functionalities need to be implemented? I am happy to do some contributions. If you believe this isn't the direction for this package, I can create a new package instead, and it would be really helpful if you could give me some guidance on how to go about that or point me to which part of the gptel code I should check out.

karthink commented 2 days ago

There are three versions of this feature:

  1. Select some text in a pdf-view buffer and add it to gptel's context, as text. This is easy to add to gptel. Since you've seen the implementation in gptel-quick, we can add it the same way to gptel.

  2. Send the current PDF view (i.e. current page), but as an image to a model that supports images. Also easy to add to gptel.

  3. Use OpenAI assistants API to set up a session and include files. OpenAI will then use these files as part of a RAG pipeline. This is what the Python code in your example does. I think this is out of scope for gptel. However, I have plans to make it easy to set up RAG pipelines with gptel. It will probably be an add-on package, and will support fully local RAG, along with the ability to plug in other RAG approaches like those provided by OpenAI and Gemini. This is a pretty extensive project though, and I don't have the time to work on it for a while. This package will do quite a bit more than what you're looking for, but let me know if you're interested in authoring it nevertheless.

If you want to add 1 or 2 to gptel, PRs are welcome. To begin with I'd read through the file gptel-context.el, focusing on the functions gptel-add, gptel-context--collect and the variable gptel-context--alist, which holds the context chunks.


As a side note, you can already select a PDF file as context if you use the Gemini models. However, this is not RAG -- the entire PDF file is parsed with each request, so this is probably best used for one-off requests or very short conversations.

WeissP commented 2 days ago

Oh what I want is the third option since I mostly need AI to help me understand academic papers and I have a lot of questions. Sadly, I don't think I have enough time to write and to maintain a standalone package if it is the same level as gptel. But do you already have any thoughts about the third option? I might start with some simple code that meets my personal needs and see if it can grow into a package later.

karthink commented 1 day ago

For the third option, I think you'll need a different tool. I haven't kept up with the state of things, but perhaps something like Khoj? There are many more like it, I think.

If the tool provides an HTTP API, it might be possible to continue to use gptel in Emacs to interact with it.

WeissP commented 1 day ago

I looked into Khoj, and it seems like they just convert PDFs into text without using any assistants or sessions.

If the tool provides an HTTP API, it might be possible to continue to use gptel in Emacs to interact with it.

What if I set up a small server to manage PDF files and sessions, and then I use gptel to communicate with that server?

WeissP commented 20 hours ago

By the way, it seems like private-gpt also supports PDF uploading (I haven't yet check whether it uses session and RAG under the hood). Since gptel can interact with private-gpt, I am curious whether it is possible to ask questions regarding PDF files via private-gpt and gptel?