jupyterlab / jupyter-ai

A generative AI extension for JupyterLab
https://jupyter-ai.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
3.19k stars 320 forks source link

`@file` throws error for PDF files #1044

Open srdas opened 1 day ago

srdas commented 1 day ago

The new feature @file throws an error when a PDF file is passed as context.

@file:GitHub/RAG_Docs/MertonHBR.pdf What does this file pertain to?

The error arises as the @file command does not handle PDF files (as the encoding needs special handling).

Traceback (most recent call last):
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/chat_handlers/base.py", line 226, in on_message
    await self.process_message(message)
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/chat_handlers/default.py", line 64, in process_message
    context_prompt = await self.make_context_prompt(message)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/chat_handlers/default.py", line 75, in make_context_prompt
    await asyncio.gather(
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/context_providers/base.py", line 159, in make_context_prompt
    return await self._make_context_prompt(message, commands)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/context_providers/file.py", line 61, in _make_context_prompt
    if (context := self._make_command_context(i))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/sanjivda/GitHub/jupyter-ai/packages/jupyter-ai/jupyter_ai/context_providers/file.py", line 90, in _make_command_context
    content = f.read()
              ^^^^^^^^
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Suggested fixes:

  1. Provide a graceful error message that PDF files are not handled.
  2. Extend the command to also handle PDF files by processing them into text and then passing them as context.
dlqqq commented 1 day ago

More generally, we need a way to not allow @file to be called on binary blob files.

michaelchia commented 1 day ago

I was a bit lazy with this and thought i was being conservative by only supporting what was in jupyter_ai.document_loaders.directory.SUPPORTED_EXTS.

https://github.com/jupyterlab/jupyter-ai/blob/9630742af21d51ddad6d66e5250415f778dac61b/packages/jupyter-ai/jupyter_ai/context_providers/file.py#L83-L87

I kind of assumed it was only some subset of text-based files and didn't notice .pdf was part of the list. So binary blobs in general should already be blocked.

If were to have a more comprehensive list, should it cover all text-based files or only code related ones? Like .log or .csv files may be very long and may accidentally blowup a token budget. Should it be up to the user to manage this risk themselves? or should we do a size check?

These were some questions I left to be solved in a future PR.

srdas commented 1 day ago

@michaelchia - Thanks for responding so quickly!

  1. I think it should cover all text files, not just code related ones. In fact, I have been using plain text files with the @file command as much of the /learn I do is for single files. As LLM context windows have grown, users are exploiting the longer context windows and @file wonderfully makes this seamless; I'd say users are pretty aware of the cost issues now. However, the idea of a size check is a good one, so if the number of input tokens crosses a limit, say 2K, then pop up a warning and ask to proceed.
  2. As of now, CSV files are not supported.
  3. Maybe update the code to check for files that are not encoded as plain text? This would trap PDF files. But if we want to handle PDFs, it would need a pdf2txt conversion to be added, which can be done separately.
michaelchia commented 1 day ago

Personally, I don't have any strong opinions whichever way on this. I'll leave it up to you guys to decide what should be supported.

dlqqq commented 22 hours ago

Relying on file extensions is not a very reliable method of determining a file's type; see #1030.

I can help offer guidance on a plan for improving file compatibility in @file and /learn more generally, while still allowing extra enhancements for special files on a best-effort basis.

  1. Clearly define what files can be included as context via @file or embedded via /learn, without relying on the extension in the filename. How else can we rigorously, programmatically define what a plaintext file / how we determine a file to be plaintext?
  2. (optional) Determine if it's possible to distinguish between "readable" plaintext (e.g. a documentation page in Markdown) and "unreadable" plaintext (e.g. a PGP private key) files.
  3. Modify @file and /learn to behave like this: if a file is not readable plaintext, try to coerce it to a readable plaintext file on a best-effort basis, based on the file extension / MIME type.
  4. Modify /learn to ignore files that are not readable plaintext and cannot be coerced to readable plaintext, instead of relying on a file extension allowlist.