Open srdas opened 1 day ago
More generally, we need a way to not allow @file to be called on binary blob files.
I was a bit lazy with this and thought I was being conservative by only supporting what was in jupyter_ai.document_loaders.directory.SUPPORTED_EXTS.
I kind of assumed it was only some subset of text-based files and didn't notice .pdf was part of the list. So binary blobs in general should already be blocked.
If we were to have a more comprehensive list, should it cover all text-based files or only code-related ones? Files like .log or .csv may be very long and may accidentally blow up a token budget. Should it be up to the user to manage this risk themselves, or should we do a size check?
These were some questions I left to be solved in a future PR.
@michaelchia - Thanks for responding so quickly!
@file command as much of the /learn I do is for single files. As LLM context windows have grown, users are exploiting the longer context windows and @file wonderfully makes this seamless; I'd say users are pretty aware of the cost issues now. However, the idea of a size check is a good one, so if the number of input tokens crosses a limit, say 2K, then pop up a warning and ask to proceed. Personally, I don't have any strong opinions whichever way on this. I'll leave it up to you guys to decide what should be supported.
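To make the size-check idea concrete, here is a minimal Python sketch. It is not how @file works today. The names estimate_tokens and exceeds_token_limit are hypothetical, the 4-characters-per-token ratio is only a rough assumption (a real tokenizer for the chosen model would be more accurate), and the 2000 limit simply mirrors the "say 2K" figure above.

```python
from pathlib import Path

# Assumption: roughly 4 characters per token; real tokenizers differ by model.
CHARS_PER_TOKEN = 4
# Assumption: mirrors the "say 2K" warning limit floated in this thread.
TOKEN_WARNING_LIMIT = 2000


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate based on character count."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def exceeds_token_limit(path: Path, limit: int = TOKEN_WARNING_LIMIT) -> bool:
    """Return True if the file's estimated token count is above the limit."""
    # errors="replace" keeps the estimate from failing on stray bytes.
    text = path.read_text(encoding="utf-8", errors="replace")
    return estimate_tokens(text) > limit
```

A caller could use the boolean to show a warning and ask the user whether to proceed, rather than rejecting the file outright.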
Relying on file extensions is not a very reliable method of determining a file's type; see #1030.
I can help offer guidance on a plan for improving file compatibility in @file and /learn more generally, while still allowing extra enhancements for special files on a best-effort basis.
@file or embedded via /learn, without relying on the extension in the filename. How else can we rigorously, programmatically define what a plaintext file is / how we determine a file to be plaintext?
@file and /learn to behave like this: if a file is not readable plaintext, try to coerce it to a readable plaintext file on a best-effort basis, based on the file extension / MIME type.
/learn to ignore files that are not readable plaintext and cannot be coerced to readable plaintext, instead of relying on a file extension allowlist.
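One possible way to decide whether a file is plaintext without looking at its extension is to sniff its content. The Python sketch below is a heuristic under stated assumptions, not a rigorous definition and not existing jupyter_ai code: is_probably_plaintext and filter_plaintext are hypothetical names, the 8 KiB sample size is arbitrary, and only UTF-8 is treated as readable, so UTF-16 text would be rejected because it contains NUL bytes.

```python
import codecs
from pathlib import Path
from typing import Iterable, List

# Assumption: inspecting the first 8 KiB is enough to classify a file.
SNIFF_BYTES = 8192


def is_probably_plaintext(path: Path, sniff_bytes: int = SNIFF_BYTES) -> bool:
    """Heuristically check whether the start of a file is valid UTF-8 text."""
    with open(path, "rb") as f:
        chunk = f.read(sniff_bytes)
    # NUL bytes are common in binary formats and rare in text files.
    if b"\x00" in chunk:
        return False
    # An incremental decoder with final=False tolerates a multi-byte
    # character that happens to be cut off at the end of the sample.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(chunk, final=False)
    except UnicodeDecodeError:
        return False
    return True


def filter_plaintext(paths: Iterable[Path]) -> List[Path]:
    """Keep only the paths that look like readable plaintext."""
    return [p for p in paths if is_probably_plaintext(p)]
```

Files that fail this check could then be handed to a best-effort converter chosen by extension or MIME type, and skipped if no converter applies.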
The new feature @file throws an error when a PDF file is passed as context. The error arises as the @file command does not handle PDF files (as the encoding needs special handling).
Suggested fixes: