nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License

[Feature] Process local files without localdocs #2186

Open mkammes opened 5 months ago

mkammes commented 5 months ago

Feature Request - use documents without localdoc processing

One use case: extracting docx content to JSON, either to clean data for fine-tuning models or to prepare it for LocalDocs. This feature would require access to the raw file, not the output of the LocalDocs indexing process. I find that PDF and docx extraction has limitations when done through LocalDocs, so I'd like to clean the data myself and put it into a custom JSON schema.
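As a rough illustration of the cleaning step described above, here is a minimal sketch that takes already-extracted paragraphs and wraps them in a simple custom JSON schema. The schema (`source`/`text` fields) and the `release_notes.docx` filename are hypothetical, not anything GPT4All defines:

```python
import json
import re


def clean_text(raw: str) -> str:
    """Collapse runs of whitespace and strip leading/trailing space."""
    return re.sub(r"\s+", " ", raw).strip()


def to_finetune_records(paragraphs, source_name):
    """Wrap cleaned, non-empty paragraphs in a hypothetical JSON schema."""
    return [
        {"source": source_name, "text": clean_text(p)}
        for p in paragraphs
        if clean_text(p)
    ]


# Example: paragraphs as they might come out of a docx extractor.
paragraphs = ["What's new in  v2.1:\n", "", "Fixed crash on startup."]
records = to_finetune_records(paragraphs, "release_notes.docx")
print(json.dumps(records, indent=2))
```

The actual docx-to-text step would still need a real extractor; the point is only that cleaning and re-shaping is straightforward once the raw file is accessible.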

cebtenzzre commented 5 months ago

Most of the local LLMs you can currently use in GPT4All have a maximum context length of 4096 tokens - feed them any more data, and information from the beginning of the document will be lost. Are you working with fairly small documents (under a few thousand words), or do you e.g. have a lot of VRAM and intend to use a model finetuned on very long contexts?
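A quick back-of-envelope check for the context-length concern above: English text averages very roughly 1.3 tokens per word, though the exact ratio depends on the model's tokenizer. A sketch, with the 1.3 figure and the helper name as assumptions:

```python
def fits_context(text: str, context_tokens: int = 4096,
                 tokens_per_word: float = 1.3) -> bool:
    """Rough heuristic: estimate tokens from word count (assumed ~1.3/word)
    and compare against the model's context window."""
    estimated = int(len(text.split()) * tokens_per_word)
    return estimated <= context_tokens


# A 1,000-word release note fits comfortably; a 4,000-word manual does not.
print(fits_context("word " * 1000))  # True
print(fits_context("word " * 4000))  # False
```

For a real answer you'd tokenize with the model's own tokenizer, but this is enough to see that documents beyond a few thousand words will overflow a 4096-token window.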

mkammes commented 5 months ago

Mainly short documents. My use case is software manuals with accompanying PDF/text update docs (what's new, etc.). While the manuals themselves aren't small, the update release docs are. Local processing is done on a 4070.