nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License
70.79k stars · 7.71k forks

[Feature] LocalDocs support for CSV, JSON, XML #2059

Open tbennett6421 opened 8 months ago

tbennett6421 commented 8 months ago

Feature Request

Please add to the roadmap for gpt4all-localdocs the ability to parse CSV, JSON, and XML files. LLM models are prone to making things up, so I intended to use LocalDocs to provide databases of concrete items. Most of these datasets come in CSV, JSON, or XML format.

LocalDocs currently supports plain text files (.txt, .md, and .rst) and PDF files (.pdf).

Example use cases:

manyoso commented 8 months ago

Localdocs currently does not have any support for custom file parsing though this would be a nice addition.

mishaxz commented 8 months ago

I concur; right now you have to rename your .csv files to .txt.

btw, does anyone know what the fastest models are for this kind of thing? I'm using Nous Hermes 2 Mistral DPO right now on the renamed .txt CSV file, but it is kind of slow.

tbennett6421 commented 8 months ago

What would it take to implement some kind of parser in LocalDocs? I'd be willing to look at doing a PR for it, either in Python or in C?

tbennett6421 commented 8 months ago

#1344 could help address one of those points above:

dumping databases into a folder, requesting experimental data such as (mw, mp/fp, solubility)

specifically when GPT hallucinates or makes up empirically measured data.

cebtenzzre commented 8 months ago

> Localdocs currently does not have any support for custom file parsing though this would be a nice addition.

Since these are plain text formats, a minimum effort implementation would be to just add these formats back to the whitelist. At the time I removed them, I wanted to start with a clean slate because there were a lot of formats in that list that even if they worked, didn't seem like anyone would be using them.

Although I don't think it makes sense to use the LocalDocs feature as-is to process structured input, since it breaks documents into chunks and destroys the global structure, it clearly worked well enough for a few people in the past. A slightly more useful implementation would, e.g., keep the header for snippets of CSV, and keep the outer structure for XML and JSON.
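The header-preserving idea for CSV could be sketched roughly like this (a minimal illustration of the approach, not GPT4All's actual implementation; the function name and chunk size are made up):

```python
import csv
import io


def chunk_csv_with_header(path, rows_per_chunk=50):
    """Split a CSV file into text chunks, repeating the header row
    at the top of every chunk so each snippet stays self-describing
    when it is embedded and retrieved independently."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    chunks = []
    for i in range(0, len(rows), rows_per_chunk):
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(header)  # every chunk keeps the header row
        writer.writerows(rows[i:i + rows_per_chunk])
        chunks.append(buf.getvalue())
    return chunks
```

The same principle would apply to XML and JSON, where each chunk would carry a copy of the enclosing element or object keys instead of a header row.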

spiralofhope commented 2 months ago

I see both csv and xml listed in database.cpp.

So if I understand this feature request, it is asking for proper support for these files as structured data formats, and not just as a dump of plain-text words. Is that right?

tbennett6421 commented 2 months ago

CSV hasn't been the worst. For instance, I've dropped in big CSV files without headers and it does a good enough job getting what I mean out of them. My big issue is consuming JSON and XML structured input.

Generally I write something to consume data from some endpoint/URL as JSON. Then I painstakingly convert the results to Markdown or plain text before adding them to my LocalDocs collections. In other words, I have to write unique "data-massaging" scripts to support data-collection efforts. If I don't do that beforehand, I get giant database files, which impacts the performance of GPT4All.

That's my specific use case.
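The "data-massaging" step described above might look something like this (a hypothetical sketch; the field names and sample data are invented, and GPT4All itself does none of this today):

```python
import json


def records_to_markdown(records):
    """Flatten a list of JSON objects into a Markdown table so a
    tool that only understands plain text (like LocalDocs) can
    chunk it without losing the column structure."""
    if not records:
        return ""
    headers = list(records[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for rec in records:
        lines.append(
            "| " + " | ".join(str(rec.get(h, "")) for h in headers) + " |"
        )
    return "\n".join(lines)


# Example: a JSON payload as it might come back from some endpoint
raw = '[{"compound": "water", "mw": 18.02}, {"compound": "ethanol", "mw": 46.07}]'
print(records_to_markdown(json.loads(raw)))
```

Native JSON/XML support in LocalDocs would make this pre-processing step unnecessary.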