Open · tbennett6421 opened 8 months ago
Feature Request

Please add to the roadmap for gpt4all-localdocs the ability to parse csv, json, and xml files. LLM models are prone to making garbage up, so I intend to use LocalDocs to provide databases of concrete items. Most of this data will come as CSV, JSON, or XML.

LocalDocs currently supports plain text files (.txt, .md, and .rst) and PDF files (.pdf).

Example use cases:
- dumping databases into a folder and requesting experimental data such as mw, mp/fp, and solubility
- specifically, when GPT hallucinates or makes up empirically measured data

Localdocs currently does not have any support for custom file parsing, though this would be a nice addition.
I concur; right now you have to rename your .csv files to .txt.

Btw, does anyone know what the fastest models are for this kind of thing? I'm using Nous Hermes 2 Mistral DPO right now on the renamed CSV file, but it is kind of slow.
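For anyone who wants to automate that workaround in the meantime, here is a minimal Python sketch; the source and collection paths are placeholders for your own setup:

```python
# Copy .csv files into a LocalDocs collection folder with a .txt extension
# so the current whitelist will index them. Paths are placeholders.
import shutil
from pathlib import Path

src = Path("~/data/csv").expanduser()               # where the .csv files live
dst = Path("~/localdocs/collection").expanduser()   # a LocalDocs collection folder
dst.mkdir(parents=True, exist_ok=True)

for csv_file in src.glob("*.csv"):
    # Copy rather than rename so the originals stay intact.
    shutil.copy(csv_file, dst / (csv_file.stem + ".txt"))
```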
What would it take to implement some kind of parser in LocalDocs? I'd be willing to look at doing a PR for it, either in Python or in C.
> Localdocs currently does not have any support for custom file parsing, though this would be a nice addition.
Since these are plain text formats, a minimum-effort implementation would be to just add them back to the whitelist. At the time I removed them, I wanted to start with a clean slate, because the list contained a lot of formats that, even if they worked, didn't seem like anything anyone would actually use.

Although I don't think it makes sense to use the LocalDocs feature as-is to process structured input, since it breaks the input into chunks and destroys the global structure, it clearly worked well enough for a few people in the past. A slightly more useful implementation would, e.g., keep the header for snippets of CSV, and keep the outer structure for XML and JSON.
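To make the header-preserving idea concrete, here is a rough Python sketch of chunking a CSV while repeating the header row in every snippet. This is not the actual LocalDocs chunker (that logic lives in C++ in database.cpp), and chunk_rows is an arbitrary illustrative parameter:

```python
# Sketch: split a CSV into snippets, repeating the header row in each so
# every chunk stays interpretable on its own. Not the actual LocalDocs
# chunker; chunk_rows is arbitrary.
import csv

def chunk_csv(path, chunk_rows=50):
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)        # assume the first row is the header
        rows = list(reader)
    chunks = []
    for i in range(0, len(rows), chunk_rows):
        block = [header] + rows[i:i + chunk_rows]
        # Naive re-join; fields containing commas would need csv.writer.
        chunks.append("\n".join(",".join(row) for row in block))
    return chunks
```

The same principle would apply to JSON and XML: carry the enclosing keys or tags into each snippet so the global structure survives chunking.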
I see both csv and xml listed in database.cpp.
So if I understand this feature request, it is asking for proper support for the data as a data format and not just as a dump of plain text words. Is that right?
CSV hasn't been the worst. For instance, I've dropped big CSV files without headers and it does a good enough job getting what I mean out of them. My big issue is consuming JSON and XML structured input.

Generally I write something to consume data from some endpoint/URL as JSON, then painstakingly convert the results to Markdown or plain text before adding them to my LocalDocs collections. I have to write unique "data-massaging" scripts to support each data collection effort. If I don't do that beforehand, I get giant database files, which impacts the performance of GPT4All.

That's my specific use case.
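As an illustration of such a data-massaging step, here is a minimal Python sketch; the endpoint URL is hypothetical and the response is assumed to be a flat JSON array of objects:

```python
# Sketch of a "data-massaging" script: fetch JSON from an endpoint and
# flatten it into plain-text lines that chunk cleanly in LocalDocs.
# The URL is hypothetical; records are assumed to be a list of flat objects.
import json
import urllib.request

url = "https://example.com/api/records"  # hypothetical endpoint
with urllib.request.urlopen(url) as resp:
    records = json.load(resp)

with open("records.txt", "w") as out:
    for rec in records:
        # One self-contained line per record, so any chunk boundary
        # still leaves complete facts behind.
        out.write("; ".join(f"{key}: {value}" for key, value in rec.items()) + "\n")
```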