[Closed] Rockavik closed this issue 3 months ago
I am having the same issue
Same here. From the documentation: Enabling LocalDocs
Install the latest version of GPT4All Chat from [GPT4All Website](https://gpt4all.io/).
Go to Settings > LocalDocs tab.
Configure a collection (folder) on your computer that contains the files your LLM should have access to. You can alter the contents of the folder/directory at any time. As you add more files to your collection, your LLM will dynamically be able to access them.
Spin up a chat session with any LLM (including external ones like ChatGPT but warning data will leave your machine!)
At the top right, click the database icon and select which collection you want your LLM to know about during your chat session.
That's what I did, but it won't answer with the knowledge I gave it (it answers in general, but I don't know whether it hallucinates those answers or what; they're just not what's in LocalDocs at all). What am I missing?
Same here. I tried macOS and Windows :-(
Same here, Windows version. At least I got one random answer from my docs one time; then GPT4All stopped using the documents.
gpt4all 2.4.19 could be better, but works for me.
Used the Mini Orca (small) language model. Had two documents in my LocalDocs. The first document was my curriculum vitae. The second document was a job offer. The model was able to use text from these documents as context and write a cover letter for a job application. This was the first time I tried this model and I am not an expert at using the proper syntax, so it took me two hours (!) of conversing for it to finally come up with a cover letter that could serve as a crude template for my real cover letter. I found that my instructions had to be very precise, otherwise it would only give very generic responses that had nothing to do with what I asked it to do, and it very often came up with excuses asking me for more context and details. It helped to rhetorically "attach" the CV to the sender (applicant) and the job offer to the receiver (company).
Here is my final prompt (real names redacted):
"Write a cover letter one DIN A4 page long in the name of sender to receiver. This cover letter is part of a job application for the position of XXX. For the cover letter refer to experiences mentioned in sender's CV. Keep in mind YYY."
I had this issue on macOS. Ultimately I uninstalled the program (using AppCleaner) and removed the files in /Library/Application Support/nomic.ai (but not the model files) to give me a clean slate. Now it's working again.
I am seeing this as well, but am unsure if the indexing takes time, or is silent. Is there a way to test this by eliminating all knowledge other than local docs (sort of a crude RAG)?
I experimented some more:
If the following comment from June 2023 on Discord (https://discord.com/channels/1076964370942267462/1096503917262667827/1119879022290141224) is still correct, then content from LocalDocs will only be used if the snippet has at least three words in common with the prompt. o.O
Localdocs parameters
document-snippet-size N: Make every snippet N characters in length.
document-snippet-size makes every snippet extracted from the documents N characters long. Setting it to something large (e.g. 512) increases the likelihood that a snippet shares at least 3 words with the prompt, which in turn makes it more likely to be used as context provided to the model during generation. But it will also massively slow down generation, as the model will need to chew through many more tokens of context before it can start to generate.
document-snippets-to-process-per-prompt N: Use only the N best snippets as the context provided to the model.
document-snippets-to-process-per-prompt determines how many of the best-matching snippets are added as context. Snippets are ranked by how many exact-word matches they share with the prompt (case-insensitive: e.g. "now" and "NoW" match, but "Dog" and "Dogs" do not). If no snippet has at least 3 matching words with the prompt (a threshold hard-coded into the app), the prompt is sent straight to the model without any snippets as context.
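The matching behavior described above can be sketched in a few lines of Python. To be clear, this is a simplified illustration of the heuristic as described in this thread, not the actual GPT4All implementation; the function names and tokenization are made up.

```python
import re

MIN_MATCHING_WORDS = 3  # the hard-coded threshold described above


def word_set(text: str) -> set[str]:
    # Case-insensitive, exact whole-word tokens:
    # "now"/"NoW" match, "Dog"/"Dogs" do not.
    return set(re.findall(r"\w+", text.lower()))


def rank_snippets(prompt: str, snippets: list[str], n: int) -> list[str]:
    """Return up to n snippets sharing at least MIN_MATCHING_WORDS words
    with the prompt, best matches first. An empty result means the prompt
    goes to the model with no LocalDocs context at all."""
    prompt_words = word_set(prompt)
    scored = [(len(prompt_words & word_set(s)), s) for s in snippets]
    eligible = [(score, s) for score, s in scored if score >= MIN_MATCHING_WORDS]
    eligible.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in eligible[:n]]
```

This makes the failure mode in this thread concrete: a short prompt like "What is the meaning of RAG?" only overlaps with snippets that happen to repeat several of its exact words, which is why larger snippet sizes (more words per snippet) raise the hit rate.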
@Rockavik can you still reproduce this if you create a character sheet that mentions every piece of information three times?
First, thanks to ThiloteE for the great tips. Second, I think most of the "it's not working for me" comments above fall under the "wait for indexing to finish" category. Third, and this is my issue: how long does indexing take for you guys? My folder has 100+ GB of PDFs. Overnight, i.e. in about 12 hours, it got to less than 10%, while using 20% of my MacBook Pro's M2 Pro the entire time. I am now trying with a subset of about 10 GB. I am wondering if this is expected behaviour (the docs somewhere say to wait a minute or two for indexing)? Should I try a clean install?
Thanks in advance for your input. Ben
This is expected. Adding a large collection to the index will take a long time. I would suggest experimenting with a single-digit number of documents first, before you add a larger collection.
Just thought to mention a few things re: indexing large data folders:
1. Dividing folders into 1 GB chunks = ca. 1.5x speed
2. Extracting txt from the PDFs = ca. 2x speed
3. Throwing it at a Mac Studio w/ M2 Max instead of a MacBook Pro w/ M2 Pro = ca. 5x speed
Bottom line: with a good desktop, 1000 txt files (125 MB) will index in 2-3 days. Cheers, Ben
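The folder-splitting tip above is easy to automate. Here is a stdlib-only sketch; the helper name, the chunk naming scheme, and the 1 GB default are my own choices, and it copies rather than moves so the originals stay untouched:

```python
import shutil
from pathlib import Path


def split_into_chunks(src: str, dst: str, max_bytes: int = 10**9) -> int:
    """Copy files from src into dst/chunk_000, dst/chunk_001, ... so that
    each chunk folder stays under roughly max_bytes (default ~1 GB).
    Returns the number of chunk folders created."""
    files = sorted(p for p in Path(src).iterdir() if p.is_file())
    if not files:
        return 0
    chunk_idx, chunk_size = 0, 0
    for f in files:
        size = f.stat().st_size
        # Start a new chunk when adding this file would exceed the budget
        if chunk_size and chunk_size + size > max_bytes:
            chunk_idx, chunk_size = chunk_idx + 1, 0
        out = Path(dst) / f"chunk_{chunk_idx:03d}"
        out.mkdir(parents=True, exist_ok=True)
        shutil.copy2(f, out / f.name)
        chunk_size += size
    return chunk_idx + 1
```

Each resulting chunk folder can then be added to LocalDocs as its own collection.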
There is definitely room for improving the indexing speeds - if it could run without a model loaded we would be able to use the GPU and greatly increase performance. Though I don't think our Vulkan backend currently supports Bert, so there is some work to be done there.
The database wouldn't be created in my case. I had the directory chosen in the LocalDocs setting. I also tried removing the "file" part at the beginning of the directory link. I tried a single text file with a single line in the directory.
Nothing makes anything show up in the Local Documents database. Restarting GPT4All resets the LocalDocs collections in settings to blank.
GPT4All 2.5.4
Nothing makes anything show up in the Local Documents database.
Try asking on our Discord. You may be missing a step.
I did miss a step. A really freaking obvious one too, ha.
So I configured the collection, but I never hit the button ADD on the side - because in my brain I was only testing with that one collection. Click ADD and now it's in there.
It's indexing it now.
Very much a PEBKAC issue.
Maybe GPT4All has difficulty analyzing PDF files when using LocalDocs. Here is the test file: https://arxiv.org/abs/2005.11401
Base model: Mistral OpenOrca. Question: What is the meaning of RAG?
Exp1: (Without Using Local File)
Response1:
The term "RAG" typically refers to a system used for project management, called RAG Status. It stands for Red, Amber, Green (sometimes also known as Traffic Light System). This method uses colors to represent different levels of progress or status in a project, with each color representing a specific meaning:
1. Red (R): Indicates that the task is not on track and requires immediate attention or action. It may be due to delays, issues, or other problems that need resolution.
2. Amber (A): Represents tasks that are currently at risk or potentially off-track but still within a manageable range. These tasks require monitoring and might need additional resources or adjustments in the project plan.
3. Green (G): Signifies tasks that are on track, meeting expectations, and progressing as planned. This color indicates that no immediate action is needed for these tasks, but they should continue to be monitored.
The RAG Status system helps project managers and team members quickly understand the status of various tasks or projects at a glance, enabling better decision-making and more effective resource allocation.
Exp2: (Using Local File)
Response2:
RAG stands for Reasoning with Adaptive Knowledge Graph. It's an AI model designed to reason about information using a knowledge graph and adapt it based on new data. This allows it to generate answers or responses that are more contextually relevant, as opposed to simply searching through a predefined set of documents like traditional search engines do.
---
1. Date: 2023, December 17. In Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf. Page 2. Context
2. Date: 2023, December 17. In Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf. Page 5. Context
3. Date: 2023, December 17. In Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf. Page 10. Context
Apparently, GPT4All didn't get the meaning of RAG from this file.
Exp3: (Using Local File + .txt file)
In the local folder, I added another file, RAG.txt, with the following contents:
RAG denotes: Retrival-Augmented Generation
Then we asked the same question and got the following answer:
RAG stands for "Retrieval-Augmented Generation." It's a model that combines retrieval and generation capabilities, allowing it to access external knowledge from text sources while generating responses. This approach provides more control and interpretability, making it suitable for various scenarios with direct benefits to society.
---
1. Date: 2023, December 17. In Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf. Page 2. Context
2. Date: 2023, December 17. In Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.pdf. Page 10. Context
3. Date: 2023, December 17. In RAG.txt. Context
So, maybe there is something wrong with analyzing PDF files. Hope the above experiments help 😄
Can you specify where the LocalDocs db is created, so that it can be in a separate directory from the location of the LLM GGUF that was used?
Had the same problem before; I realized that the download path folder also contains other data, including cache. NOTE: This was on my Windows PC.
I deleted all files including embeddings_v0.dat, which solved the indexing and embedding issue. Perhaps you can just delete the embeddings_vX.dat file, which should solve it.
I would recommend adding an embeddings deletion function that forces the current embeddings file to be deleted.
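Several workarounds in this thread boil down to deleting the cached index files so GPT4All rebuilds them on the next start. As a sketch, assuming the file names mentioned here (embeddings_v0.dat and localdocs_v1.db; they may differ between versions) and that GPT4All is fully quit first:

```python
from pathlib import Path


def reset_localdocs_index(appdir: str) -> list[str]:
    """Delete GPT4All's LocalDocs cache files so they are rebuilt on the
    next start. `appdir` is the application data directory -- e.g. the
    "/Library/Application Support/nomic.ai" location mentioned earlier in
    this thread for macOS; adjust for your OS and install.
    Returns the names of the files that were actually removed."""
    removed = []
    for name in ("embeddings_v0.dat", "localdocs_v1.db"):
        target = Path(appdir) / name
        if target.exists():
            # Note: deleting localdocs_v1.db also resets your collections.
            target.unlink()
            removed.append(name)
    return removed
```

This is just the manual deletion steps from the comments above wrapped in a function, not an official tool; back up the directory first if unsure.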
Part of the issue might be a change in GPT4All behavior. In the past, when you created a LocalDocs db, it showed the db and had a bar that showed embedding progress. That bar is gone and the directory does not appear. Is this a bug, a change in behavior, or something else?
@jafrank88 Have you clicked the "Add" button? In GPT4All 2.7.0, the progress bar (indexing) now shows if you click on the LocalDocs button in the chat GUI. I don't think you see it in the settings GUI.
It worked once I deleted the localdocs_v1.db as well as the embeddings_v0.dat
Also relevant: https://github.com/nomic-ai/gpt4all/issues/1958
Throwing my hat into the ring: GPT4All 2.7.1 LocalDocs produces hallucinated responses when using GPT4All Falcon. The same prompt works fine with mistral-7b-openorca.Q4_0.
This issue is no longer relevant; it concerns a very old version of LocalDocs that has since been replaced by v3.0.
System Info
GPT4All v2.4.19
Anaconda3
Python 3.10.9
Reproduction
1. Installed GPT4All
2. Downloaded GPT4All Falcon
3. Set up a directory folder called Local_Docs
4. Created CharacterProfile.txt with information regarding a character; saved it in the Local_Docs folder
5. In GPT4All, clicked on Settings > Plugins > LocalDocs Plugin
6. Added the folder path
7. Created a collection named Local_Docs
8. Clicked Add
9. Clicked the collections icon on the main screen next to the wifi icon; ticked Local_Docs
10. Talked to GPT4All about material in Local_Docs
11. GPT4All does not respond with any material or reference to what's in Local_Docs/CharacterProfile.txt
Expected behavior
GPT4All should respond with references to the information inside the Local_Docs/CharacterProfile.txt file.