nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License
69.66k stars, 7.62k forks

Collections that are missing embeddings can get stuck that way until an explicit re-index #2273

Closed: 3Simplex closed this issue 2 months ago

3Simplex commented 5 months ago

Bug Report

Collections created before the update to 2.7.4 do not work after the update. Only collections created in 2.7.4 work.

Steps to Reproduce

  1. Create a collection in version 2.7.3 or older.
  2. Update to 2.7.4.
  3. Start a new chat.
  4. Select LocalDocs collections that were made before the update.
  5. Reference all collections in chat to pull context from each collection. (Screenshot 2024-04-26 172417)
  6. LocalDocs will not find contents in collections made before the update to 2.7.4. (Screenshot 2024-04-26 172912)
  7. Create and add a new collection using 2.7.4.
  8. Include the new collection in the selected LocalDocs collections.
  9. Reference all collections in chat to pull context from each collection. (Screenshot 2024-04-26 164521)
  10. LocalDocs will find contents in newly created collections only. (Screenshot 2024-04-26 163344)

Expected Behavior

All collections were expected to function as usual.

Your Environment

SINAPSA-IC commented 5 months ago

I second this.

The program starts searching the selected collections...

I tried this with 4 collections, so that I could spot the fraction-of-a-second text message "searching in localdocs:..."

...but it immediately switches to the default "generating response..." and "processing..." without actually parsing the collections. They are mentioned at the beginning, but never really used (redundant here, but this is the idea :) )

cebtenzzre commented 5 months ago

I am able to reproduce this issue using a copy of some of 3Simplex's collections. It seems like the embeddings are missing for certain documents, due to the process getting interrupted somehow. These documents would have been re-indexed on every launch in previous versions of GPT4All because their modification timestamp did not match the database. Now they are only re-indexed the first time GPT4All v2.7.4 is started. If that re-index did not succeed, the collections will stay broken until they are re-indexed again (e.g. by changing the document snippet size) and the re-index completes successfully.

We need to implement a way to know whether embeddings have been generated for a chunk so the program can continue where it left off.
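
To make the idea concrete, here is a minimal sketch of per-chunk tracking, not GPT4All's actual code. It assumes a hypothetical SQLite table `chunks` with a `has_embedding` flag; the table layout and the `embed` and `store` callables are all made up for illustration. Each chunk is marked only after its embedding has been stored, so an interrupted run leaves the remaining chunks visibly pending.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open a database with a hypothetical chunks table (not the real LocalDocs schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        " id INTEGER PRIMARY KEY,"
        " text TEXT NOT NULL,"
        " has_embedding INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def add_chunks(conn, texts):
    """Insert chunks; they start out without an embedding."""
    conn.executemany("INSERT INTO chunks (text) VALUES (?)", [(t,) for t in texts])
    conn.commit()


def pending_chunks(conn):
    """Return (id, text) for every chunk that still lacks an embedding."""
    return conn.execute(
        "SELECT id, text FROM chunks WHERE has_embedding = 0 ORDER BY id"
    ).fetchall()


def resume_embedding(conn, embed, store):
    """Embed only the pending chunks and return how many were processed.

    The flag is set and committed after each chunk is stored, so a crash
    part-way through loses at most the chunk that was in progress.
    """
    done = 0
    for chunk_id, text in pending_chunks(conn):
        store(chunk_id, embed(text))
        conn.execute("UPDATE chunks SET has_embedding = 1 WHERE id = ?", (chunk_id,))
        conn.commit()
        done += 1
    return done
```

With something like this, calling `resume_embedding` on every launch would be cheap when nothing is pending, and would pick up where an interrupted run stopped instead of relying on a single re-index at upgrade time.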

SINAPSA-IC commented 5 months ago

I have also done as 3Simplex said and changed the contents of a folder that is used as a collection. Here's what I've done:

I did this with 3 distinct files in 3 distinct folders/categories. The result was the same: those collections were reindexed.

However, the issue of existing collections being reindexed is still here. I see several collections being indexed again immediately after program start. They were created even before 2.7.3 (I can't remember whether it was 2.6.1 or a 2.7.x) and had stayed that way since then...

Edit :) - cebtenzzre's explanation clarifies why this would happen. Indeed, a flag or something would be handy, like how Windows knows that it didn't shut down properly :)
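
A rough sketch of what such a flag could look like, purely as an illustration and not how GPT4All does it: a marker file is created before indexing starts and removed only after it finishes. The file name `indexing.in_progress` and both function names are hypothetical.

```python
from pathlib import Path

MARKER_NAME = "indexing.in_progress"  # hypothetical file name, for illustration only


def run_indexing(data_dir, index_fn):
    """Run index_fn with a marker file present for the duration of the run.

    If index_fn raises (or the process is killed), the marker is left
    behind, which records that indexing did not finish cleanly.
    """
    marker = Path(data_dir) / MARKER_NAME
    marker.touch()
    index_fn()
    marker.unlink()


def needs_reindex(data_dir):
    """Return True if a previous indexing run did not complete."""
    return (Path(data_dir) / MARKER_NAME).exists()
```

On startup the program could check `needs_reindex` and schedule another pass if it returns `True`. This only says that a run was interrupted, not which documents were affected, so it is coarser than tracking embeddings per chunk.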

cebtenzzre commented 2 months ago

Should be fixed as of #2396 (aside from #2591, which is a related but distinct issue).