nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License

[Feature] v3.0.0 - No visual progress in Embedding a large (+6 million words) LocalDocs collection #2579

Open SINAPSA-IC opened 4 months ago

SINAPSA-IC commented 4 months ago

Feature Request

Updating an existing LocalDocs collection made of 35 PDF files containing 6+ million words, after three hours I am still waiting for the progress indicator to advance.

Also, there is no "Stop/Pause" button: once we click Update, the process runs until we end the program. Such a button would come in handy when we want to pause the Embedding of one or more collections in order to give processor time to the Embedding of another.

Suggestion: For such large collections, where the Embedding procedure takes a long time, the integer in the progress indicator (0%... 5%... 16%... and so on) should be changed to a decimal number like 0.123%... 15.678%..., so the user knows the Embedding is running, very slowly, but running nevertheless. Watching the same integer for hours on end, the user cannot tell whether the Embedding is still running or the process has stopped altogether (visually, the text is unchanged for hours).

[Screenshot: img30_ui5]
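
A minimal sketch of the suggested formatting, assuming the UI has the progress fraction available as a double (the helper name here is hypothetical, not the actual GPT4All code):

// Hypothetical helper: format progress with three decimals so that
// very slow but live progress is still visible to the user.
#include <QString>

QString progressLabel(double fractionDone) // fractionDone in [0.0, 1.0]
{
    return QString::number(fractionDone * 100.0, 'f', 3) + QLatin1String("%");
}
// progressLabel(0.0012345) yields "0.123%"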

victorem1 commented 4 months ago

I had the same issue with two book PDFs of about 200,000 words: embedding was stuck at 0% and there wasn't much CPU usage either. Trying again with a mere text file of 6 words gives the same result.

[Screenshot]

Running GPT4All in the terminal gave these relevant lines:

[Warning] (Mon Jul 8 12:00:28 2024): embllm WARNING: Local embedding model not found
[Warning] (Mon Jul 8 12:00:28 2024): WARNING: Could not load model for embeddings

The error seems to be fired from here: https://github.com/nomic-ai/gpt4all/blob/11b58a1a157e0af7c42061ad6f93459807ecde59/gpt4all-chat/embllm.cpp#L87

static const QString LOCAL_EMBEDDING_MODEL = u"nomic-embed-text-v1.5.f16.gguf"_s;

...

#ifdef Q_OS_DARWIN
    static const QString embPathFmt = u"%1/../Resources/%2"_s;
#else
    static const QString embPathFmt = u"%1/../resources/%2"_s;
#endif

    QString filePath = embPathFmt.arg(QCoreApplication::applicationDirPath(), LOCAL_EMBEDDING_MODEL);
    if (!QFileInfo::exists(filePath)) {
        qWarning() << "embllm WARNING: Local embedding model not found";
        return false;
    }

The file can't be found. That could be because the file is literally missing, because the ifdef points to the wrong resources folder, because QCoreApplication::applicationDirPath() doesn't resolve to it correctly, or for many other reasons, I don't know. (For reference, my installation seems to be using ~/.local/share/nomic.ai/GPT4All/ as its app directory.)
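
For anyone who wants to check where their build actually looks, here's a minimal standalone sketch that mirrors the lookup logic quoted above (my own code, not from the repo):

// Diagnostic mirroring the quoted path logic (assumes Qt 5/6).
#include <QCoreApplication>
#include <QDebug>
#include <QFileInfo>

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);
    const QString model = QStringLiteral("nomic-embed-text-v1.5.f16.gguf");
#ifdef Q_OS_DARWIN
    const QString fmt = QStringLiteral("%1/../Resources/%2");
#else
    const QString fmt = QStringLiteral("%1/../resources/%2");
#endif
    const QString path = fmt.arg(QCoreApplication::applicationDirPath(), model);
    qDebug() << "looking for embedding model at:" << path
             << "- exists:" << QFileInfo::exists(path);
    return 0;
}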

I'm not immediately in a position to debug this; I was just going to share info but then kept diving into it. If I do debug it, though, I'll share what fixes I find.

Things that didn't help:

cosmic-snow commented 4 months ago

@victorem1 I don't think you're running into the same problem here, although I haven't tried to reproduce the original issue.

The error WARNING: Could not load model for embeddings is, as it says, about the model which produces the embeddings. With v3.0 it should be in the resources/ directory of your installation, because it was made part of the installer. (Previously, you had to download such a model separately.)

How come it's missing for you? How did you install the GPT4All application? In any case, including all the details in your comment is appreciated.

Also, it's possible you were already close to the solution: try putting the separately downloaded/renamed model into the resources/ folder (this depends, of course, on how/where you installed everything). Edit: Ah wait, that one looks like it's a quant; you probably shouldn't try that.

victorem1 commented 4 months ago

@cosmic-snow I installed it via the AUR originally. Installing via the .run file in a VM had no problems. It looks like the AUR version leaves out a ton of files. My bad.