nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License

[Feature] v3.0.0 - No visual progress in Embedding a large (+6 million words) LocalDocs collection #2579

Open SINAPSA-IC opened 4 months ago

SINAPSA-IC commented 4 months ago

Feature Request

Updating an existing LocalDocs collection made of 35 PDF files containing 6+ million words, after three hours I am still waiting for the progress indicator to advance.

Also, there is no "Stop/Pause" button: once we click Update, the process runs until we end the program. Such a button would come in handy when we want to pause the Embedding of one or more collections in order to give processor time to the Embedding of another.

Suggestion: For such large collections, where the Embedding procedure takes a long time, the integer in the progress indicator (0%... 5%... 16%... and so on) should be changed to a decimal number like 0.123%... 15.678%..., so the user knows the Embedding is running, very slowly, but running nevertheless. Watching the same integer for hours on end, the user cannot tell whether the Embedding is still running or the process has stopped altogether (visually, the text is unchanged for hours).

[Screenshot: img30_ui5]
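
A minimal sketch of the suggested formatting, assuming the UI has the progress fraction available as a double (the helper name here is hypothetical, not the actual GPT4All code):

// Hypothetical helper: format progress with three decimals so that
// very slow but live progress is still visible to the user.
#include <QString>

QString progressLabel(double fractionDone) // fractionDone in [0.0, 1.0]
{
    return QString::number(fractionDone * 100.0, 'f', 3) + QLatin1String("%");
}
// progressLabel(0.0012345) yields "0.123%"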

victorem1 commented 4 months ago

I had the same issue with two book PDFs of about 200,000 words: embedding was stuck at 0% and there wasn't much CPU usage either. Trying again with a mere text file of 6 words gives the same result.

[Screenshot]

Running GPT4All in the terminal gave these relevant lines:

[Warning] (Mon Jul 8 12:00:28 2024): embllm WARNING: Local embedding model not found
[Warning] (Mon Jul 8 12:00:28 2024): WARNING: Could not load model for embeddings

The error seems to be fired from here: https://github.com/nomic-ai/gpt4all/blob/11b58a1a157e0af7c42061ad6f93459807ecde59/gpt4all-chat/embllm.cpp#L87

static const QString LOCAL_EMBEDDING_MODEL = u"nomic-embed-text-v1.5.f16.gguf"_s;

...

#ifdef Q_OS_DARWIN
    static const QString embPathFmt = u"%1/../Resources/%2"_s;
#else
    static const QString embPathFmt = u"%1/../resources/%2"_s;
#endif

    QString filePath = embPathFmt.arg(QCoreApplication::applicationDirPath(), LOCAL_EMBEDDING_MODEL);
    if (!QFileInfo::exists(filePath)) {
        qWarning() << "embllm WARNING: Local embedding model not found";
        return false;
    }

The file can't be found. That could be because the file is literally missing, because the ifdef points to the wrong resources folder, because QCoreApplication::applicationDirPath() doesn't resolve to it correctly, or for many other reasons, I don't know. (For reference, my installation seems to be using ~/.local/share/nomic.ai/GPT4All/ as its app directory.)
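
For anyone who wants to check where their build actually looks, here's a minimal standalone sketch that mirrors the lookup logic quoted above (my own code, not from the repo):

// Diagnostic mirroring the quoted path logic (assumes Qt 5/6).
#include <QCoreApplication>
#include <QDebug>
#include <QFileInfo>

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);
    const QString model = QStringLiteral("nomic-embed-text-v1.5.f16.gguf");
#ifdef Q_OS_DARWIN
    const QString fmt = QStringLiteral("%1/../Resources/%2");
#else
    const QString fmt = QStringLiteral("%1/../resources/%2");
#endif
    const QString path = fmt.arg(QCoreApplication::applicationDirPath(), model);
    qDebug() << "looking for embedding model at:" << path
             << "- exists:" << QFileInfo::exists(path);
    return 0;
}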

I'm not immediately in a position to debug this; I was just going to share info but then kept diving into it. If I do debug it, though, I'll share what fixes I find.

Things that didn't help:

cosmic-snow commented 4 months ago

@victorem1 I don't think you're running into the same problem here, although I haven't tried to reproduce the original issue.

The error WARNING: Could not load model for embeddings is, as it says, about the model which produces the embeddings. With v3.0 it should be in the resources/ directory of your installation, because it was made part of the installer. (Previously, you had to download such a model separately.)

How come it's missing for you? How did you install the GPT4All application? In any case, including all the details in your comment is appreciated.

Also, it's possible you were already close to the solution: try putting the separately downloaded/renamed model into the resources/ folder (this depends, of course, on how/where you installed everything). Edit: Ah wait, that one looks like it's a quant; you probably shouldn't try that.

victorem1 commented 4 months ago

@cosmic-snow I installed it via the AUR originally. Installing via the .run file in a VM had no problems. It looks like the AUR version leaves out a ton of files. My bad.