SINAPSA-IC opened this issue 4 months ago
I had the same issue with 2 book pdfs with about 200,000 words, embedding was stuck at 0% and there also wasn't much CPU usage. Trying again with a mere text file of 6 words gives the same result.
Running GPT4All in the terminal gave these relevant lines:

```
[Warning] (Mon Jul 8 12:00:28 2024): embllm WARNING: Local embedding model not found
[Warning] (Mon Jul 8 12:00:28 2024): WARNING: Could not load model for embeddings
```
The error seems to be fired from here: https://github.com/nomic-ai/gpt4all/blob/11b58a1a157e0af7c42061ad6f93459807ecde59/gpt4all-chat/embllm.cpp#L87

```cpp
static const QString LOCAL_EMBEDDING_MODEL = u"nomic-embed-text-v1.5.f16.gguf"_s;
...
#ifdef Q_OS_DARWIN
static const QString embPathFmt = u"%1/../Resources/%2"_s;
#else
static const QString embPathFmt = u"%1/../resources/%2"_s;
#endif

QString filePath = embPathFmt.arg(QCoreApplication::applicationDirPath(), LOCAL_EMBEDDING_MODEL);
if (!QFileInfo::exists(filePath)) {
    qWarning() << "embllm WARNING: Local embedding model not found";
    return false;
}
```
The file can't be found. That could be because the file is literally missing, because the `#ifdef` points to the wrong resources folder, because `QCoreApplication::applicationDirPath()` doesn't resolve to it correctly, or for many other reasons, IDK. (For reference, my installation seems to be using `~/.local/share/nomic.ai/GPT4All/` as its app directory.)
I'm not immediately in a position to debug this; I was just going to share info but then kept diving down into it. If I do get to it, though, I'll share what fixes I find.
Things that didn't help:
- Deleting `embeddings_v0.dat`, `localdocs_v1.db` and `localdocs_v2.db`, or any other files.
- Renaming `nomic-embed-text-v1.5.Q4_0.gguf` to `nomic-embed-text-v1.5.f16.gguf`, even if just hoping for a different error. It gave the same error.

@victorem1 I don't think you're running into the same problem here, although I haven't tried to reproduce the original issue.
The error `WARNING: Could not load model for embeddings` is, as it says, about the model which produces the embeddings. With v3.0 it should be in the `resources/` directory of your installation, because it was made part of the installer. (Previously, you'd have to download such a model separately.)
How come it's missing for you? How did you install the GPT4All application? In any case, adding all the details in a comment is appreciated.
Also, it's possible you were already close to a solution: try putting the separately downloaded/renamed model into the `resources/` folder (where that is depends, of course, on how/where you installed everything). Edit: Ah wait, that one looks like it's a quant, so probably better not to try that.
@cosmic-snow I installed it via the AUR originally. Installing via the .run file in a VM had no problems. It looks like the AUR version leaves out a ton of files. My bad.
Feature Request
Updating an existing LocalDocs collection made of 35 PDF files containing over 6 million words: after three hours I am still waiting for it to finish.
Also, there is no "Stop/Pause" button: once we click Update, the process runs until we exit the program. Such a button would come in handy when we want to pause the embedding of one or a few collections in order to give processor time to the embedding of other collections.
Suggestion: for (such) large collections, where the embedding procedure takes a long time, the integer in the progress indicator (0%... 5%... 16%... and so on) should be changed to a decimal number like 0.123%... 15.678%..., so the user can see that the embedding is running: very slowly, but running nevertheless. Watching the same integer for hours on end, the user cannot tell whether the embedding is still running or the process has stopped altogether, since the display is visually unchanged for hours.