nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License

add local docs got to 93% then the number of embeddings of total embeddings went negative #2729

Open andypbaker opened 1 month ago

andypbaker commented 1 month ago

Bug Report

Updated to version 3.1.0; the LocalDocs collections had to be updated, so I selected the update action (can't remember what the verb was, but it meant update). My collection folder contained 47 documents of varying sizes. Embedding got to 93% after about 8 hours, then the percentage changed to 0% and the count changed to -18446744073709319000 of 33026 embeddings. It might have got to 32767 and then turned negative.

(screenshot: Screenshot 2024-07-24 185412)
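For what it's worth, the displayed magnitude is within a few hundred thousand of 2^64, which would be consistent with a small negative count being displayed through an unsigned 64-bit integer; that is only a guess from the number itself, not from the gpt4all code. A quick arithmetic check in R:

## 2^64 printed in full (exactly representable as a double)
sprintf("%.0f", 2^64)

## the reported magnitude is within a few hundred thousand of 2^64
## (doubles lose precision above 2^53, so this is only approximate)
sprintf("%.0f", 2^64 - 18446744073709319000)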

Steps to Reproduce

  1. Add a LocalDocs folder that contains e.g. 100 documents, enough to create 33026 or more embeddings

Expected Behavior

Expected it to reach 100% complete.

Your Environment

mmahmoudian commented 1 month ago

I have been facing this issue as well. The file name is clean and contains only upper-case letters, lower-case letters, numbers, and hyphens, so it is not a duplicate of #2658.

(screenshot)

I went down a long, long path to find what caused the crash. Surprisingly enough, the problematic file is NOT the one shown in the gpt4all GUI (number 2 in the screenshot). I found this by copy-pasting the txt files one by one into that folder and watching how the embedding process went.

I finally found the file and managed to fix it by removing all the "weird" characters:

sed -i 's/[^a-zA-Z0-9 \-\(\)\.,;:\>\<\?=+%]//g' THE-FILE
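For anyone who prefers to stay in R, a rough equivalent of that sed one-liner (the file names here are only placeholders); this is also one way to produce the cleaned.txt that the script below compares against:

## strip everything outside the allowed character set, mirroring the sed command
dirty <- readLines("dirty.txt", warn = FALSE)
writeLines(gsub("[^a-zA-Z0-9 ()<>?.,;:=+%-]", "", dirty), "cleaned.txt")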

So the next step was to find which character was the problematic one. This short R script shows us the way:

library("Unicode")

setwd("/home/mehrad/Documents/Data/gpt4all/test/")

## read files
dirty <- readLines("dirty.txt", warn = FALSE)
cleaned <- readLines("cleaned.txt", warn = FALSE)

## split to single unique characters
cleaned_chars <- strsplit(x = cleaned, split = "") |>
    unlist() |>
    unique()

dirty_chars <- strsplit(x = dirty, split = "") |>
    unlist() |>
    unique()

# get the diff of these chars along with their unicode representation
setdiff(dirty_chars, cleaned_chars) |>
    sapply(function(x){
        tmp_unicode <- as.character(Unicode::as.u_char(utf8ToInt(x)))
        return(c(x, tmp_unicode))
    }) |>
    rbind.data.frame() |>
    t()

This results in

     [,1]   [,2]    
/    "/"    "U+002F"
-    "-"    "U+002D"
Š    "Š"    "U+0160"
—    "—"    "U+2014"
\002 "\002" "U+0002"
“    "“"    "U+201C"
”    "”"    "U+201D"
’    "’"    "U+2019"
α    "α"    "U+03B1"
|    "|"    "U+007C"
∂    "∂"    "U+2202"
−    "−"    "U+2212"
_    "_"    "U+005F"
≥    "≥"    "U+2265"
Δ    "Δ"    "U+0394"
∪    "∪"    "U+222A"
Π    "Π"    "U+03A0"
∈    "∈"    "U+2208"
×    "×"    "U+00D7"
β    "β"    "U+03B2"
γ    "γ"    "U+03B3"
{    "{"    "U+007B"
}    "}"    "U+007D"
′    "′"    "U+2032"
‖    "‖"    "U+2016"
≤    "≤"    "U+2264"
λ    "λ"    "U+03BB"
η    "η"    "U+03B7"
˜    "˜"    "U+02DC"
ˆ    "ˆ"    "U+02C6"
√    "√"    "U+221A"
ï    "ï"    "U+00EF"

One of these characters is breaking gpt4all. I tried inserting each character into a text file (in "hello {} world", the character would replace the "{}") and having gpt4all embed it. That attempt did not reproduce the crash. So whatever the problem is, it is related to these characters (since removing them from the text fixed the issue), but how they affect the indexer is unknown to me.
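If someone wants to repeat that experiment, here is a small R sketch (continuing from the objects in the script above; the output file names are made up) that writes one "hello X world" file per suspect character, so each one can be dropped into a LocalDocs folder on its own:

## one test file per suspect character; dirty_chars and cleaned_chars
## come from the script above, the file names are arbitrary
suspects <- setdiff(dirty_chars, cleaned_chars)
for (i in seq_along(suspects)) {
    writeLines(paste0("hello ", suspects[i], " world"),
               sprintf("test-char-%03d.txt", i))
}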

fuzhyperblue commented 6 days ago

I also have the same exact problem, probably an issue with Turkish characters.

(screenshot: CleanShot 2024-09-01 at 10 31 06)