nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License

LocalDocs - some docx-files not opened or not chunked properly #3138

Open stoniesed opened 2 days ago

stoniesed commented 2 days ago

Hi, I'm trying to set up a database with LocalDocs, but when I open log.txt I can see that around 1/3 of the docx files cannot be opened: "[Warning] (Wed Oct 23 22:25:54 2024): LocalDocs ERROR: Failed to open DOCX:.....". Is this a known issue, and will there be a fix?

With almost identical documents based on the same template, some are opened and others are not. It seems to happen quite randomly, and it happens every time I start an embedding process.

LocalDocs shows the number of all files indexed, including the files that are not opened. I've also noticed there are three kinds of docx files: those that GPT4All can open for embedding; files that cannot be opened for embedding but show no error in log.txt; and files that cannot be opened and do show an error in log.txt.

I've reinstalled Office and GPT4All, cleaned the registry, etc.

manyoso commented 2 days ago

Ark 3_CV_XX XXX.docx, Ark 3_CV_1.docx, Ark 3_CV_2.docx, foo.docx

Of these, only Ark 3_CV_1.docx and Ark 3_CV_2.docx appear to be chunked properly. The others are missing chunks although they all do open for me.

stoniesed commented 2 days ago

Hi, and thanks for the swift reply! Does this mean that you can embed those two files without any problem with nomic-embed-text? And that the last file does open, but no chunking/embedding takes place? This is in line with what I have seen when testing embedding of docx files only. Some files do open but are not chunked/embedded properly, so there is no error report in log.txt. It might look like everything is OK, but it's not.

Edit (sorry): Ark 3_CV_1.docx and Ark 3_CV_2.docx also open in my setup. They are examples of docx files that did work.

These are only a couple of files out of some thousands that will not open or are not chunked properly; I can send you more. Only docx files have these problems. My PC has a fresh W11 install and I've reinstalled GPT4All several times to be sure nothing is wrong with the setup.

By the way, I'm not aware of any foo-document:)

manyoso commented 1 day ago

I've attached a PR that seems to help with missing chunks in the files you've given. I still haven't seen any instances of the error message you describe about LocalDocs ERROR: Failed to open DOCX unfortunately.

Another note: some of the files you gave are formatted with tables. For example, Ark 3_CV_XX XXX.docx is a CV laid out with a table, and we don't parse tables. That leaves all the text inside the tables dropped on the floor. I'm going to open another issue to track that one so we get it fixed.
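To see how much of a given document sits inside tables, the following is a minimal Python sketch that reads `word/document.xml` from the DOCX archive and separates paragraphs outside tables from paragraphs inside top-level `w:tbl` elements. It is not GPT4All's parser; the function names `paragraph_text` and `split_docx_text` are hypothetical and exist only for this illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraph_text(p):
    """Join the text of all w:t elements inside one w:p element."""
    return "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))


def split_docx_text(source):
    """Return (outside, inside) paragraph texts for a .docx path or file object.

    Hypothetical helper: 'outside' holds paragraphs that are direct children
    of w:body, 'inside' holds paragraphs nested in top-level tables.
    """
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find("w:body", NS)
    outside = [paragraph_text(p) for p in body.findall("w:p", NS)]
    inside = [
        paragraph_text(p)
        for tbl in body.findall("w:tbl", NS)  # top-level tables only
        for p in tbl.iter(f"{{{W}}}p")        # includes nested-table cells
    ]
    return outside, inside
```

If `inside` holds most of a document's text while `outside` is nearly empty, that document is likely one whose content is lost when tables are skipped.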

stoniesed commented 1 day ago

I've found the reason for the error messages in log.txt: nomic-embed-text does not accept the Nordic letters æøå in file names! The files give me errors but not you, since I renamed them before sending them to you. However, this does not solve the main problem: docx files that will not open and do not give any error message...

manyoso commented 1 day ago

I've found the reason for the error messages in log.txt: nomic-embed-text does not accept the Nordic letters æøå in file names! The files give me errors but not you, since I renamed them before sending them to you.

Wait, are you saying the reason that you're getting the LocalDocs ERROR: Failed to open DOCX is because of a filename issue? I'm guessing this is on Windows? Can you give the exact filename and confirm this is on Windows?

stoniesed commented 1 day ago

Yes, but file names with these letters cause no problem whatsoever in Office, Windows, or anywhere else... So is the problem on the Windows side? How can I find that out: by changing W11 to my native language (I currently use EN)? I have now done more testing, and it is the same for doc files, but not for pdf files! Every time an æ, ø or å is part of a doc/docx file name, running GPT4All with such a file gives the following error in the log file and stops the file from opening:

[Warning] (Sat Oct 26 12:52:26 2024): ERROR: Watched folder does not exist in db "C:\1. Prosjekter\Kartlegging av i elver\01 Contract\Tilbud\Samlefolder\New folder\New folder"
[Warning] (Sat Oct 26 12:52:39 2024): LocalDocs ERROR: Failed to open DOCX: C:/1. Prosjekter/Kartlegging av i elver/01 Contract/Tilbud/Samlefolder/New folder/New folder/CV3_not OK adding ø.docx

Adding such a letter to any doc/docx file name means the file will not open, including files that did open OK before (they now generate the error message shown above).
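To find every file in a folder that could be affected by this, a small Python sketch like the one below lists doc/docx files whose names contain any non-ASCII character (æ, ø and å included). The helper names `has_non_ascii` and `non_ascii_names` are hypothetical, and the check is deliberately broad: it flags all non-ASCII names and does not try to reproduce how GPT4All opens files.

```python
from pathlib import Path


def has_non_ascii(name):
    """True if the file name contains at least one non-ASCII character."""
    return not name.isascii()


def non_ascii_names(folder, suffixes=(".doc", ".docx")):
    """Return sorted names of doc/docx files under folder with non-ASCII names."""
    hits = []
    for path in Path(folder).rglob("*"):
        if (
            path.is_file()
            and path.suffix.lower() in suffixes
            and has_non_ascii(path.name)
        ):
            hits.append(path.name)
    return sorted(hits)
```

Running `non_ascii_names` on the watched folder gives a list of candidates to rename before re-indexing.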

I've now added three CVs:
CV1 should be fine.
CV2 will not open and does not generate an error message.
CV3 was a file with ø in its name. Removing the ø also removes the warning, but the file still will not open.
CV3, keeping ø in the file name: the file will not open and an error message is generated as shown above.

CV1_OK .docx, CV2_not OK.docx, CV3_not OK adding ø.docx, CV3_not OK.docx

By the way, I see some doc and docx files generating the error "Ignoring file with binary data". Looking into the files, I see no binary data. If there is any, why not just leave the binary data out? At least we get an error report, but when handling many files it's not easy to keep an overview of which files were embedded OK and which were not.
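One way to get an overview of the files that do produce an error is to scan log.txt for the message quoted earlier in this thread. The Python sketch below assumes one log entry per line and matches only the "LocalDocs ERROR: Failed to open DOCX:" text. The function name `failed_docx_paths` is hypothetical, and files that fail silently will not show up here.

```python
import re

# Matches the message text quoted in this thread; assumes one entry per line.
FAILED_DOCX = re.compile(
    r"LocalDocs ERROR: Failed to open DOCX:\s*(?P<path>.+?)\s*$"
)


def failed_docx_paths(log_text):
    """Return the paths from 'Failed to open DOCX' lines, in log order."""
    paths = []
    for line in log_text.splitlines():
        match = FAILED_DOCX.search(line)
        if match:
            paths.append(match.group("path"))
    return paths
```

The log text can be read with something like `Path("log.txt").read_text(encoding="utf-8", errors="replace")`, where both the location of the file and its encoding are assumptions to adjust for your installation.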