nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License

LocalDocs Hanging on v3.4.1 when indexing on Windows 10 #3086

Closed: TheAlex25 closed this issue 5 days ago

TheAlex25 commented 1 week ago

GPT4All version: v3.4.1
Operating System: Microsoft Windows 10 (Version 10.0.19045 Build 19045)
Computer model: Microsoft Surface Book 2

Hello folks. LocalDocs hangs while indexing on GPT4All v3.4.1 (the latest update). I am using a Windows 10 machine (Microsoft Surface Book 2) with 8 GB of RAM, with the indexing running on an Intel i7 CPU. This is my first time using GitHub, so forgive me for any formatting issues. Just a few seconds ago it started indexing again, jumping from 4% to 11% indexed, but it stopped at 20 files (11%). Does GPT4All now index in a stepped pattern, where it indexes for a few minutes, embeds, and then indexes again? I will report back in a few hours to see whether GPT4All is simply using a different method of indexing.

By the way, I am really grateful for locally run AI, as I can chat privately with AI, and not have to pay a single cent to ClosedAI.

Image

These are the sizes of some of the files. I downloaded a bunch of PDFs on theology from Protestant, Orthodox, and Catholic literature so the machine can compare, analyze, and understand theology. Before GPT4All had bugs, I had multiple folders covering science, history, and movie reviews. If GPT4All no longer hangs, I can use it as a general encyclopedia to chat with. Kinda like a Star Trek LCARS computer or having a conversation with Data.

Image

manyoso commented 1 week ago

"I downloaded a bunch of PDFs on theology from Protestant, Orthodox and Catholic literature so the machine can compare, analyze and understand theology."

Do you have a link where I can download the identical set to see if I can reproduce? None of the developers have been able to reproduce hangs with pdf parsing yet...

"Does GPT4All now index in stepped pattern, where it indexes for a few minutes, embeds, and then indexes again?"

Yes, but it has always done this. The relative step sizes changed in v3.4.0, though, so that is likely what you're noticing. Are you sure it is truly hanging?
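
(For readers curious what that stepped pattern looks like, here is a minimal, purely illustrative sketch of an index-then-embed loop. The function names and the batch size of 20 files are assumptions for this example, not GPT4All's actual implementation.)

```python
# Purely illustrative sketch of a "stepped" index-then-embed loop.
# All names and the batch size are hypothetical; this is not GPT4All code.
BATCH_SIZE = 20  # assumed step size; the real step sizes changed in v3.4.0

def parse_file(path: str) -> list[str]:
    # Placeholder: pretend every file splits into three text chunks.
    return [f"{path}#chunk{i}" for i in range(3)]

def embed_chunks(chunks: list[str]) -> None:
    # Placeholder for the embedding call; in the real app this is the
    # phase where the file/percentage counters appear to pause.
    pass

def index_collection(files: list[str]) -> None:
    for start in range(0, len(files), BATCH_SIZE):
        batch = files[start:start + BATCH_SIZE]
        chunks = [c for path in batch for c in parse_file(path)]
        embed_chunks(chunks)
        done = min(start + BATCH_SIZE, len(files))
        print(f"indexed {done}/{len(files)} files ({100 * done // len(files)}%)")

index_collection([f"doc{i}.pdf" for i in range(47)])
```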

TheAlex25 commented 1 week ago

It appears LocalDocs is still hanging: three hours later it is at 11% indexed. It has embedded a lot (10,563 embeddings) but has not indexed anything further. Here are the files.

Bishop Sotirios - Orthodox Catechism, 4th Edition [Religious Document] (2015).pdf
Bp. Athanasius Schneider - Credo [Book] (2023).pdf
Brother Lawrence - The Practice of the Presence of God [Book] (c. 1692).pdf
C.S. Lewis - Mere Christianity [Book] (1952).pdf
Catholic Bible (CPDV) [Bible] (2009).pdf
Church of Nigeria - The Order for Holy Communion [Religious Document] (1996).pdf
Cliff McManis - Apologetics by the Book [Book] (2017).pdf
Corrie Ten Boom - The Hiding Place [Book] (1971).pdf
Diarmaid MacCulloch - Reformation [Book] (2004).pdf
Dioscese of Rochester - Traditional Catholic Prayers [Prayer Guide].pdf
ELS - Small Lutheran Catechism [Religious Document] (2016).pdf
Episcopal Church - Episcopalian Book of Common Prayer [Religious Document] (2006).pdf
Eric Mason - Urban Apologetics [Book] (2021).pdf
Fr. Thomas de Saint-Laurent - The Book of Confidence [Book] (1989).pdf
Jacobus Arminius - Works Vol. 1 [Book] (c. 1600).pdf
Jacobus Arminius - Works Vol. 2 [Book] (c. 1600).pdf
Jacobus Arminius - Works Vol. 3 [Book] (c. 1600).pdf
Jean Baptiste Chautard - The Soul of the Apostolate [Book] (1912).pdf
John Anthony McGuckin - The Orthodox Church [Book] (2008).pdf
John Calvin - The Institutes of the Christian Religion [Book] (1536).pdf
John Piper - The Future of Justification [Book] (2007).pdf
John Stott - Jesus is Lord [Book] (2016).pdf
John Warwick Montgomery - Christ as Centre and Circumference [Book] (2012).pdf
Joseph Tissot - The Interior Life [Book] (c. 1900).pdf
Journal for Baptist Theology & Ministry Vol. 6, No. 1 [Religious Document] (2009).pdf
Lawrence S. Cunningham - A Brief History of Saints [Book] (2005).pdf
Louis R. Tarsitano - An Outline of an Anglican Life [Book] (1994).pdf
Michael Ramsey - The Anglican Spirit [Book] (2004).pdf
N.T. Wright - Surprised By Hope [Book] (2008).pdf
Nick R. Needham - 2000 Years of Christ's Power, Vol. 1 [Book] (1997).pdf
Open Hymnal [Song Book] (2014).pdf
Paul Thigpen - Manual For Spiritual Warfare [Book] (2014).pdf
R. Herbert - A Brighter Light [Book] (2024).pdf
Richard Hooker - Laws of Ecclesiastical Polity Vol. 1 [Book] (2019).pdf
Salvador Canals - Jesus as Friend [Book] (1962).pdf
St. Augustine of Hippo - Confessions of Saint Augustine [Book] (c. 400 AD).pdf
St. John Climacus - The Ladder of Divine Ascent [Book] (c. 600 AD).pdf
St. Louis Marie de Montfort - True Devotion to the Blessed Virgin [Book] (1987).pdf
St. Thomas More - Dialogue of Comfort Against Tribulation [Book] (c. 1550s).pdf
Steven Ball - A Christian Physicist Examines the Age of the Earth [Journal Article] (2003).pdf
The Anglican Communion Office UK - Principles of Canon Law [Religious Document] (2008).pdf
The Fatima Center - Catholic Catechism [Religious Document] (2017).pdf
Thomas Aquinas - Summa Contra Gentiles [Book] (c. 1260).pdf
Tony Evans - Victory in Spiritual Warfare [Book] (2011).pdf
USCCB - Catechism of the Catholic Church [Religious Document] (2011).pdf
Watchman Nee - The Normal Christian Life [Book] (1957).pdf
William Lane Craig - The Atonement [Book] (2018).pdf

manyoso commented 1 week ago

I downloaded and indexed/embedded all of these in just a few minutes. I'm using CUDA as the embedding device, though, which is why it is so fast. Let's see what happens when I use the CPU... however, there are no hangs in the indexing for me. Do you see embeddings continuing, or the number of words going up, even though the index is stopped at 11%?

TheAlex25 commented 1 week ago

Embeddings are still going, now at 13,661. The number of words indexed is frozen at 2.61 million. Are you using Windows 10? Now GPT4All is not responding. Maybe the RAM usage is not optimized?

Wow, your computer is fast. When GPT4All worked, it took 12 hours to embed this stuff, since my CPU is in the tens-of-gigaFLOPS range, not multiple teraFLOPS...
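
(A rough back-of-the-envelope check, using assumed round numbers rather than measured figures, suggests that timing gap is about what you'd expect:)

```python
# Back-of-the-envelope check with assumed round numbers (not measurements):
# tens of gigaFLOPS on CPU vs. a few teraFLOPS on GPU is roughly a 100x gap,
# which turns "a few minutes" of GPU embedding into many hours on CPU.
cpu_gflops = 50        # assumed: tens of gigaFLOPS
gpu_gflops = 5_000     # assumed: a few teraFLOPS
speedup = gpu_gflops / cpu_gflops          # ~100x
gpu_minutes = 7                            # assumed: "just a few minutes"
cpu_hours = gpu_minutes * speedup / 60     # ~12 hours
print(f"~{speedup:.0f}x slower on CPU, roughly {cpu_hours:.0f} hours")
```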

manyoso commented 1 week ago

I think we may have a speed regression when generating embeddings with CPU. https://github.com/nomic-ai/gpt4all/issues/3088

joomgallerytestit commented 1 week ago

Hello,

unfortunately I have to report that the LocalDocs function actually worked better in version 3.3.0 than in 3.4.1, although there were problems reading certain PDFs.

I installed 3.4.1 yesterday and hoped that the LocalDocs build process would no longer abort when adding documents from a folder. However, that is now the least of my problems: GPT4All 3.4.1 no longer even manages to summarize the content of a court ruling it has read in (1 PDF) and instead hallucinates completely different content.

I use Windows 10 64-bit Professional with an Nvidia 4080 16GB, a Core i7, and 32GB RAM.

Kind regards JGtestit

manyoso commented 1 week ago

"However, that is now the least of my problems: GPT4All 3.4.1 no longer even manages to summarize the content of a court ruling it has read in (1 PDF) and instead hallucinates completely different content."

It isn't likely that any regression happened regarding hallucination, as this has more to do with the model than with the actual LocalDocs retrieval. However, there is a bug that unfortunately allows retrieval from a collection to happen even though the collection is not selected. This has been fixed but is awaiting release: https://github.com/nomic-ai/gpt4all/issues/3076

But pertinent to this issue: have you experienced hangs while indexing/embedding your PDFs with v3.4.1? Unfortunately, I am still unable to reproduce any hangs with PDF indexing/embedding like those described in the OP.

manyoso commented 1 week ago

I think I may have just been able to reproduce in a contrived manner and diagnosed the problem. Working on a fix...

manyoso commented 1 week ago

I think this might be what you're encountering: https://github.com/nomic-ai/gpt4all/pull/3089

manyoso commented 6 days ago

Can you confirm this is fixed with the new v3.4.2 release?

joomgallerytestit commented 6 days ago

Hi,

I have done a few tests and can confirm that under Windows 10 Prof. with an Nvidia 4070 Super 16GB, the GPT4All indexing process with Llama 3.2 3B Instruct no longer hangs in 3.4.2.

Unfortunately, there is another problem: although I select a collection with indexed files under LocalDocs, the answers to specific test questions make no reference to these documents/sources, even though the indexed source is listed below the answer.

This must have something to do with the fact that each answer begins with "Your message was too long and could not be processed." even though the question is not even one line long. Context Length is set to 2048 and Max Length to 4096.

Kind regards JGtestit

P.S.: How does GPT4All work with RAG? If LocalDocs is activated, is the response based EXCLUSIVELY on its contents, or are they merely prioritized?

manyoso commented 5 days ago

Hi @joomgallerytestit, can you show me the settings you have here:

Image

I suspect you've changed the advanced defaults if you're running into, "Your message was too long and could not be processed."

I've tried to mark these settings as Advanced and added descriptions that explain what can happen when you change them. Any suggestions for improvement to that language to make it clearer?

joomgallerytestit commented 5 days ago

Hi @manyoso,

I had a Document snippet size of 2048 and Max document snippets of 50 here under ADVANCED.

To what extent is this a problem with the configuration I am using (Nvidia 4080 Super 16GB, Llama 3.2 3B Instruct)?

In any case, I reset the settings under LocalDocs with Restore Defaults, which promptly set Document Snippet Size to 512 and Max Document Snippets to 3, values that, in my opinion, are too low for my purposes.

@manyoso Is it possible that my problem is due to too low a value for Model -> Context Length (2048) and/or too high a value for Prompt Batch Size (default 128)? Llama 3.2 3B Instruct supports a 128K context, not just 2000 tokens.

I then indexed a PDF of about 1000 pages as a test. I no longer get the message "Your message was too long and could not be processed.", but not a single question referring to the LocalDocs document selected in the top left corner is answered correctly, even though Source XY is listed below the answer. I will now test again with documents of only a few pages and see whether suitable answers are generated.

Regardless, I would be grateful for an answer to the question above, even if it does not work for me at the moment: How does GPT4All work with RAG? If LocalDocs is activated, is the response based EXCLUSIVELY on its contents, or are they merely prioritized?

Thanks and kind regards JGTestit

manyoso commented 5 days ago

You're attempting to inject 50 different excerpts of 512 characters each into the context window whenever you ask the model a question. That is roughly 25,000 characters of data going into a model configured to handle only about 2000 tokens in total. You might be able to set the context window higher, but regardless, the UI labels these settings "WARNING: Advanced usage only." and explains that altering them can result in failure, which is what you've run into.
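
(A quick back-of-the-envelope estimate, assuming the common rule of thumb of roughly 4 characters per token, shows why those settings overflow the window, and why the restored defaults of 512 characters x 3 snippets fit comfortably:)

```python
# Rough estimate of why 50 snippets of 512 characters each overflow a
# 2048-token context. The 4 characters/token ratio is a rule of thumb,
# not an exact figure for any particular tokenizer.
snippet_chars = 512
chars_per_token = 4                   # assumed rule of thumb
context_length = 2048                 # model Context Length reported above

for max_snippets in (50, 3):          # user's setting vs. restored default
    injected_chars = snippet_chars * max_snippets
    injected_tokens = injected_chars / chars_per_token
    fits = "fits" if injected_tokens < context_length else "overflows"
    print(f"{max_snippets} snippets: {injected_chars} chars "
          f"~ {injected_tokens:.0f} tokens -> {fits} a {context_length}-token window")
```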

Regardless, this bug is being closed, as the LocalDocs hang seems to be fixed. Thank you for confirming that!