nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License
70.26k stars 7.68k forks

GPT4All can't retrieve info from all files in a LocalDocs folder #2865

Open d76534558 opened 2 months ago

d76534558 commented 2 months ago

Bug Report

(screenshot: model parameters)

GPT4All does not consider all files in the LocalDocs folder as sources

Steps to Reproduce

  1. Create a folder containing 35 PDF files, each about 200 kB in size.
  2. Select the collection to make it available to the chat model.
  3. Prompt the model to list details that exist in the folder's files (prompt: "List the IPs and their properties belonging to TSMC 55nm technology").
  4. Model parameters are attached in the screenshot above.

Actual (Unexpected) Behavior

Listed only 3 IPs, with their properties extracted from only 3 of the 35 PDF files, and reported that the number of sources is 3 (it should be 35).

Expected Behavior

I expected it to list 35 IPs and their properties, extracted from all 35 PDF files.
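This symptom is consistent with a top-k retrieval cap: no matter how many files are indexed, only the k best-scoring snippets are placed in the prompt, so the model can only ever cite k sources. A toy sketch of that behavior (the file names and similarity scores below are made up for illustration, not GPT4All's actual retrieval code):

```python
# Toy illustration: a retriever that keeps only the top-k snippets,
# no matter how many documents are indexed.

def retrieve(query_scores, k=3):
    """Return the k best-scoring (filename, score) pairs."""
    return sorted(query_scores, key=lambda p: p[1], reverse=True)[:k]

# 35 files, one chunk each, with made-up similarity scores
corpus = [(f"055TSMC_IP_{i:02d}_Brief.pdf", 0.5 + (i % 7) / 100)
          for i in range(35)]

sources = retrieve(corpus, k=3)
print(len(sources))  # only 3 of the 35 files ever reach the model
```

With the default k, 32 of the 35 files are invisible to the model regardless of how relevant they are.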

Your Environment

d76534558 commented 2 months ago

(screenshot: environment details)

manyoso commented 2 months ago

Is it possible to share this collection of documents?

I think this is a feature request for an enhancement of LocalDocs to allow for more detailed queries, and possibly also queries about the metadata of the collection itself. If you can share the collection, it would help me understand the feature request better and dig into the technical hurdles we'd have to overcome to implement it.

d76534558 commented 2 months ago

> Is it possible to share this collection of documents?
>
> I think this is a feature request for an enhancement of localdocs to allow for more detailed queries and also possibly queries about the metadata of the collection itself? If you can share the collection it would make it so I could understand the feature request better and dig into the technical hurdles we'd have to overcome to implement the feature

Here they are:

055TSMC_ADC_02_Brief.pdf 055TSMC_ADC_03_Brief.pdf 055TSMC_ADC_12_Brief.pdf 055TSMC_ADC_13_Brief.pdf 055TSMC_CML_01_Brief.pdf 055TSMC_DAC_01_Brief.pdf 055TSMC_DAC_02_Brief.pdf 055TSMC_DAC_03_Brief.pdf 055TSMC_DCDC_03_Brief.pdf 055TSMC_DLL_01_Brief.pdf 055TSMC_IFA_01_Brief.pdf 055TSMC_LDO_01_Brief.pdf 055TSMC_LDO_02_Brief.pdf 055TSMC_LDO_09_Brief.pdf 055TSMC_LNA_01_Brief.pdf 055TSMC_LNA_02_Brief.pdf 055TSMC_LVDS_03_Brief.pdf 055TSMC_MIX_03_Brief.pdf 055TSMC_OSC_01_Brief.pdf 055TSMC_OSC_01_Brief_1.pdf 055TSMC_PA_03_Brief.pdf 055TSMC_PA_05_Brief.pdf 055TSMC_PLL_01_Brief.pdf 055TSMC_PLL_02_Brief.pdf 055TSMC_PLL_03_Brief.pdf 055TSMC_PLL_08_Brief.pdf 055TSMC_PMU_01_Brief.pdf 055TSMC_PVT_01_Brief.pdf 055TSMC_PVT_03_Brief.pdf 055TSMC_QF_01_Brief.pdf 055TSMC_QF_02_Brief.pdf 055TSMC_RS_02_Brief.pdf 055TSMC_RS_05_Brief.pdf 055TSMC_VCO_01_Brief.pdf 055UMC_DCDC_01_Brief.pdf

manyoso commented 2 months ago

Thanks for the test collection! Can you also send the exact prompt and model you're trying to solicit a response with?

d76534558 commented 2 months ago

> Thanks for the test collection! Can you also send the exact prompt and model you're trying to solicit a response with?

You're welcome.

The exact prompt: "List the IPs and their properties belonging to TSMC 55nm technology"

The model: Llama 3.1 8B Instruct 128k

vaibhav-s-dabhade commented 2 months ago

I tried gpt4all to cross-check, since I have the same issue in my own local RAG application, which I am developing with LangChain, Chroma, and Ollama. When I give the query "List all the names" over around 20 resume PDFs embedded in the Chroma vector store, the search only considers 3-5 PDFs at most, and which ones varies. My environment is below.

It works best for all other, more specific queries in both RAG and non-RAG modes.

Linux - Ubuntu 22.04 WSL - VS Code on Windows connected to WSL
GPU - Nvidia CUDA
LLM Server - Ollama

LLM Models Experimented:

  • llama3_1_8b
  • gemma:7b
  • phi3
  • llama3:8b
  • mistral:7b
  • codegemma:7b
  • mistral-nemo:12b-instruct-2407-q8_0

Embedding Models Experimented:

  • sentence-transformers/all-MiniLM-L6-v2
  • nomic-ai/nomic-embed-text-v1.5-Q
  • mixedbread-ai/mxbai-embed-large-v1
  • BAAI/bge-small-en-v1.5
  • sentence-transformers/all-mpnet-base-v2
  • text-embedding-ada-002
  • multi-qa-MiniLM-L6-dot-v1
  • e5-mistral-7b-instruct-v1
  • snowflake-arctic-embed
  • jina-embeddings-v2-base-en
  • dunzhang/stella_en_1.5B_v5
  • Qdrant/bm42-all-minilm-l6-v2-attentions

Embedding Providers Experimented:

  • HuggingFace
  • Fast
  • Ollama

Vector DBs Experimented:

  • Chroma
  • Qdrant
  • Faiss

Prompt options experimented:

  • basic
  • multiquery
  • stuffed
  • historyaware

Retrieval options experimented:

  • similarity
  • mmr
  • similarity_score_threshold

k: 0 - 50, fetch_k: 0 - 50

No success so far.

I think this is a common issue and requires multiple strategies to deal with. I am trying further options in my application.

Not sure what gpt4all's answer to it is.
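The retrieval options listed above (plain similarity vs. MMR, with k/fetch_k up to 50) all share the same ceiling: the retriever returns at most k chunks, so an aggregate query like "list all the names" can never see every document unless k is at least the document count. A minimal, self-contained MMR sketch over made-up similarity scores (not LangChain's actual implementation) illustrates the mechanism:

```python
# Toy maximal-marginal-relevance (MMR) retrieval over precomputed
# similarity scores. MMR trades query relevance against redundancy,
# so it spreads picks across near-duplicate chunks -- but it still
# returns at most k chunks, which caps "aggregate over all files"
# queries just like plain similarity search does.

def mmr(query_sim, pairwise_sim, k=5, lambda_=0.5):
    """Greedy MMR: balance query relevance against redundancy."""
    selected = []
    candidates = list(range(len(query_sim)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((pairwise_sim[i][j] for j in selected),
                             default=0.0)
            return lambda_ * query_sim[i] - (1 - lambda_) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# 20 chunks; chunks 0-9 are near-duplicates of each other
n = 20
query_sim = [0.9 - 0.01 * i for i in range(n)]
pairwise = [[0.95 if (i < 10 and j < 10 and i != j) else 0.1
             for j in range(n)] for i in range(n)]

print(len(mmr(query_sim, pairwise, k=5)))  # still only 5 chunks retrieved
```

MMR helps the "only near-duplicate chunks come back" failure mode, but not the "only k sources are ever cited" one.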

d76534558 commented 2 months ago

> I tried gpt4all to cross-check, since I have the same issue in my own local RAG application, which I am developing with LangChain, Chroma, and Ollama. When I give the query "List all the names" over around 20 resume PDFs embedded in the Chroma vector store, the search only considers 3-5 PDFs at most, and which ones varies. My environment is below.
>
> It works best for all other, more specific queries in both RAG and non-RAG modes.
>
> Linux - Ubuntu 22.04 WSL - VS Code on Windows connected to WSL
> GPU - Nvidia CUDA
> LLM Server - Ollama
>
> LLM Models Experimented: llama3_1_8b, gemma:7b, phi3, llama3:8b, mistral:7b, codegemma:7b, mistral-nemo:12b-instruct-2407-q8_0
>
> Embedding Models Experimented:
>
>   • sentence-transformers/all-MiniLM-L6-v2
>   • nomic-ai/nomic-embed-text-v1.5-Q
>   • mixedbread-ai/mxbai-embed-large-v1
>   • BAAI/bge-small-en-v1.5
>   • sentence-transformers/all-mpnet-base-v2
>   • text-embedding-ada-002
>   • multi-qa-MiniLM-L6-dot-v1
>   • e5-mistral-7b-instruct-v1
>   • snowflake-arctic-embed
>   • jina-embeddings-v2-base-en
>   • dunzhang/stella_en_1.5B_v5
>   • Qdrant/bm42-all-minilm-l6-v2-attentions
>
> Embedding Providers Experimented:
>
>   • HuggingFace
>   • Fast
>   • Ollama
>
> Vector DBs Experimented:
>
>   • Chroma
>   • Qdrant
>   • Faiss
>
> Prompt options experimented:
>
>   • basic
>   • multiquery
>   • stuffed
>   • historyaware
>
> Retrieval options experimented:
>
>   • similarity
>   • mmr
>   • similarity_score_threshold
>
> k: 0 - 50, fetch_k: 0 - 50
>
> No success so far.
>
> I think this is a common issue and requires multiple strategies to deal with. I am trying further options in my application.
>
> Not sure what gpt4all's answer to it is.

Thanks for your rich, informative reply.

Can we conclude that this is a general issue across all current local AI tools?

d76534558 commented 2 months ago

I ran another experiment: I merged the 35 files into one big file, and it still extracted only 3 of the 35 IPs from that single file!

cosmic-snow commented 2 months ago

Can you try increasing Context Size in the model settings and Max document snippets per prompt in the LocalDocs settings?

d76534558 commented 2 months ago

> Can you try increasing Context Size in the model settings and Max document snippets per prompt in the LocalDocs settings?

(screenshot: updated settings)

It seems successful: when I increased "Context Length" from 2048 to 4096 and "Max document snippets per prompt" from 3 to 7, the number of source files it could handle increased from 3 to 7.
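A rough token-budget calculation makes the observed jump plausible: the snippet cap and the context window both bound how many sources can appear. All numbers below are illustrative assumptions (the per-snippet cost and reserved overhead are not GPT4All's actual defaults):

```python
# Back-of-the-envelope context budget (illustrative numbers only):
# if each retrieved snippet costs ~400 tokens, and the system prompt
# plus the answer reserve ~800 tokens, then a 2048-token window fits
# only a few snippets, while 4096 fits roughly twice as many.

def snippets_that_fit(context_tokens, snippet_tokens=400, reserved=800):
    """How many whole snippets fit in the remaining context budget."""
    return max(0, (context_tokens - reserved) // snippet_tokens)

print(snippets_that_fit(2048))  # 3
print(snippets_that_fit(4096))  # 8
```

So even with the snippet setting raised, the context length must grow in step, or the extra snippets simply won't fit in the prompt.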

cosmic-snow commented 2 months ago

You may also want to have a look at this page on the wiki regarding LocalDocs.

manyoso commented 2 months ago

I think this PR may help the situation: https://github.com/nomic-ai/gpt4all/pull/2879. I will test on your specific docs soon.