microsoft / graphrag

A modular graph-based Retrieval-Augmented Generation (RAG) system
https://microsoft.github.io/graphrag/
MIT License

[Bug]: Errors in local search #451

Closed CCzzzzzzz closed 6 days ago

CCzzzzzzz commented 2 weeks ago

Describe the bug

I successfully ran the global search, but I encountered an error when running the local search.

Error embedding chunk {'OpenAIEmbedding': 'Error code: 400 - {\'error\': "\'input\' field must be a string or an array of strings"}'}
Traceback (most recent call last):
  File "C:\Users\cpdft\.conda\envs\myconda\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\cpdft\.conda\envs\myconda\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\graphrag\query\__main__.py", line 75, in <module>
    run_local_search(
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\graphrag\query\cli.py", line 154, in run_local_search
    result = search_engine.search(query=query)
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\graphrag\query\structured_search\local_search\search.py", line 118, in search
    context_text, context_records = self.context_builder.build_context(
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\graphrag\query\structured_search\local_search\mixed_context.py", line 139, in build_context
    selected_entities = map_query_to_entities(
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\graphrag\query\context_builder\entity_extraction.py", line 55, in map_query_to_entities
    search_results = text_embedding_vectorstore.similarity_search_by_text(
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\graphrag\vector_stores\lancedb.py", line 118, in similarity_search_by_text
    query_embedding = text_embedder(text)
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\graphrag\query\context_builder\entity_extraction.py", line 57, in <lambda>
    text_embedder=lambda t: text_embedder.embed(t),
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\graphrag\query\llm\oai\embedding.py", line 96, in embed
    chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
  File "C:\Users\cpdft\.conda\envs\myconda\lib\site-packages\numpy\lib\function_base.py", line 550, in average
    raise ZeroDivisionError(
ZeroDivisionError: Weights sum to zero, can't be normalized
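
For what it's worth, the ZeroDivisionError at the end is only a downstream symptom: the 400 response means every chunk fails to embed, so the weighted average inside embed() runs over empty lists. A minimal sketch of that failure mode, assuming only numpy:

import numpy as np

# No chunk was embedded successfully, so both lists stay empty.
chunk_embeddings = []
chunk_lens = []

try:
    np.average(chunk_embeddings, axis=0, weights=chunk_lens)
except ZeroDivisionError as err:
    print(err)  # Weights sum to zero, can't be normalized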

Steps to reproduce

No response

Expected Behavior

No response

GraphRAG Config Used

llm:
  api_key: ollama
  type: openai_chat # or azure_openai_chat
  model: gemma2
  model_supports_json: true # recommended if this is available for your model.
  api_base: http://localhost:11434/v1

embeddings:
  llm:
    api_key: lm-studio
    type: openai_embedding # or azure_openai_embedding
    model: nomic-ai\nomic-embed-text-v1.5-GGUF\nomic-embed-text-v1.5.Q4_K_M.gguf
    api_base: http://localhost:1234/v1
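
As a quick sanity check of the two endpoints above (a sketch only: ports, keys, and model names are copied from this config, the openai Python client is assumed to be installed, and the embedding model identifier should be adjusted to whatever LM Studio actually reports):

from openai import OpenAI

# Chat model served by Ollama (llm section above).
chat = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = chat.chat.completions.create(
    model="gemma2",
    messages=[{"role": "user", "content": "ping"}],
)
print(reply.choices[0].message.content)

# Embedding model served by LM Studio (embeddings section above). Note that
# `input` must be a string or a list of strings; a list of token IDs is what
# triggers the 400 error reported in this issue.
emb = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
result = emb.embeddings.create(
    model="nomic-ai/nomic-embed-text-v1.5-GGUF/nomic-embed-text-v1.5.Q4_K_M.gguf",
    input=["hello world"],
)
print(len(result.data[0].embedding))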

Logs and screenshots

No response

Additional Information

SmallliDinosaur commented 2 weeks ago

Same here. It seems like a base64-vs-strings problem. The screenshot is from LM Studio. [image]

CCzzzzzzz commented 2 weeks ago

Same here. It seems like a base64 and strings problem. The picture is from LM Studio. [image]

Why does this problem only occur in local search? How did you solve it?

SmallliDinosaur commented 2 weeks ago

Same here. It seems like a base64 and strings problem. The picture is from LM Studio. [image]

Why is this problem only in local search? How did you solve?

Sorry, I haven't solved it. But I am looking at related information about LM Studio: https://github.com/langchain-ai/langchain/issues/21318. From what you said, it feels like it's a local issue. Thank you, that gave me some inspiration.

Kingatlas115 commented 2 weeks ago

It's something in the community extraction scripts or the LLM parser scripts; I just can't nail down what.

CCzzzzzzz commented 2 weeks ago

The same. It seems like the base64 and strings problem. The picture was the LMStudio. image

It seems it really is a problem with LM Studio. I don't know how to solve it, but I succeeded with xinference.

812406210 commented 2 weeks ago

The same. It seems like the base64 and strings problem. The picture was the LMStudio. image

It seems that it is really a problem with LMstudio. Don't know how to solve it, but I succeeded in xinference.

Hello, I use Ollama and ran into the same problem. If I use xinference, how do I get the API URL?

CCzzzzzzz commented 2 weeks ago

The same. It seems like the base64 and strings problem. The picture was the LMStudio. image

It seems that it is really a problem with LMstudio. Don't know how to solve it, but I succeeded in xinference. hello, i use ollama occured same problem. use xinference ,the url api how to get ?

http://localhost:"ollama_or_xinference_default_port"/v1

812406210 commented 2 weeks ago

The same. It seems like the base64 and strings problem. The picture was the LMStudio. image

It seems that it is really a problem with LMstudio. Don't know how to solve it, but I succeeded in xinference. hello, i use ollama occured same problem. use xinference ,the url api how to get ?

http://localhost:"ollama_or_xinference_default_port"/v1

Thanks, but with xinference it throws this error: ValueError: Query vector size 768 does not match index column size 1536

KylinMountain commented 2 weeks ago

The same. It seems like the base64 and strings problem. The picture was the LMStudio. image

It seems that it is really a problem with LMstudio. Don't know how to solve it, but I succeeded in xinference. hello, i use ollama occured same problem. use xinference ,the url api how to get ?

http://localhost:"ollama_or_xinference_default_port"/v1

thx , but use xinference ,it happen this error ValueError: Query vector size 768 does not match index column size 1536

It looks like you are using different embedding models for indexing and for querying?
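
One way to check is to compare the vector width stored in the LanceDB index with the width the query-time embedding model returns. A rough diagnostic sketch (the path and table name below are illustrative, not GraphRAG's guaranteed layout):

import lancedb

# Point this at the LanceDB directory your indexing run produced.
db = lancedb.connect("output/lancedb")
table = db.open_table("entity_description_embeddings")  # hypothetical table name
print(table.schema)  # the vector column has a fixed width, e.g. 1536 or 768

# The query-time embedder must return vectors of exactly that width; otherwise
# LanceDB reports "Query vector size X does not match index column size Y".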

KylinMountain commented 2 weeks ago

I am using llama.cpp to serve the embedding API; it is more stable. You can try that.
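
For reference, llama.cpp's server exposes an OpenAI-compatible /v1/embeddings endpoint when embeddings are enabled, so GraphRAG's embeddings api_base can point at it. A rough sketch of a request against such a server (the port, launch command, and model name are assumptions, not verified settings):

import requests

# Assumes a server started roughly like:
#   llama-server -m nomic-embed-text-v1.5.Q4_K_M.gguf --embedding --port 8080
response = requests.post(
    "http://localhost:8080/v1/embeddings",
    json={"model": "nomic-embed-text", "input": ["hello world"]},
    timeout=30,
)
response.raise_for_status()
print(len(response.json()["data"][0]["embedding"]))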

goodmaney commented 2 weeks ago

Same here. I use the Python script app.py. Maybe it's about int vs. str values.

Error embedding chunk {'OpenAIEmbedding': "Error code: 422 - {'detail': [{'type': 'string_type', 'loc': ['body', 'input', 0], 'msg': 'Input should be a valid string', 'input': 3923, 'url': 'https://errors.pydantic.dev/2.7/v/string_type'}, {'type': 'string_type', 'loc': ['body', 'input', 1], 'msg': 'Input should be a valid string', 'input': 527, 'url': 'https://errors.pydantic.dev/2.7/v/string_type'}, {'type': 'string_type', 'loc': ['body', 'input', 2], 'msg': 'Input should be a valid string', 'input': 279, 'url': 'https://errors.pydantic.dev/2.7/v/string_type'} ....................... ZeroDivisionError: Weights sum to zero, can't be normalized
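
The integers in that 422 payload (3923, 527, 279, ...) look like tiktoken token IDs rather than text, which matches the diagnosis further down: the query path sends token chunks, not strings, to the local endpoint. A small sketch of decoding such IDs back to text, assuming the default cl100k_base encoding:

import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")
print(encoder.decode([3923, 527, 279]))  # prints the recovered text fragment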

goodmaney commented 2 weeks ago

The same. It seems like the base64 and strings problem. The picture was the LMStudio. image

It seems that it is really a problem with LMstudio. Don't know how to solve it, but I succeeded in xinference. hello, i use ollama occured same problem. use xinference ,the url api how to get ?

http://localhost:"ollama_or_xinference_default_port"/v1

thx , but use xinference ,it happen this error ValueError: Query vector size 768 does not match index column size 1536

I re-executed [python -m graphrag.index --init --root ./ragtest] with the xinference embedding, and it does not report that error. The "768 does not match index column size 1536" error is reported when I build the index with the py script and then query via xinference. But local search returns me nothing 😂, with no error.

unxd9c commented 2 weeks ago

The issue is that --method local does not work out of the box with open-source embedding models. This is because of how OpenAI's text-embedding-3-small model works: it accepts token IDs as input, while open-source models like nomic-embed-text expect text as input. So you need to convert the token IDs back to text before calling an open-source model.

The solution is to add one line to the package's graphrag/query/llm/oai/embedding.py "embed" function:

...
def embed(self, text: str, **kwargs: Any) -> list[float]:
        """
        Embed text using OpenAI Embedding's sync function.

        For text longer than max_tokens, chunk texts into max_tokens, embed each chunk, then combine using weighted average.
        Please refer to: https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
        """
        token_chunks = chunk_text(
            text=text, token_encoder=self.token_encoder, max_tokens=self.max_tokens
        )
        chunk_embeddings = []
        chunk_lens = []
        for chunk in token_chunks:
            # decode chunk from token ids to text (added line after row 83)
            chunk = self.token_encoder.decode(chunk)
            try:
                embedding, chunk_len = self._embed_with_retry(chunk, **kwargs)
                chunk_embeddings.append(embedding)
                chunk_lens.append(chunk_len)
            # TODO: catch a more specific exception
            except Exception as e:  # noqa BLE001
                self._reporter.error(
                    message="Error embedding chunk",
                    details={self.__class__.__name__: str(e)},
                )

                continue
        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
        chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)
        return chunk_embeddings.tolist()
...
goodmaney commented 2 weeks ago

The same. It seems like the base64 and strings problem. The picture was the LMStudio. image

It seems that it is really a problem with LMstudio. Don't know how to solve it, but I succeeded in xinference.

Did local search return anything for you? It returns nothing for me and reports no error.

Atarasin commented 2 weeks ago

Issue is that --method local does not work out of the box with open source embedding models. It is because of the way how OpenAI's text-embedding-3-small model is working. It is using token IDs as input, while open source models like nomic-embed-text are working with text as input. So you need to convert token IDs to text before using open source models.

Solution is to add one line to package's graphrag/query/llm/oai/embedding.py "embed" function :

...
def embed(self, text: str, **kwargs: Any) -> list[float]:
        """
        Embed text using OpenAI Embedding's sync function.

        For text longer than max_tokens, chunk texts into max_tokens, embed each chunk, then combine using weighted average.
        Please refer to: https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
        """
        token_chunks = chunk_text(
            text=text, token_encoder=self.token_encoder, max_tokens=self.max_tokens
        )
        chunk_embeddings = []
        chunk_lens = []
        for chunk in token_chunks:
            # decode chunk from token ids to text (added line after row 83)
            chunk = self.token_encoder.decode(chunk)
            try:
                embedding, chunk_len = self._embed_with_retry(chunk, **kwargs)
                chunk_embeddings.append(embedding)
                chunk_lens.append(chunk_len)
            # TODO: catch a more specific exception
            except Exception as e:  # noqa BLE001
                self._reporter.error(
                    message="Error embedding chunk",
                    details={self.__class__.__name__: str(e)},
                )

                continue
        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
        chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)
        return chunk_embeddings.tolist()
...

I can use local search this way now, thank you so much.

karthik-codex commented 2 weeks ago

Issue is that --method local does not work out of the box with open source embedding models. It is because of the way how OpenAI's text-embedding-3-small model is working. It is using token IDs as input, while open source models like nomic-embed-text are working with text as input. So you need to convert token IDs to text before using open source models.

Solution is to add one line to package's graphrag/query/llm/oai/embedding.py "embed" function :

...
def embed(self, text: str, **kwargs: Any) -> list[float]:
        """
        Embed text using OpenAI Embedding's sync function.

        For text longer than max_tokens, chunk texts into max_tokens, embed each chunk, then combine using weighted average.
        Please refer to: https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
        """
        token_chunks = chunk_text(
            text=text, token_encoder=self.token_encoder, max_tokens=self.max_tokens
        )
        chunk_embeddings = []
        chunk_lens = []
        for chunk in token_chunks:
            # decode chunk from token ids to text (added line after row 83)
            chunk = self.token_encoder.decode(chunk)
            try:
                embedding, chunk_len = self._embed_with_retry(chunk, **kwargs)
                chunk_embeddings.append(embedding)
                chunk_lens.append(chunk_len)
            # TODO: catch a more specific exception
            except Exception as e:  # noqa BLE001
                self._reporter.error(
                    message="Error embedding chunk",
                    details={self.__class__.__name__: str(e)},
                )

                continue
        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
        chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)
        return chunk_embeddings.tolist()
...

Could you show how you modified the "def _embed_with_retry" function in embedding.py?

I got the embedding to work but later got an error that says "Error: Query vector size 768 does not match index column size 3072". 768 is the length of my embedding vector for the provided query. Not sure what 3072 means. I use nomic-embed-text from Ollama.

unxd9c commented 2 weeks ago

Could you show how you modified the "def _embed_with_retry" function in the embedding.py?

I got the embedding to work but later got an error that says "Error: Query vector size 768 does not match index column size 3072". 768 is the length of my embedding vector for the provided query. Not sure what 3072 means. I use nomic-embed-text from Ollama.

I had a similar-looking problem when I tried to use different models for Index (text-embedding-3-small, which generates 1536-dimensional vectors) and Search (nomic-embed-text, which generates 768-dimensional vectors). 3072 looks like text-embedding-3-large.

BTW, my def _embed_with_retry is untouched. And I have not yet found a way to make Ollama's embedding models work for Search (I use LM Studio).

    def _embed_with_retry(
        self, text: str | tuple, **kwargs: Any
    ) -> tuple[list[float], int]:
        try:
            retryer = Retrying(
                stop=stop_after_attempt(self.max_retries),
                wait=wait_exponential_jitter(max=10),
                reraise=True,
                retry=retry_if_exception_type(self.retry_error_types),
            )
            for attempt in retryer:
                with attempt:
                    embedding = (
                        self.sync_client.embeddings.create(  # type: ignore
                            input=text,
                            model=self.model,
                            **kwargs,  # type: ignore
                        )
                        .data[0]
                        .embedding
                        or []
                    )
                    return (embedding, len(text))
        except RetryError as e:
            self._reporter.error(
                message="Error at embed_with_retry()",
                details={self.__class__.__name__: str(e)},
            )
            return ([], 0)
        else:
            # TODO: why not just throw in this case?
            return ([], 0)
1997shp commented 1 week ago

The issue is that --method local does not work out of the box with open-source embedding models. This is because of how OpenAI's text-embedding-3-small model works: it uses token IDs as input, while open-source models like nomic-embed-text take text as input. So you need to convert the token IDs to text before using an open-source model.

The solution is to add one line to the package's graphrag/query/llm/oai/embedding.py "embed" function:

...
def embed(self, text: str, **kwargs: Any) -> list[float]:
        """
        Embed text using OpenAI Embedding's sync function.

        For text longer than max_tokens, chunk texts into max_tokens, embed each chunk, then combine using weighted average.
        Please refer to: https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
        """
        token_chunks = chunk_text(
            text=text, token_encoder=self.token_encoder, max_tokens=self.max_tokens
        )
        chunk_embeddings = []
        chunk_lens = []
        for chunk in token_chunks:
            # decode chunk from token ids to text (added line after row 83)
            chunk = self.token_encoder.decode(chunk)
            try:
                embedding, chunk_len = self._embed_with_retry(chunk, **kwargs)
                chunk_embeddings.append(embedding)
                chunk_lens.append(chunk_len)
            # TODO: catch a more specific exception
            except Exception as e:  # noqa BLE001
                self._reporter.error(
                    message="Error embedding chunk",
                    details={self.__class__.__name__: str(e)},
                )

                continue
        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
        chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)
        return chunk_embeddings.tolist()
...

I can use local search this way too, thank you so much.

karthik-codex commented 1 week ago

I fixed this as well. You can find my repo for local indexing and search here: https://medium.com/@karthik.codex/microsofts-graphrag-autogen-ollama-chainlit-fully-local-free-multi-agent-rag-superbot-61ad3759f06f https://github.com/karthik-codex/autogen_graphRAG

sdjd93dj commented 1 week ago

@karthik-codex

I fixed this as well. you can find my repo to do local indexing and search here. https://medium.com/@karthik.codex/microsofts-graphrag-autogen-ollama-chainlit-fully-local-free-multi-agent-rag-superbot-61ad3759f06f https://github.com/karthik-codex/autogen_graphRAG

Apologies for going off-topic, but seeing as you've successfully attempted global search, did you have to make any hotfixes for that? Or was it all smooth sailing?

I ran into this JSON issue (#575), which has this fix (#609) and this fix (#473).

Perhaps there's no answer, but I'm a bit curious as to why you might not have run into the issue, unless that simply isn't discussed in the blog.

karthik-codex commented 1 week ago

@karthik-codex

I fixed this as well. you can find my repo to do local indexing and search here. https://medium.com/@karthik.codex/microsofts-graphrag-autogen-ollama-chainlit-fully-local-free-multi-agent-rag-superbot-61ad3759f06f https://github.com/karthik-codex/autogen_graphRAG

Apologies for going off-topic, but seeing as you've successfully attempted global search, did you have to make any hotfixes for that? Or was it all smooth sailing?

I ran into this JSON issue, which has this fix and this fix.

Perhaps there's no answer, but I'm a bit curious as to why you might not have run into the issue, unless that simply isn't discussed in the blog.

No, I did not get any of these issues. I also used Mistral instead of Llama, which a YouTuber suggested for its longer context window.

sdjd93dj commented 1 week ago

Interesting, Mistral did not fix my problem, but I'll try again with your repo.


adirsingh96 commented 1 week ago

Issue is that --method local does not work out of the box with open source embedding models. It is because of the way how OpenAI's text-embedding-3-small model is working. It is using token IDs as input, while open source models like nomic-embed-text are working with text as input. So you need to convert token IDs to text before using open source models.

Solution is to add one line to package's graphrag/query/llm/oai/embedding.py "embed" function :

...
def embed(self, text: str, **kwargs: Any) -> list[float]:
        """
        Embed text using OpenAI Embedding's sync function.

        For text longer than max_tokens, chunk texts into max_tokens, embed each chunk, then combine using weighted average.
        Please refer to: https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
        """
        token_chunks = chunk_text(
            text=text, token_encoder=self.token_encoder, max_tokens=self.max_tokens
        )
        chunk_embeddings = []
        chunk_lens = []
        for chunk in token_chunks:
            # decode chunk from token ids to text (added line after row 83)
            chunk = self.token_encoder.decode(chunk)
            try:
                embedding, chunk_len = self._embed_with_retry(chunk, **kwargs)
                chunk_embeddings.append(embedding)
                chunk_lens.append(chunk_len)
            # TODO: catch a more specific exception
            except Exception as e:  # noqa BLE001
                self._reporter.error(
                    message="Error embedding chunk",
                    details={self.__class__.__name__: str(e)},
                )

                continue
        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
        chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)
        return chunk_embeddings.tolist()
...

This is working, but it is giving completely out-of-context answers.

natoverse commented 6 days ago

Consolidating alternate model issues here: #657

yurochang commented 6 days ago

The issue is that --method local does not work out of the box with open-source embedding models. This is because of how OpenAI's text-embedding-3-small model works: it uses token IDs as input, while open-source models like nomic-embed-text take text as input. So you need to convert the token IDs to text before using an open-source model.

The solution is to add one line to the package's graphrag/query/llm/oai/embedding.py "embed" function:

...
def embed(self, text: str, **kwargs: Any) -> list[float]:
        """
        Embed text using OpenAI Embedding's sync function.

        For text longer than max_tokens, chunk texts into max_tokens, embed each chunk, then combine using weighted average.
        Please refer to: https://github.com/openai/openai-cookbook/blob/main/examples/Embedding_long_inputs.ipynb
        """
        token_chunks = chunk_text(
            text=text, token_encoder=self.token_encoder, max_tokens=self.max_tokens
        )
        chunk_embeddings = []
        chunk_lens = []
        for chunk in token_chunks:
            # decode chunk from token ids to text (added line after row 83)
            chunk = self.token_encoder.decode(chunk)
            try:
                embedding, chunk_len = self._embed_with_retry(chunk, **kwargs)
                chunk_embeddings.append(embedding)
                chunk_lens.append(chunk_len)
            # TODO: catch a more specific exception
            except Exception as e:  # noqa BLE001
                self._reporter.error(
                    message="Error embedding chunk",
                    details={self.__class__.__name__: str(e)},
                )

                continue
        chunk_embeddings = np.average(chunk_embeddings, axis=0, weights=chunk_lens)
        chunk_embeddings = chunk_embeddings / np.linalg.norm(chunk_embeddings)
        return chunk_embeddings.tolist()
...

cool, it solved the problem!

adirsingh96 commented 6 days ago

Hey, are you getting relevant answers?

lyyf2002 commented 4 days ago

I fixed it and created PR #568. Hope it will be merged soon.