langchain-ai / langchain

šŸ¦œšŸ”— Build context-aware reasoning applications
https://python.langchain.com
MIT License

Neo4jVector doesn't work well with HuggingFaceEmbeddings when reusing the graph #24401

Closed SeeleZaych closed 1 month ago

SeeleZaych commented 1 month ago

Checked other resources

  • [x] I added a very descriptive title to this issue.
  • [x] I searched the LangChain documentation with the integrated search.
  • [x] I used the GitHub search to find a similar question and didn't find it.
  • [x] I am sure that this is a bug in LangChain rather than my code.
  • [x] The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_community.vectorstores import Neo4jVector
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
self.existing_graph_parts = Neo4jVector.from_existing_graph(
    embedding=embeddings,
    url=uri,
    username=username,
    password=password,
    node_label="part",
    text_node_properties=["name"],
    embedding_node_property="embedding",
)

Error Message and Stack Trace (if applicable)

Traceback (most recent call last):
  File "D:\graph_rag.py", line 133, in <module>
    graph_rag = GraphRag()
                ^^^^^^^^^^
  File "D:\graph_rag.py", line 44, in __init__
    self.existing_graph_parts = Neo4jVector.from_existing_graph(
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\syh\AppData\Local\Programs\Python\Python312\Lib\site-packages\langchain_community\vectorstores\neo4j_vector.py", line 1431, in from_existing_graph
    text_embeddings = embedding.embed_documents([el["text"] for el in data])
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\syh\AppData\Local\Programs\Python\Python312\Lib\site-packages\langchain_huggingface\embeddings\huggingface.py", line 87, in embed_documents
    embeddings = self.client.encode(
                 ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\syh\AppData\Local\Programs\Python\Python312\Lib\site-packages\sentence_transformers\SentenceTransformer.py", line 565, in encode
    if all_embeddings[0].dtype == torch.bfloat16:
       ~~~~~~~~~~~~~~^^^
IndexError: list index out of range

Description

Sorry for my poor English!

The first time I ran the code, it worked well.

But when I reran it, it failed with the error above.

I think the error occurs because every node already has its embedding, so the problem is in the library code below (file: langchain_community\vectorstores\neo4j_vector.py, from line 1421):

        while True:
            fetch_query = (
                f"MATCH (n:`{node_label}`) "
                f"WHERE n.{embedding_node_property} IS null "
                "AND any(k in $props WHERE n[k] IS NOT null) "
                f"RETURN elementId(n) AS id, reduce(str='',"
                "k IN $props | str + '\\n' + k + ':' + coalesce(n[k], '')) AS text "
                "LIMIT 1000"
            )
            data = store.query(fetch_query, params={"props": text_node_properties})
            text_embeddings = embedding.embed_documents([el["text"] for el in data])

This code fetches nodes that do not yet have the embedding_node_property set. Since all nodes in my Neo4j database already have embeddings, data is an empty list. The code below then indexes element 0 of that empty list (file: sentence_transformers\SentenceTransformer.py, from line 563):

        elif convert_to_numpy:
            if not isinstance(all_embeddings, np.ndarray):
                if all_embeddings[0].dtype == torch.bfloat16:
                    all_embeddings = np.asarray([emb.float().numpy() for emb in all_embeddings])
                else:
                    all_embeddings = np.asarray([emb.numpy() for emb in all_embeddings])

That's where the error happened.
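The failure can be reproduced in isolation with plain lists, no torch or sentence-transformers needed. This is just a sketch that mimics the shape of SentenceTransformer.encode, which inspects all_embeddings[0] without first checking whether the input batch is empty; the function names here are illustrative, not real library APIs:

```python
# Stand-in for SentenceTransformer.encode: builds one fake 3-dim vector
# per input text, then inspects the first result unconditionally.
def encode_unguarded(texts):
    all_embeddings = [[0.0, 0.0, 0.0] for _ in texts]
    _first = all_embeddings[0]  # mirrors `all_embeddings[0].dtype` -> IndexError on []
    return all_embeddings

def encode_guarded(texts):
    if not texts:  # short-circuit: an empty batch embeds to an empty list
        return []
    return encode_unguarded(texts)

print(encode_guarded(["hello"]))  # → [[0.0, 0.0, 0.0]]
print(encode_guarded([]))         # → []

try:
    encode_unguarded([])
except IndexError as exc:
    print(f"IndexError: {exc}")   # → IndexError: list index out of range
```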

I already got an answer from the bot, but I still think this is a bug that needs to be fixed!

Thanks!

System Info

langchain==0.2.6 langchain-community==0.2.6 langchain-core==0.2.10 langchain-huggingface==0.0.3 langchain-openai==0.1.10 langchain-text-splitters==0.2.2

windows 11 python3.12

SeeleZaych commented 1 month ago

When I use OpenAIEmbeddings, this bug does not happen.
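This suggests the difference is only whether the underlying client tolerates an empty batch. Until it is fixed upstream, a caller-side wrapper can normalize the behavior for any embedder without patching site-packages. This is a sketch under that assumption; EmptyBatchSafeEmbeddings and CrashyEmbeddings are hypothetical names, not LangChain APIs:

```python
class EmptyBatchSafeEmbeddings:
    """Wrap any LangChain-style embeddings object so an empty batch
    returns [] instead of reaching the underlying client."""

    def __init__(self, inner):
        self.inner = inner

    def embed_documents(self, texts):
        # guard: some clients (e.g. sentence-transformers) index into
        # their results and crash on an empty input batch
        return self.inner.embed_documents(texts) if texts else []

    def embed_query(self, text):
        return self.inner.embed_query(text)

class CrashyEmbeddings:
    """Stand-in for an embedder that cannot handle empty batches."""

    def embed_documents(self, texts):
        results = [[float(len(t))] for t in texts]
        _ = results[0]  # IndexError on an empty batch
        return results

    def embed_query(self, text):
        return [float(len(text))]

safe = EmptyBatchSafeEmbeddings(CrashyEmbeddings())
print(safe.embed_documents([]))        # → []
print(safe.embed_documents(["part"]))  # → [[4.0]]
```

Passing the wrapped object as embedding= to Neo4jVector.from_existing_graph should then survive a rerun where every node is already embedded, since the empty fetch result never reaches the client.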

owaist37 commented 1 month ago

Same issue when using BedrockEmbeddings.

supreme-core commented 1 month ago

I believe the embeddings differ in structure.

gengpeip commented 1 month ago

I'm encountering a similar issue. Could you please share how you managed to resolve it?


SeeleZaych commented 1 month ago


I added a guard in the file "C:\Users\MyUserName\AppData\Local\Programs\Python\Python312\Lib\site-packages\langchain_community\vectorstores\neo4j_vector.py" at about line 1431:

data = store.query(fetch_query, params={"props": text_node_properties})
if len(data) == 0:
    break
text_embeddings = embedding.embed_documents([el["text"] for el in data])

I just added the if len(data) == 0 check, but I don't know whether it will have side effects.

Hope this helps you!
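For context on where the guard sits, here is a self-contained sketch of the same batching control flow with an in-memory stand-in for the Neo4j store. FakeStore and fetch_batch are illustrative names only; the real code builds each batch with the Cypher fetch_query shown above and exits when a batch comes back smaller than the limit, which never happens when the very first batch is empty:

```python
class FakeStore:
    """In-memory stand-in for the Neo4j store: hands out texts that
    do not yet have embeddings, in batches of up to `limit`."""

    def __init__(self, texts):
        self.pending = list(texts)

    def fetch_batch(self, limit=1000):
        batch, self.pending = self.pending[:limit], self.pending[limit:]
        return [{"text": t} for t in batch]

def embed_all(store, embed_documents, limit=2):
    """Mirror of the from_existing_graph batching loop, with the guard."""
    total = 0
    while True:
        data = store.fetch_batch(limit)
        if len(data) == 0:  # the proposed guard: stop on an empty batch
            break
        embeddings = embed_documents([el["text"] for el in data])
        total += len(embeddings)
        if len(data) < limit:  # last partial batch: mirrors the original exit
            break
    return total

fake_embed = lambda texts: [[0.0] for _ in texts]
print(embed_all(FakeStore(["a", "b", "c"]), fake_embed))  # → 3
print(embed_all(FakeStore([]), fake_embed))               # → 0 (no crash)
```

Note that without the guard, a store whose pending list is empty (the rerun case) or an exact multiple of the limit would hand an empty batch straight to embed_documents, which is exactly where sentence-transformers crashes.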

YuffieHuang commented 1 month ago


SeeleZaych's fix works for me. No side effects so far.

Astroa7m commented 4 weeks ago

@SeeleZaych's solution worked for me; hopefully we get a fix for this soon.