neo4j-labs / llm-graph-builder

Neo4j graph construction from unstructured data using LLMs
https://neo4j.com/labs/genai-ecosystem/llm-graph-builder/
Apache License 2.0
2.09k stars 317 forks

How to modify the ChatPromptTemplate used to make the graph extraction? #618

Closed aadnts closed 1 month ago

aadnts commented 1 month ago

This prompt template is somehow sent to OpenAI, but there is no trace of it in the codebase, which makes it impossible to modify the system and human messages.

The only place where it appears is inside the .ipynb files of the experiments folder, but the get_extraction_chain function is also not imported into the code.

I want to make the prompt more specific, because the results I am getting at the moment are unsatisfactory.

I found that earlier in the project's history this prompt was available for anyone to modify. Why change that? link

Could anybody help me out?

Knowledge Graph Instructions for GPT-4

          ## 1. Overview
          You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.
          Try to capture as much information from the text as possible without sacrificing accuracy. Do not add any information that is not explicitly mentioned in the text.
          - **Nodes** represent entities and concepts.
          - The aim is to achieve simplicity and clarity in the knowledge graph, making it
          accessible for a vast audience.
          ## 2. Labeling Nodes
          - **Consistency**: Ensure you use available types for node labels.
          Ensure you use basic or elementary types for node labels.
          - For example, when you identify an entity representing a person, always label it as **'person'**. Avoid using more specific terms like 'mathematician' or 'scientist'.- **Node IDs**: Never utilize integers as node IDs. Node IDs should be names or human-readable identifiers found in the text.
          - **Relationships** represent connections between entities or concepts.
          Ensure consistency and generality in relationship types when constructing knowledge graphs. Instead of using specific and momentary types such as 'BECAME_PROFESSOR', use more general and timeless relationship types like 'PROFESSOR'. Make sure to use general and timeless relationship types!
          ## 3. Coreference Resolution
          - **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.
          If an entity, such as "John Doe", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., "Joe", "he"),always use the most complete identifier for that entity throughout the knowledge graph. In this example, use "John Doe" as the entity ID.
          Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial.
          ## 4. Strict Compliance
          Adhere to the rules strictly. Non-compliance will result in termination.
aadnts commented 1 month ago

@praveshkumar1988 @jexp @kartikpersistent

aadnts commented 1 month ago

It seems you guys have uploaded these prompt templates to a GCS bucket, but why?

Is there any documentation you can provide on how to set up our own GCS bucket so that it works seamlessly with the Graph Builder?

BUCKET_UPLOAD = 'llm-graph-builder-upload'
BUCKET_FAILED_FILE = 'llm-graph-builder-failed'
PROJECT_ID = 'llm-experiments-387609'

kartikpersistent commented 1 month ago

Hi @aadnts, please make sure the GCS_FILE_CACHE env variable is set to false if you want to process files locally.

aadnts commented 1 month ago

Hi @kartikpersistent, thank you for your answer!

I really appreciate you taking these few seconds to give me an answer....

I don't want to process files locally, I want to use chat-gpt-4o-mini (which I've already added as an option).

What I want is to change the ChatPromptTemplate so that it includes relationship constraints, which will definitely help me get drastically better results.

However, you guys decided that the prompt used to extract the graph should not be in the codebase, right? To prevent others from modifying it? Why?

Never mind, that's OK with me, but could you at least give me a hint on how to set up my own GCS bucket with my prompts and stuff? Otherwise, this is pretty limiting....

@praveshkumar1988 @jexp @nielsdejong

kartikpersistent commented 1 month ago

Hi @aadnts, update these values with your own GCS settings:

BUCKET_UPLOAD = "Your bucket name for storing the uploaded files"
BUCKET_FAILED_FILE = "Your bucket name for storing the failed files"
PROJECT_ID = "Your project ID"

Ex:

BUCKET_UPLOAD = 'llm-graph-builder-upload'
BUCKET_FAILED_FILE = 'llm-graph-builder-failed'
PROJECT_ID = 'llm-experiments-387609'

Then install the gcloud CLI and run the gcloud auth application-default login command.

After that, you are good to go to test the app with your own GCS configuration.
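
For illustration only, here is a minimal sketch of how these settings might be collected and checked in one place. Only GCS_FILE_CACHE is described in this thread as an env variable; reading BUCKET_UPLOAD, BUCKET_FAILED_FILE and PROJECT_ID from the environment is an assumption of this sketch, and `GcsSettings` and `load_gcs_settings` are hypothetical names that are not part of the Graph Builder codebase.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class GcsSettings:
    """Hypothetical container for the GCS-related values discussed above."""
    use_gcs_cache: bool
    bucket_upload: str
    bucket_failed_file: str
    project_id: str


def load_gcs_settings(env: Optional[Mapping[str, str]] = None) -> GcsSettings:
    """Read the settings from `env` (defaults to os.environ).

    Assumption: the bucket names and project ID are supplied as environment
    variables. The thread shows them as plain assignments instead.
    """
    env = os.environ if env is None else env
    # Treat anything other than "true" (case-insensitive) as local processing.
    use_gcs_cache = env.get("GCS_FILE_CACHE", "false").strip().lower() == "true"
    settings = GcsSettings(
        use_gcs_cache=use_gcs_cache,
        bucket_upload=env.get("BUCKET_UPLOAD", ""),
        bucket_failed_file=env.get("BUCKET_FAILED_FILE", ""),
        project_id=env.get("PROJECT_ID", ""),
    )
    if use_gcs_cache:
        # The bucket names and project ID only matter when GCS caching is on.
        required = {
            "BUCKET_UPLOAD": settings.bucket_upload,
            "BUCKET_FAILED_FILE": settings.bucket_failed_file,
            "PROJECT_ID": settings.project_id,
        }
        missing = [name for name, value in required.items() if not value]
        if missing:
            raise ValueError("Missing GCS settings: " + ", ".join(missing))
    return settings
```

With GCS_FILE_CACHE left at false, the bucket values are never checked, which matches the earlier advice for processing files locally.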

aadnts commented 1 month ago

Hi @kartikpersistent, thank you for your answer!

It seems I haven't expressed myself correctly, as you haven't addressed the central topic of my issue.

I only have one single question: how can I change the prompt template that is sent to the LLM in order to extract the graph?

Please find below the prompt I am referring to, which is sent to ChatGPT every time I generate a graph from my local files (and which I would like to modify):

      ## 1. Overview
      You are a top-tier algorithm designed for extracting information in structured formats to build a knowledge graph.
      Try to capture as much information from the text as possible without sacrificing accuracy. Do not add any information that is not explicitly mentioned in the text.
      - **Nodes** represent entities and concepts.
      - The aim is to achieve simplicity and clarity in the knowledge graph, making it
      accessible for a vast audience.
      ## 2. Labeling Nodes
      - **Consistency**: Ensure you use available types for node labels.
      Ensure you use basic or elementary types for node labels.
      - For example, when you identify an entity representing a person, always label it as **'person'**. Avoid using more specific terms like 'mathematician' or 'scientist'.- **Node IDs**: Never utilize integers as node IDs. Node IDs should be names or human-readable identifiers found in the text.
      - **Relationships** represent connections between entities or concepts.
      Ensure consistency and generality in relationship types when constructing knowledge graphs. Instead of using specific and momentary types such as 'BECAME_PROFESSOR', use more general and timeless relationship types like 'PROFESSOR'. Make sure to use general and timeless relationship types!
      ## 3. Coreference Resolution
      - **Maintain Entity Consistency**: When extracting entities, it's vital to ensure consistency.
      If an entity, such as "John Doe", is mentioned multiple times in the text but is referred to by different names or pronouns (e.g., "Joe", "he"),always use the most complete identifier for that entity throughout the knowledge graph. In this example, use "John Doe" as the entity ID.
      Remember, the knowledge graph should be coherent and easily understandable, so maintaining consistency in entity references is crucial.
      ## 4. Strict Compliance
      Adhere to the rules strictly. Non-compliance will result in termination.

@praveshkumar1988 @jexp @nielsdejong @karanchellani

kartikpersistent commented 1 month ago

It comes from the neo4j library, so we can't change that.

kartikpersistent commented 1 month ago

@aashipandya can answer this query in detail

aadnts commented 1 month ago

Thank you for your answer @kartikpersistent

It used to be in the codebase (link). Why remove it, though?

Is there really no way around that? @aashipandya

This is a critical component of the entire pipeline. For instance, I'm planning to add a relationship constraints builder (cf. ontology constraints) to the graph enhancement feature, which would let users define which entities can be the subject or object of a specific relationship.

This would drastically increase the quality of the graph generated by the LLM, as per my prior testing.

Without the possibility to alter the prompt, this LLM Graph Builder solution loses so much of its potential, especially considering it aims to be open source, right?

Please help me out.

@praveshkumar1988 @jexp @nielsdejong @karanchellani

aashipandya commented 1 month ago

Hi @aadnts, we are using LLMGraphTransformer from LangChain to get nodes and relationships. The prompt you are talking about is defined there. If you want to supply your own prompt, you can try passing it to the transformer. Ex: LLMGraphTransformer(llm=llm, prompt=your_prompt)

Also, the prompt has not been moved to any bucket. A file uploaded from your local machine is saved to the GCS bucket if the GCS_FILE_CACHE env variable is set to true; otherwise files are saved in memory.
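
As a rough sketch of that suggestion, the snippet below builds a system prompt that lists relationship constraints and passes it through the `prompt` argument. The constraint format, the shortened base instructions, and the names `build_system_prompt` and `make_transformer` are all hypothetical. It also assumes the transformer fills an `input` variable with the document text, as the library's default prompt appears to do, so check that against your installed version of langchain_experimental.

```python
from typing import Iterable, Tuple

# Hypothetical constraint format: (subject label, relationship type, object label).
Constraint = Tuple[str, str, str]

# Shortened stand-in for the instructions quoted earlier in this thread.
BASE_INSTRUCTIONS = (
    "You are a top-tier algorithm designed for extracting information in "
    "structured formats to build a knowledge graph.\n"
    "Do not add any information that is not explicitly mentioned in the text."
)


def build_system_prompt(constraints: Iterable[Constraint]) -> str:
    """Return the base instructions plus one line per allowed triple pattern."""
    lines = [
        BASE_INSTRUCTIONS,
        "## Relationship Constraints",
        "Only extract relationships that match one of these patterns:",
    ]
    for subject, rel, obj in constraints:
        # No curly braces here: they would be read as template variables.
        lines.append(f"- ({subject})-[{rel}]->({obj})")
    return "\n".join(lines)


def make_transformer(llm, constraints: Iterable[Constraint]):
    """Wire the custom prompt into LLMGraphTransformer.

    Imports are done lazily so that build_system_prompt can be used and
    tested without LangChain installed.
    """
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_experimental.graph_transformers import LLMGraphTransformer

    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", build_system_prompt(constraints)),
            # Assumption: the document text arrives in the {input} variable.
            ("human", "Extract the graph from the following input: {input}"),
        ]
    )
    return LLMGraphTransformer(llm=llm, prompt=prompt)
```

For example, `make_transformer(llm, [("Person", "WORKS_AT", "Organization")])` would return a transformer whose system message ends with the single allowed pattern, where `llm` is whatever chat model you already pass to the Graph Builder.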

aadnts commented 1 month ago

Dear @aashipandya!

Thank you so much for your response, this makes it all so much clearer.

I am really grateful for this reply, you have made my day!

Amazing props to you and the team 💯