ERROR ### Here is the buggy response - easily resolved

jonmach commented 6 months ago

I haven't created a PR for this as I'm not using OpenAI or Ollama and have modified multiple things.

Your system prompt is currently:

    SYS_PROMPT = (
        "You are a network graph maker who extracts terms and their relations from a given context. "
        "You are provided with a context chunk (delimited by ```) Your task is to extract the ontology "
        "of terms mentioned in the given context. These terms should represent the key concepts as per the context. \n"
        "Thought 1: While traversing through each sentence, Think about the key terms mentioned in it.\n"
            "\tTerms may include object, entity, location, organization, person, \n"
            "\tcondition, acronym, documents, service, concept, etc.\n"
            "\tTerms should be as atomistic as possible\n\n"
        "Thought 2: Think about how these terms can have one on one relation with other terms.\n"
            "\tTerms that are mentioned in the same sentence or the same paragraph are typically related to each other.\n"
            "\tTerms can be related to many other terms\n\n"
        "Thought 3: Find out the relation between each such related pair of terms. \n\n"
        "Format your output as a list of json. Each element of the list contains a pair of terms"
        "and the relation between them, like the follwing: \n"
        "[\n"
        "   {\n"
        '       "node_1": "A concept from extracted ontology",\n'
        '       "node_2": "A related concept from extracted ontology",\n'
        '       "edge": "relationship between the two concepts, node_1 and node_2 in one or two sentences"\n'
        "   }, {...}\n"
        "]"
    )

The trailing , {...}\n" seems to cause malformed JSON output every now and again: some of the results literally end with a trailing ,{...}. This seems to be due to hallucination and poor understanding by the LLM, and open-source models are probably more prone to it.

This is easy to resolve by simply removing that trailing ,{...}. I've changed the wording slightly and I'm no longer getting any "ERROR ### Here is the buggy response:" errors.
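
For anyone who wants to see the change concretely, here is a minimal sketch. It is not the exact wording I ended up with; FORMAT_EXAMPLE and parse_edges are hypothetical names used only for illustration. The first part is the same example with the placeholder dropped. The second part is a defensive parser that strips a copied placeholder before calling json.loads, in case a model still emits one.

```python
import json
import re

# Hypothetical replacement for the tail of SYS_PROMPT:
# the same example object, without the ", {...}" placeholder.
FORMAT_EXAMPLE = (
    "[\n"
    "   {\n"
    '       "node_1": "A concept from extracted ontology",\n'
    '       "node_2": "A related concept from extracted ontology",\n'
    '       "edge": "relationship between the two concepts, node_1 and node_2 in one or two sentences"\n'
    "   }\n"
    "]"
)

# Matches a literal "{...}" (or "{…}") placeholder, with an optional leading comma.
_PLACEHOLDER = re.compile(r",?\s*\{\s*(?:\.\.\.|…)\s*\}")


def parse_edges(raw: str) -> list:
    """Strip copied placeholders from the model output, then parse it as JSON."""
    cleaned = _PLACEHOLDER.sub("", raw)
    return json.loads(cleaned)


# A response that copied the placeholder from the old prompt.
buggy = '[\n   {"node_1": "a", "node_2": "b", "edge": "a relates to b"}, {...}\n]'
edges = parse_edges(buggy)
print(len(edges), edges[0]["edge"])
```

The regex would also remove a literal {...} that appears inside an edge description, which seems like an acceptable trade-off for a sketch but is worth knowing about.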

thejosess commented 6 months ago

I got the same message using Ollama with a long text (the Quixote book as a .txt file).

roomals commented 4 months ago

I dunno if this is useful, but I found a workaround for the whole JSON issue. It requires a nested function and having the AI output the text formatted as JSON, although in reality it's markdown:


    import json

    import nltk

    nltk.download('punkt')

    # Batch sentences into chunks of fewer than 1024 characters
    def nest_sentences(document):
        nested = []
        sent = []
        length = 0
        for sentence in nltk.sent_tokenize(document):
            length += len(sentence)
            if length < 1024:
                sent.append(sentence)
            else:
                # Flush the current batch (if any) and start a new one
                if sent:
                    nested.append(" ".join(sent))
                sent = [sentence]
                length = len(sentence)
        if sent:
            nested.append(" ".join(sent))
        return nested

    def extractConcepts(prompt: str, model='mistral:latest'):
        SYS_PROMPT = (
            "Your task is to extract the key entities mentioned in the users input.\n"
            "Entities may include - event, concept, person, place, object, document, organisation, artifact, misc, etc.\n"
            "Format your output as a list of json with the following structure.\n"
            "[{\n"
            "   \"entity\": The Entity string\n"
            "   \"importance\": How important is the entity given the context on a scale of 1 to 5, 5 being the highest.\n"
            "   \"type\": Type of entity\n"
            "}, { }]"
        )

        # `client` is assumed to be the Ollama client helper, imported elsewhere
        response, context = client.generate(model_name=model, system=SYS_PROMPT, prompt=prompt)

        # The model returns text, so parse it as JSON before checking its shape
        if isinstance(response, str):
            try:
                response = json.loads(response)
            except json.JSONDecodeError:
                return ""

        markdown_output = ""
        if isinstance(response, list) and all(isinstance(item, dict) for item in response):
            for item in response:
                # The prompt asks for "entity", "importance" and "type"; there is no "question" key
                markdown_output += (
                    f"## {item.get('entity')} ({item.get('type')})\n- importance: {item.get('importance')}\n\n"
                )

        return markdown_output

    # Process each page's content in batches and collect the markdown
    # (`pages` is assumed to be a list of loaded documents with a `page_content` attribute)
    all_questions = []
    for page in pages:
        batches = nest_sentences(page.page_content)
        for batch in batches:
            batch_questions = extractConcepts(prompt=batch)
            if batch_questions:
                # append, not extend: extending a list with a string adds one character at a time
                all_questions.append(batch_questions)
                print(batch_questions)

This is the output:

    [
      {
        "entity": "A",
        "importance": 3,
        "type": "concept"
      },
      {
        "entity": "Ockham",
        "importance": 4,
        "type": "person"
      },
      {
        "entity": "God",
        "importance": 5,
        "type": "deity"
      },
      {
        "entity": "power",
        "importance": 2,
        "type": "concept"
      },
      {
        "entity": "cognition",
        "importance": 4,
        "type": "concept"
      },
      {
        "entity": "proposition",
        "importance": 3,
        "type": "concept"
      }
    ]
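
To make the JSON-to-markdown step concrete, here is a small standalone sketch that renders a list like the one above as markdown. The function name entities_to_markdown and the exact heading layout are my own choices for illustration, not something taken from the repo; the sample uses two entries from the output above.

```python
import json


def entities_to_markdown(raw: str) -> str:
    """Render a JSON list of entity dicts as markdown, one heading per entity."""
    entities = json.loads(raw)
    lines = []
    for item in entities:
        # Each item is expected to carry the three keys requested in the prompt.
        lines.append(f"## {item['entity']} ({item['type']})")
        lines.append(f"- importance: {item['importance']}")
        lines.append("")
    return "\n".join(lines)


sample = (
    '[{"entity": "Ockham", "importance": 4, "type": "person"},'
    ' {"entity": "God", "importance": 5, "type": "deity"}]'
)
markdown = entities_to_markdown(sample)
print(markdown)
```

If the model returns something that is not valid JSON, json.loads raises json.JSONDecodeError, so in practice the call would sit inside a try/except like the one used for the batches.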

rahulnyk / knowledge_graph

ERROR ### Here is the buggy response - easily resolved #22