Open jonmach opened 6 months ago
I got the same message using Ollama and a long text (Quixote book in txt)
I dunno if this is useful, but I found a workaround for the whole JSON issue. It requires a nested function and has the AI output the text formatted as JSON, though what you actually end up with is markdown:
import json

import nltk

nltk.download('punkt')

# NOTE: `client` (the Ollama client wrapper) and `pages` (the loaded document
# pages) are not defined in this snippet; they come from the surrounding project.


# Define the nest_sentences function for batching
def nest_sentences(document):
    """Group sentences into chunks of roughly 1024 characters."""
    nested = []
    sent = []
    length = 0
    for sentence in nltk.sent_tokenize(document):
        length += len(sentence)
        if length < 1024:
            sent.append(sentence)
        else:
            if sent:  # avoid an empty chunk when a single sentence is over the limit
                nested.append(" ".join(sent))
            sent = [sentence]
            length = len(sentence)
    if sent:
        nested.append(" ".join(sent))
    return nested


def extractConcepts(prompt: str, model='mistral:latest'):
    SYS_PROMPT = (
        "Your task is to extract the key entities mentioned in the users input.\n"
        "Entities may include - event, concept, person, place, object, document, organisation, artifact, misc, etc.\n"
        "Format your output as a list of json with the following structure.\n"
        "[{\n"
        " \"entity\": The Entity string\n"
        " \"importance\": How important is the entity given the context on a scale of 1 to 5, 5 being the highest.\n"
        " \"type\": Type of entity\n"
        "}, { }]"
    )
    response, context = client.generate(model_name=model, system=SYS_PROMPT, prompt=prompt)
    # Initialize markdown_output at the start of the function
    markdown_output = ""
    # Assumption: the client may hand back raw text, so try to parse it as JSON first
    if isinstance(response, str):
        try:
            response = json.loads(response)
        except json.JSONDecodeError:
            return markdown_output
    # Check if response is in the expected list of dictionaries format
    if isinstance(response, list) and all(isinstance(item, dict) for item in response):
        for item in response:
            # The prompt asks for entity/importance/type, so there is no 'question' key to read
            markdown_output += (
                f"## {item.get('entity')} ({item.get('type')})\n- importance: {item.get('importance')}\n\n"
            )
    return markdown_output


# Process each page's content in batches and collect the markdown for each batch
all_questions = []
for page in pages:
    batches = nest_sentences(page.page_content)
    for batch in batches:
        batch_questions = extractConcepts(prompt=batch)
        if batch_questions:
            # batch_questions is one markdown string, so append it
            # (extend would split it into single characters)
            all_questions.append(batch_questions)
            print(batch_questions)
This is the output:
[
  {
    "entity": "A",
    "importance": 3,
    "type": "concept"
  },
  {
    "entity": "Ockham",
    "importance": 4,
    "type": "person"
  },
  {
    "entity": "God",
    "importance": 5,
    "type": "deity"
  },
  {
    "entity": "power",
    "importance": 2,
    "type": "concept"
  },
  {
    "entity": "cognition",
    "importance": 4,
    "type": "concept"
  },
  {
    "entity": "proposition",
    "importance": 3,
    "type": "concept"
  }
]
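If anyone wants to try the JSON-to-markdown step without a model running, here is a minimal sketch that renders output shaped like the above. `entities_to_markdown` is a name made up for illustration (it is not from the repo), and it assumes the model's reply arrives as a raw JSON string with the `entity`, `importance` and `type` keys the prompt asks for.

```python
import json


def entities_to_markdown(raw: str) -> str:
    """Render a JSON list of entity dicts as markdown; return "" on bad input."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return ""
    if not isinstance(items, list):
        return ""
    lines = []
    for item in items:
        # Skip anything that is not a dict with at least an "entity" key
        if not isinstance(item, dict) or "entity" not in item:
            continue
        lines.append(f"## {item['entity']} ({item.get('type', 'misc')})")
        lines.append(f"- importance: {item.get('importance', 'n/a')}")
        lines.append("")  # blank line between entries
    return "\n".join(lines)


# Two entries taken from the output shown above
sample = (
    '[{"entity": "Ockham", "importance": 4, "type": "person"}, '
    '{"entity": "God", "importance": 5, "type": "deity"}]'
)
print(entities_to_markdown(sample))
```

Returning an empty string on malformed input keeps the batching loop simple, since a falsy result can just be skipped.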
I haven't created a PR for this as I'm not using OpenAI or Ollama and have modified multiple things.
Your SYSTEM PROMPT is currently equal to:
The `, {...}\n"` seems to be causing poor JSON outputs every now and again, because some of the results literally have a trailing `,{...}`. This seems to be due to hallucination and poor understanding by the LLM, and is probably more likely with an open-source model. It is easy to resolve by simply removing that trailing `,{...}`. I've changed the wording slightly and am now getting no "ERROR ### Here is the buggy response:" errors.
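If you can't change the prompt, a defensive cleanup before parsing might also help. This is only a sketch under my own assumptions: `parse_entities` and the regex are hypothetical and not part of the repo, and the pattern will also remove a genuinely empty `{}` element, or a matching sequence inside a string value, so treat it as a stopgap.

```python
import json
import re

# A comma followed by an empty or "..." placeholder object, e.g. ", {...}" or ", { }"
_PLACEHOLDER = re.compile(r",\s*\{\s*(?:\.\.\.)?\s*\}")


def parse_entities(raw: str):
    """Strip echoed placeholder objects, then parse; return None if still invalid."""
    cleaned = _PLACEHOLDER.sub("", raw)
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, list) else None


# A reply where the model copied the placeholder from the prompt
buggy = '[{"entity": "A", "importance": 3, "type": "concept"}, {...}]'
print(parse_entities(buggy))
```

Removing the placeholder from the prompt is still the cleaner fix; this only guards against replies that slip through.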