stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

LLM output appears cut-off (compared to the output of the same LLM and prompt in Langchain) #765

Closed DSLituiev closed 4 months ago

DSLituiev commented 4 months ago

I am running into an output (in this case a requested JSON string) that appears to be cut off, possibly by a premature EOS. However, the same prompt returns a complete output when called from LangChain.

Here is a DSPy example:


import os

import dspy


class ExtractNumericSnippets(dspy.Signature):
    ('From a given text, extract a list of factoids with numbers verbatim from the text, such as periods of time, identifiers of studies or compounds, numbers per cohort, concentrations etc. \n'
     'for each factoid provide: {"verbatim": [include the digits as well as units and measurements if contiguous], "number": [...], "unit": [...], "measurement": [...]} \n'
     'Include what was measured in "verbatim" field for measurements if it is not interrupted in the source text by other identifiers or digits, e.g.: "30 days after treatment", "40 mg of aspirin" etc. \n'
     'If multiple values are provided in a row, e.g.: "exposure at 1, 10, and 30 min", keep the phrase with the measurement, unit, and the sequence of numbers together and in "number" field provide a list of numbers\n'
     '''Measurement should be one of the following:
      - count of [entities such as animals, groups, apples etc]
      - ratio of [solution of compound X, Y, and Z; etc] -- extract numbers as a string in such case, e.g. '10:20:30'
      - percentage of [incidence of disease A, animals with certain symptom etc]
      - physical entity like duration, start time of [experiment name], end time of [experiment name] etc -- this should meaningfully match respective units
      - identifier of [name of the identified study, compound, spaceship etc] -- extract a string in such case; do include non-numeric symbols into the "number" field, e.g. "ABC-12456"
    In both "measurement" and "verbatim" field include indication of what kind of number (count, ratio, identifier etc) it is, e.g., "count of gold fish in the Milky pond" or "ratio of solution of honey in whiskey" etc.''')

    draft_passage = dspy.InputField()
    answer = dspy.OutputField(desc='A JSON list of key-value pairs')

turbo = dspy.AzureOpenAI(
                         api_base=os.environ["AZURE_OPENAI_ENDPOINT"],
                         api_version='2023-05-15',
                         deployment_id='GPT-4-32K',
                         api_key=os.environ['OPENAI_API_KEY']
                        )

dspy.settings.configure(lm=turbo) #, rm=retriever

draft_passage = "All variants within the transmembrane domain, including the previously reported p.(Thr300Ile) variant, were characterized in silico and analyzed by molecular dynamics (MD) simulation studies. We identified three novel de novo missense variants in GABRA4 (NM_000809.4): c.797 C > T, p.(Pro266Leu), c.899 C > A, p.(Thr300Asn), and c.634 G > A, p.(Val212Ile)."

extract_numeric_snippets = dspy.Predict(ExtractNumericSnippets)
pred = extract_numeric_snippets(draft_passage=draft_passage)
print(pred.answer)

Here is the result (NB: the message ends with incomplete JSON, in the middle of a nested key-value pair):

[
  {"verbatim": "p.(Thr300Ile)", "number": "300", "unit": "", "measurement": "identifier of variant"},
  {"verbatim": "GABRA4 (NM_000809.4)", "number": "NM_000809.4", "unit": "", "measurement": "identifier of gene"},
  {"verbatim": "c.797 C > T", "number": "797", "unit": "", "measurement": "identifier of variant"},
  {"verbatim": "p.(Pro266Leu)", "number": "266", "unit": "", "measurement": "identifier of variant"},
  {"verbatim": "c.899 C > A", "number": "899", "

#                                       incomplete JSON  ^^^^^^^

Looking under the hood with turbo.inspect_history(n=1):

From a given text, extract a list of factoids with numbers verbatim from the text, such as periods of time, identifiers of studies or compounds, numbers per cohort, concentrations etc. 
for each factoid provide: {"verbatim": [include the digits as well as units and measurements if contiguous], "number": [...], "unit": [...], "measurement": [...]} 
Include what was measured in "verbatim" field for measurements if it is not interrupted in the source text by other identifiers or digits, e.g.: "30 days after treatment", "40 mg of aspirin" etc. 
If multiple values are provided in a row, e.g.: "exposure at 1, 10, and 30 min", keep the phrase with the measurement, unit, and the sequence of numbers together and in "number" field provide a list of numbers
Measurement should be one of the following:
      - count of [entities such as animals, groups, apples etc]
      - ratio of [solution of compound X, Y, and Z; etc] -- extract numbers as a string in such case, e.g. '10:20:30'
      - percentage of [incidence of disease A, animals with certain symptom etc]
      - physical entity like duration, start time of [experiment name], end time of [experiment name] etc -- this should meaningfully match respective units
      - identifier of [name of the identified study, compound, spaceship etc] -- extract a string in such case; do include non-numeric symbols into the "number" field, e.g. "ABC-12456"
    In both "measurement" and "verbatim" field include indication of what kind of number (count, ratio, identifier etc) it is, e.g., "count of gold fish in the Milky pond" or "ratio of solution of honey in whiskey" etc.

---

Follow the following format.

Draft Passage: ${draft_passage}
Answer: A JSON list of key-value pairs

---

Draft Passage: All variants within the transmembrane domain, including the previously reported p.(Thr300Ile) variant, were characterized in silico and analyzed by molecular dynamics (MD) simulation studies. We identified three novel de novo missense variants in GABRA4 (NM_000809.4): c.797 C > T, p.(Pro266Leu), c.899 C > A, p.(Thr300Asn), and c.634 G > A, p.(Val212Ile).
Answer: [
  {"verbatim": "p.(Thr300Ile)", "number": "300", "unit": "", "measurement": "identifier of variant"},
  {"verbatim": "GABRA4 (NM_000809.4)", "number": "NM_000809.4", "unit": "", "measurement": "identifier of gene"},
  {"verbatim": "c.797 C > T", "number": "797", "unit": "", "measurement": "identifier of variant"},
  {"verbatim": "p.(Pro266Leu)", "number": "266", "unit": "", "measurement": "identifier of variant"},
  {"verbatim": "c.899 C > A", "number": "899", "

Taking this prompt to the same model in LangChain:

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import AzureChatOpenAI
from langchain_openai.embeddings.azure import AzureOpenAIEmbeddings

embedding_function = AzureOpenAIEmbeddings(deployment="textEmbedding")  # model="text-embedding-ada-002"; not used below

llm = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_deployment="GPT-4",
)

messages = [SystemMessage(
        content="""From a given text, extract a JSON list of factoids with numbers verbatim from the text, such as periods of time, identifiers of studies or compounds, numbers per cohort, concentrations etc. 
for each factoid provide: {"verbatim": [include the digits as well as units and measurements if contiguous], "number": [...], "unit": [...], "measurement": [...]} 
Include what was measured in "verbatim" field for measurements if it is not interrupted in the source text by other identifiers or digits, e.g.: "30 days after treatment", "40 mg of aspirin" etc. 
If multiple values are provided in a row, e.g.: "exposure at 1, 10, and 30 min", keep the phrase with the measurement, unit, and the sequence of numbers together and in "number" field provide a list of numbers

---

Follow the following format.

Draft Passage: ${draft_passage}
Answer: A list of key-value pairs

---
"""),
    HumanMessage(
content=f"""Draft Passage: {draft_passage}
Answer:""")]
response = llm(messages)
print(response.content)

Output:

[
  {"verbatim": "p.(Thr300Ile)", "number": "300", "unit": "", "measurement": "Thr300Ile variant"},
  {"verbatim": "GABRA4 (NM_000809.4)", "number": "000809.4", "unit": "", "measurement": "GABRA4"},
  {"verbatim": "c.797 C > T", "number": "797", "unit": "", "measurement": "c.797 C > T"},
  {"verbatim": "p.(Pro266Leu)", "number": "266", "unit": "", "measurement": "Pro266Leu variant"},
  {"verbatim": "c.899 C > A", "number": "899", "unit": "", "measurement": "c.899 C > A"},
  {"verbatim": "p.(Thr300Asn)", "number": "300", "unit": "", "measurement": "Thr300Asn variant"},
  {"verbatim": "c.634 G > A", "number": "634", "unit": "", "measurement": "c.634 G > A"},
  {"verbatim": "p.(Val212Ile)", "number": "212", "unit": "", "measurement": "Val212Ile variant"}
]

I get the same result if I jam the whole prompt into the SystemMessage in LangChain.
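For completeness, a minimal sketch of that single-message variant; full_prompt is a hypothetical variable standing in for the complete prompt shown by turbo.inspect_history(n=1), ending with "Answer:":

from langchain_core.messages import SystemMessage

# Hypothetical: full_prompt holds the entire DSPy prompt (instructions, format
# block, and draft passage), ending with "Answer:".
full_prompt = "..."

response = llm([SystemMessage(content=full_prompt)])
print(response.content)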

detaos commented 4 months ago

You probably just hit max tokens. Try increasing it.

OpenAI example:

lm = dspy.OpenAI(
    [...],
    max_tokens = 4096,
)

Not sure what the Azure equivalent is.
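For the Azure client, a minimal sketch mirroring the constructor from the original report, under the assumption that dspy.AzureOpenAI passes extra keyword arguments such as max_tokens straight through to the completion request:

import os
import dspy

turbo = dspy.AzureOpenAI(
    api_base=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2023-05-15",
    deployment_id="GPT-4-32K",
    api_key=os.environ["OPENAI_API_KEY"],
    max_tokens=4096,  # assumption: overrides the library default for every call made with this LM
)
dspy.settings.configure(lm=turbo)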

DSLituiev commented 4 months ago

I see. This is a very counter-intuitive feature, or should I call it a bug? Can it be set to None or float('inf') by default? I just realized it is set to a mere 150:

turbo.kwargs

{'temperature': 0.0,
 'max_tokens': 150,
 'top_p': 1,
 'frequency_penalty': 0,
 'presence_penalty': 0,
 'n': 1,
 'deployment_id': 'GPT-4-32K',
 'model': 'GPT-4-32K'}

Is there a reason to set it to such a tiny number by default?
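For anyone hitting the same truncation, a quick sketch of raising the limit on an already-constructed client, assuming the values in turbo.kwargs are what get sent with each completion request (as the dump above suggests):

# Assumption: turbo.kwargs is consulted on every request, so bumping it takes
# effect for subsequent calls without rebuilding the client.
turbo.kwargs["max_tokens"] = 4096

pred = extract_numeric_snippets(draft_passage=draft_passage)
print(pred.answer)  # the JSON list should no longer be cut off mid-object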

detaos commented 4 months ago

You can certainly try setting max_tokens to None ... but I usually try to set it to a reasonable value for the use case at hand. If I'm expecting a short response, I don't want to wait for it to generate 100k tokens.

hasalams commented 4 months ago

You can certainly try setting max_tokens to None ... but I usually try to set it to a reasonable value for the use case at hand. If I'm expecting a short response, I don't want to wait for it to generate 100k tokens.

Agreed, setting it to a reasonable value (depending on your expected output) is recommended rather than leaving it as None.
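If the larger budget should apply only to this extractor rather than globally, here is a sketch under the assumption that dspy.Predict forwards extra constructor keyword arguments (its **config) to the underlying LM call:

# Assumption: keyword arguments passed to Predict are merged into the LM call,
# so only this module gets the larger max_tokens.
extract_numeric_snippets = dspy.Predict(ExtractNumericSnippets, max_tokens=4096)
pred = extract_numeric_snippets(draft_passage=draft_passage)
print(pred.answer)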