Closed: jupiterepoch closed this issue 8 months ago
Thanks a lot for catching this! Do you have an example of where that might happen?
Hmm, thanks! We already do CoT+SC with dsp.majority in the intro notebook.
Here's a simplified version.
@dsp.transformation
def QA_predict(example: dsp.Example):
example, completions = dsp.generate(qa_template_with_CoT, n=20, temperature=0.7)(example, stage='qa')
completions = dsp.majority(completions)
return example.copy(answer=completions.answer)
The only difference I see is that we extract the answer at the end. Is there any other difference?
Maybe you're saying we have some regression. The same code from the intro notebook no longer works on the latest branch?
I think I know what the problem is. text-davinci-001 sometimes returns a List[str], whereas gpt-3.5-turbo or text-davinci-002 consistently returns str in my experiments. You could try running the intro notebook with text-davinci-001.
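If you go down the second path below, one way to handle the differing return types could be a small normalization step. This is only a sketch, not dsp's actual code; `normalize_completion` is a made-up name for illustration:

```python
def normalize_completion(value):
    """Return a single string whether the LM gave str or List[str].

    Assumption: when the LM returns a list of strings, the first
    element is the one we want to keep.
    """
    if isinstance(value, list):
        return value[0] if value else ""
    return value
```

A helper like this could be applied to each completion's prediction field right after generation, so everything downstream can assume str.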
I guess either specifying that the code does not support certain GPT models, or adding code to handle the different LM behaviors, could work.
If you want to go down the second path, the evaluate function would probably need to change as well.
If your PR fixes this, I'm happy to merge it. But I'm unclear on what you mean by text-davinci-001 returning a list. Don't all models return a list of strings (i.e., a list of completions)? Are you saying that text-davinci-001 returns a list of lists?
If your PR fixes this, I'm happy to merge it.
Please don't merge just yet. As I mentioned, the evaluate function has similar problems. If you want to fix it, it's better to fix both.
But I'm unclear on what you mean by text-davinci-001 returning a list. Don't all models return a list of strings (i.e., a list of completions)? Are you saying that text-davinci-001 returns a list of lists?
Since completions: list[dict[str, Any]] = generator(prompt, **kwargs) here does not check the types of the dict values, a completion in completions can end up with completion[prediction_field] of multiple possible types. Different LMs do fill that prediction_field with different types, and downstream functions such as majority do not take this into consideration.
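To make this concrete, here is a sketch of how a majority vote could tolerate both types. `majority_answer` is a hypothetical helper, not dsp's actual majority implementation:

```python
from collections import Counter

def majority_answer(completions, field="answer"):
    """Pick the most common answer across completions, coercing
    List[str] values to their first element before voting."""
    votes = []
    for completion in completions:
        value = completion[field]
        if isinstance(value, list):
            # Assumption: take the first element of a list-valued answer.
            value = value[0] if value else ""
        votes.append(value)
    return Counter(votes).most_common(1)[0][0]
```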
I'm on a $5 budget for OpenAI calls, so I'm afraid I won't be able to help you test more cases.
Thanks for reporting this and helping with it. We haven’t been able to reproduce it unfortunately. cc: @lawliet19189
Hi @okhat, I believe the problem may come from the extract function in template_v2.py (https://github.com/stanfordnlp/dsp/blob/12d7f1106f2f524be161d35865a81e0ea016e929/dsp/templates/template_v2.py#L131).
The function takes an example and overwrites example.answer if an answer is found. However, in some cases, the response from the LM is malformed. For example, when we add rationale to our template, we expect the LM to follow the Answer: {LM's answer} format, but we do not enforce it. Here is one possible output generated by the LM:
Context:
[1] «BTC-e | Gina Esparon and Evaline Sophie Joubert and two people with significant control: Alexander Buyanov and Andrii Shvets. The US Justice Dept attempted to close down BTC-e on the 26th of July 2017 when they charged Alexander Vinnik and BTC-e in a 21-count indictment for operating an alleged international money laundering scheme and allegedly laundering funds from the hack of Mt. Gox. The exchange plans to relaunch in September 2018. BTC-e started in July 2011, handling a few coin pairs, including Bitcoin/U. S. dollar and I0Coin to Bitcoin. By October 2011, they supported many different currency pairs, including Litecoin to dollars,»
[2] «Andy Harris (politician) | Furthermore, "Harris then asked if he could purchase insurance from the government to cover the gap," added an aide, who was struck by the similarity to Harris's request and the public option he denounced as a gateway to socialized medicine. On October 16, 2013, Harris voted against the motion to end the government shutdown and raise the debt ceiling. Harris has defended Hungarian leader Viktor Orbán, an ally of Vladimir Putin who has pledged to turn his country into an "illiberal democracy". Harris decried the State Departments' efforts to support free and independent media in the country. In 2014, Harris»
[3] «BuysUSA.com | a Mercedes and other luxury vehicles with his profits. The Federal Bureau of Investigation investigated the site and was ultimately responsible for shutting down the site permanently in October 2005. On Friday, August 25, 2006, Danny Ferrer appeared before U.S. District Court Judge T.T. Ellis III in Alexandria, Virginia. He was told, "You extended your hand into the pockets of these people," and that, "If severe penalties were not attached, people would line up from here to Los Angeles to do what you've done." Danny Ferrer claimed to the judge that he started selling the software to pay for a»
[4] «United States federal government shutdown of 1990 | the Smithsonian museums. Although none of the appropriations bills had been passed, not all government agencies actually shut down. Full shutdowns occurred in the Environmental Protection Agency, Department of Labor, Office of Personnel Management, and Department of Housing and Urban Development, and partial shutdowns occurred in the Library of Congress, Government Printing Office, and the Departments of Energy, Interior, and State. According to a study by the General Accounting Office, Interior (which includes the National Park Service) furloughed about 2,800 workers, the Library of Congress around 100, and the other agencies fewer than 10 each. In the morning of Tuesday,»
[5] «Jinkanpo Atsugi Incinerator | the poor air quality. In May 2001, the Japanese government purchased the plant for nearly 40 million dollars and shut it down following a United States Department of Justice lawsuit against the private incinerator owner. Dismantling was completed by the end of that year. Some former residents of Atsugi NAF still complain of health problems related to the incinerator's emissions and report that the USN has been reluctant to address their concerns. The incinerator contaminated the base, especially the housing area, with dioxin, heavy metals, and other deadly toxins. In June, 2007, the USN's Environmental Health Center announced that it»
Question: What government department did Buyantu shut down?
Rationale: Let's think step by step. There is no mention of Buyantu shutting down a government department in the provided context. Therefore, the answer is: N/A.
In this case, the template fails to extract N/A from the answer and leaves example.answer as is. In datasets like SQuAD, example.answer is usually a list, so that's why completion.answer is a list.
My guess comes from the following observation. Here is a code snippet from my CS224U homework:
test_case = dev_exs[56] # an entry of dev set of SQUAD
print("Question:", test_case.question)
print("Expected answer:", test_case.answer)
print("Generated answer:", few_shot_openqa_custom(test_case, 3).answer)
And this is the output:
Question: What government department did Buyantu shut down?
Expected answer: ['the Department of State Affairs', 'Department of State Affairs', 'the Department of State Affairs']
Generated answer: ['the Department of State Affairs', 'Department of State Affairs', 'the Department of State Affairs']
If you want, I can write a PR to set answer to an empty string if the template fails to extract an answer. Otherwise, this may lead to overestimation of performance during evaluation.
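The fallback described above could look roughly like this sketch. `extract_answer` is a hypothetical stand-in for illustration, not dsp's actual extract function:

```python
import re

def extract_answer(text, default=""):
    """Pull the text after 'Answer:' (or after 'the answer is:');
    fall back to `default` instead of leaving a stale example.answer."""
    m = re.search(r"Answer:\s*(.+)", text)
    if m:
        return m.group(1).strip()
    # Tolerate the CoT phrasing "... the answer is: N/A."
    m = re.search(r"the answer is:?\s*(.+)", text, re.IGNORECASE)
    if m:
        return m.group(1).strip().rstrip(".")
    return default
```

With a default of "", a malformed completion would score zero in evaluation rather than silently keeping the gold answer.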
Thanks @tsunrise ! I strongly recommend not passing a populated .answer field to your program. Otherwise you risk unintended leakage one way or another. Luckily, your evaluation function seems to expect a string and will raise an exception, I believe, if you try feeding it a list, so hopefully this ensures no leakage in your specific case.
If you remove that field before calling generate, dsp.generate will invoke the LLM as many times as needed to make sure all output fields are produced.
A PR that unsets output fields that are not overwritten by extract will be highly welcome, although the correct semantics in this case seem ambiguous: what does the user expect when some of the output fields are already set?
In general, always make sure to pass only the questions (or example objects with only questions set) to your program. (Or, at least, unset the output values you expect dsp.generate to produce, when there are multiple output fields.)
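For illustration, stripping the output fields before calling generate could look like this dict-based sketch; `inputs_only` is a hypothetical helper, not part of dsp's API:

```python
def inputs_only(example, output_fields=("answer", "rationale")):
    """Return a copy of a dict-like example with output fields removed,
    so dsp.generate is forced to produce them itself."""
    return {k: v for k, v in example.items() if k not in output_fields}
```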
Sounds good! I have created a PR for that.
Thanks for opening this, this got fundamentally resolved in v2 (DSPy). Closing.
Link to the code snippet in question: https://github.com/stanfordnlp/dsp/blob/12d7f1106f2f524be161d35865a81e0ea016e929/dsp/primitives/predict.py#L211
Expected behavior: The function should work with pred[prediction_field] of type Union[List[str], str].
Current behavior: The function raises an error when calling normalize_text on pred[prediction_field] when it is of type List[str].
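A minimal sketch of the fix could guard the call site. `safe_normalize` is hypothetical, and the normalize_text shown is a simplified stand-in for dsp's version:

```python
def normalize_text(s):
    """Simplified stand-in for dsp's normalize_text."""
    return " ".join(s.lower().split())

def safe_normalize(pred, prediction_field):
    """Coerce pred[prediction_field] to str before normalizing,
    so both List[str] and str values are handled."""
    value = pred[prediction_field]
    if isinstance(value, list):
        # Assumption: a list-valued prediction is represented by its first element.
        value = value[0] if value else ""
    return normalize_text(value)
```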