Closed: jupiterepoch closed this issue 8 months ago
Thanks a lot for catching this! Do you have an example of where that might happen?
Hmm, thanks! We already do CoT+SC with dsp.majority in the intro notebook.
Here's a simplified version.
@dsp.transformation
def QA_predict(example: dsp.Example):
example, completions = dsp.generate(qa_template_with_CoT, n=20, temperature=0.7)(example, stage='qa')
completions = dsp.majority(completions)
return example.copy(answer=completions.answer)
The only difference I see is that we extract the answer at the end. Is there any other difference?
Maybe you're saying we have some regression. The same code from the intro notebook no longer works on the latest branch?
I think I know what the problem is. text-davinci-001 sometimes returns a List[str], whereas gpt-3.5-turbo or text-davinci-002 consistently returns str in my experiments. You could try running the intro notebook with text-davinci-001.
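If you go down the second path below, one way to handle the differing return types could be a small normalization step. This is only a sketch, not dsp's actual code; `normalize_completion` is a made-up name for illustration:

```python
def normalize_completion(value):
    """Return a single string whether the LM gave str or List[str].

    Assumption: when the LM returns a list of strings, the first
    element is the one we want to keep.
    """
    if isinstance(value, list):
        return value[0] if value else ""
    return value
```

A helper like this could be applied to each completion's prediction field right after generation, so everything downstream can assume str.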
I guess either specifying that the code does not support certain GPT models, or adding code to handle the different LM behaviors, could work.
If you want to go down the second path, the evaluate function would probably need to change as well.
If your PR fixes this, I'm happy to merge it. But I'm unclear on what you mean by text-davinci-001 returning a list. Don't all models return a list of strings (i.e., a list of completions)? Are you saying that text-davinci-001 returns a list of lists?
If your PR fixes this, I'm happy to merge it.
Please don't merge just yet. As I mentioned, the evaluate function has similar problems. If you want to fix it, it's better to fix both.
But I'm unclear on what you mean by text-davinci-001 returning a list. Don't all models return a list of strings (i.e., a list of completions)? Are you saying that text-davinci-001 returns a list of lists?
Since completions: list[dict[str, Any]] = generator(prompt, **kwargs) here does not check the types of the dict values, a completion in completions can end up with completion[prediction_field] of multiple possible types. Different LMs do fill that prediction_field with different types, and downstream functions such as majority do not take this into consideration.
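To make this concrete, here is a sketch of how a majority vote could tolerate both types. `majority_answer` is a hypothetical helper, not dsp's actual majority implementation:

```python
from collections import Counter

def majority_answer(completions, field="answer"):
    """Pick the most common answer across completions, coercing
    List[str] values to their first element before voting."""
    votes = []
    for completion in completions:
        value = completion[field]
        if isinstance(value, list):
            # Assumption: take the first element of a list-valued answer.
            value = value[0] if value else ""
        votes.append(value)
    return Counter(votes).most_common(1)[0][0]
```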
I'm on a $5 budget for OpenAI calls, so I'm afraid I won't be able to help you test more cases.
Thanks for reporting this and helping with it. We haven’t been able to reproduce it unfortunately. cc: @lawliet19189
Hi @okhat, I believe the problem may come from the extract function in template_v2.py (https://github.com/stanfordnlp/dsp/blob/12d7f1106f2f524be161d35865a81e0ea016e929/dsp/templates/template_v2.py#L131).
The function takes an example and overwrites example.answer if an answer is found. However, in some cases, the response from the LM is malformed. For example, when we add rationale to our template, we expect the LM to follow the Answer: {LM's answer} format, but we do not enforce it. Here is one possible output generated by the LM:
Context:
[1] «BTC-e | Gina Esparon and Evaline Sophie Joubert and two people with significant control: Alexander Buyanov and Andrii Shvets. The US Justice Dept attempted to close down BTC-e on the 26th of July 2017 when they charged Alexander Vinnik and BTC-e in a 21-count indictment for operating an alleged international money laundering scheme and allegedly laundering funds from the hack of Mt. Gox. The exchange plans to relaunch in September 2018. BTC-e started in July 2011, handling a few coin pairs, including Bitcoin/U. S. dollar and I0Coin to Bitcoin. By October 2011, they supported many different currency pairs, including Litecoin to dollars,»
[2] «Andy Harris (politician) | Furthermore, "Harris then asked if he could purchase insurance from the government to cover the gap," added an aide, who was struck by the similarity to Harris's request and the public option he denounced as a gateway to socialized medicine. On October 16, 2013, Harris voted against the motion to end the government shutdown and raise the debt ceiling. Harris has defended Hungarian leader Viktor Orbán, an ally of Vladimir Putin who has pledged to turn his country into an "illiberal democracy". Harris decried the State Departments' efforts to support free and independent media in the country. In 2014, Harris»
[3] «BuysUSA.com | a Mercedes and other luxury vehicles with his profits. The Federal Bureau of Investigation investigated the site and was ultimately responsible for shutting down the site permanently in October 2005. On Friday, August 25, 2006, Danny Ferrer appeared before U.S. District Court Judge T.T. Ellis III in Alexandria, Virginia. He was told, "You extended your hand into the pockets of these people," and that, "If severe penalties were not attached, people would line up from here to Los Angeles to do what you've done." Danny Ferrer claimed to the judge that he started selling the software to pay for a»
[4] «United States federal government shutdown of 1990 | the Smithsonian museums. Although none of the appropriations bills had been passed, not all government agencies actually shut down. Full shutdowns occurred in the Environmental Protection Agency, Department of Labor, Office of Personnel Management, and Department of Housing and Urban Development, and partial shutdowns occurred in the Library of Congress, Government Printing Office, and the Departments of Energy, Interior, and State. According to a study by the General Accounting Office, Interior (which includes the National Park Service) furloughed about 2,800 workers, the Library of Congress around 100, and the other agencies fewer than 10 each. In the morning of Tuesday,»
[5] «Jinkanpo Atsugi Incinerator | the poor air quality. In May 2001, the Japanese government purchased the plant for nearly 40 million dollars and shut it down following a United States Department of Justice lawsuit against the private incinerator owner. Dismantling was completed by the end of that year. Some former residents of Atsugi NAF still complain of health problems related to the incinerator's emissions and report that the USN has been reluctant to address their concerns. The incinerator contaminated the base, especially the housing area, with dioxin, heavy metals, and other deadly toxins. In June, 2007, the USN's Environmental Health Center announced that it»
Question: What government department did Buyantu shut down?
Rationale: Let's think step by step. There is no mention of Buyantu shutting down a government department in the provided context. Therefore, the answer is: N/A.
In this case, the template fails to extract N/A from the answer and leaves example.answer as is. In datasets like SQuAD, example.answer is usually a list, so that's why completion.answer is a list.
My guess comes from the following observation. Here is a code snippet from my CS224U homework:
test_case = dev_exs[56] # an entry of dev set of SQUAD
print("Question:", test_case.question)
print("Expected answer:", test_case.answer)
print("Generated answer:", few_shot_openqa_custom(test_case, 3).answer)
And this is the output:
Question: What government department did Buyantu shut down?
Expected answer: ['the Department of State Affairs', 'Department of State Affairs', 'the Department of State Affairs']
Generated answer: ['the Department of State Affairs', 'Department of State Affairs', 'the Department of State Affairs']
If you want, I can write a PR to set answer to an empty string if the template fails to extract an answer. Otherwise, this may lead to overestimation of performance during evaluation.
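The fallback described above could look roughly like this sketch. `extract_answer` is a hypothetical stand-in for illustration, not dsp's actual extract function:

```python
import re

def extract_answer(text, default=""):
    """Pull the text after 'Answer:' (or after 'the answer is:');
    fall back to `default` instead of leaving a stale example.answer."""
    m = re.search(r"Answer:\s*(.+)", text)
    if m:
        return m.group(1).strip()
    # Tolerate the CoT phrasing "... the answer is: N/A."
    m = re.search(r"the answer is:?\s*(.+)", text, re.IGNORECASE)
    if m:
        return m.group(1).strip().rstrip(".")
    return default
```

With a default of "", a malformed completion would score zero in evaluation rather than silently keeping the gold answer.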
Thanks @tsunrise ! I strongly recommend not passing a populated .answer field to your program. Otherwise you risk unintended leakage one way or another. Luckily, your evaluation function seems to expect a string and will raise an exception, I believe, if you try feeding it a list, so hopefully this ensures no leakage in your specific case.
If you remove that field before calling generate, dsp.generate will invoke the LLM as many times as needed to make sure all output fields are produced.
A PR that unsets output fields that are not overwritten by extract will be highly welcome, although the correct semantics in this case seem ambiguous: what does the user expect when some of the output fields are already set?
In general, always make sure to pass only the questions (or example objects with only questions set) to your program. (Or, at least, unset the output values you expect dsp.generate to produce, when there are multiple output fields.)
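For illustration, stripping the output fields before calling generate could look like this dict-based sketch; `inputs_only` is a hypothetical helper, not part of dsp's API:

```python
def inputs_only(example, output_fields=("answer", "rationale")):
    """Return a copy of a dict-like example with output fields removed,
    so dsp.generate is forced to produce them itself."""
    return {k: v for k, v in example.items() if k not in output_fields}
```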
Sounds good! I have created a PR for that.
Thanks for opening this, this got fundamentally resolved in v2 (DSPy). Closing.
Link to the code snippet in question: https://github.com/stanfordnlp/dsp/blob/12d7f1106f2f524be161d35865a81e0ea016e929/dsp/primitives/predict.py#L211
Expected behavior: The function should work with pred[prediction_field] of type Union[List[str], str].
Current behavior: The function raises an error when calling normalize_text on pred[prediction_field] when it is of type List[str].
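A minimal sketch of the fix could guard the call site. `safe_normalize` is hypothetical, and the normalize_text shown is a simplified stand-in for dsp's version:

```python
def normalize_text(s):
    """Simplified stand-in for dsp's normalize_text."""
    return " ".join(s.lower().split())

def safe_normalize(pred, prediction_field):
    """Coerce pred[prediction_field] to str before normalizing,
    so both List[str] and str values are handled."""
    value = pred[prediction_field]
    if isinstance(value, list):
        # Assumption: a list-valued prediction is represented by its first element.
        value = value[0] if value else ""
    return normalize_text(value)
```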