reasoning-machines / pal

PaL: Program-Aided Language Models (ICML 2023)
https://reasonwithpal.com
Apache License 2.0

Can pal (or its technique) be used to generate a pandas DataFrame? #15

Closed cworkschris closed 1 year ago

cworkschris commented 1 year ago

Hi, thanks for the exciting project - I had limited success safely evaluating Python from GPT output, so this is great to see!

I was hoping to use the structured processing to emit data types rather than complicated math, but I had no luck either asking for that directly or following your pattern of giving examples first in the prompt, e.g.

Q: generate a pandas DataFrame with the largest 10 cities in the united states with their populations.
A: import pandas as pd
data = {'City': ['New York City', 'Los Angeles', 'Chicago', 'Houston', 'Philadelphia', 'Phoenix', 'San Antonio', 'San Diego', 'Dallas', 'San Jose'],
       'Population': [8550405, 3971883, 2720546, 2296224, 1567442, 1445632, 1469845, 1394928, 1300092, 1026908]}
df = pd.DataFrame(data)
Q: generate a pandas DataFrame with the largest 10 counties in the world with their area in square miles.

But the index error on the interface.run() seems to indicate that results were not obtainable.

Do you think this is possible? Or is it too different from the purpose of this project?

urialon commented 1 year ago

Hi @cworkschris, thank you for your interest in our work!

Can you please clarify:

  1. What do you mean by "index error on the interface.run()"?
  2. Which model did you use?

Best, Uri

cworkschris commented 1 year ago

Hi Uri, (Sorry, I kept waiting to come across the IPython window I had been working in, and I finally found it! Darn huge Linux uptime!) I was using 'code-davinci-002'. Here is a more complete example for you:

In [20]: p = """
    ...: Q: generate a pandas DataFrame with the largest 10 cities in the united states with their populations.
    ...: A: import pandas as pd
    ...: data = {'City': ['New York City', 'Los Angeles', 'Chicago', 'Houston', 'Philadelphia', 'Phoenix', 'San Antonio', 'San Diego', 'Dallas', 'San Jose'],
    ...:        'Population': [8550405, 3971883, 2720546, 2296224, 1567442, 1445632, 1469845, 1394928, 1300092, 1026908]}
    ...: df = pd.DataFrame(data)
    ...: Q: generate a pandas DataFrame with the largest 10 counties in the world with their area in square miles.
    ...: """

In [21]: a = interface.run(p)
EOL while scanning string literal (<string>, line 1)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[21], line 1
----> 1 a = interface.run(p)

File ~/tw/palchain/pal/pal/core/interface.py:139, in ProgramInterface.run(self, prompt, time_out, temperature, top_p, max_tokens, majority_at)
    137         results.append(exec_result)
    138 counter = Counter(results)
--> 139 return counter.most_common(1)[0][0]

IndexError: list index out of range

Thanks and have a good one! Chris

urialon commented 1 year ago

Hi @cworkschris,

Your question revealed a bug that I introduced while implementing support for majority-at-k evaluation. The problem occurred when no results were produced (for example, when the generated code did not return anything or was not valid). I just checked in a fix that returns None in this case.
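
The failure in the traceback above, and a guard of the kind described here, can be sketched as follows. This is only a minimal illustration and not the actual committed fix: `majority_vote` is a hypothetical helper name, while the `Counter(results).most_common(1)[0][0]` expression is taken from the traceback.

```python
from collections import Counter


def majority_vote(results):
    """Return the most common result, or None when there are no results.

    Hypothetical helper for illustration; not the function used in the repo.
    """
    # Counter([]).most_common(1) returns an empty list, so indexing it with
    # [0][0] raises IndexError. Guard against the empty case explicitly.
    if not results:
        return None
    return Counter(results).most_common(1)[0][0]


print(majority_vote([]))           # None: no generated program produced a result
print(majority_vote([42, 42, 7]))  # 42: the majority answer wins
```

With a guard like this, a caller has to check for `None` before using the result, since it signals that none of the generated programs produced a usable answer.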

However, it seems to me that your use case does not require PAL. PAL is intended for cases where you need to generate and run a Python program to answer a natural language question.

For example, in your case, PAL can be useful for extracting statistics and other facts from data that is provided in natural language. If you need the DataFrame itself (rather than its execution results), you can just prompt a model such as code-davinci-002.

Please let me know if you have any more questions; I feel that I didn't fully get your intent. Best, Uri