neulab / prompt2model

prompt2model - Generate Deployable Models from Natural Language Instructions
Apache License 2.0

Add data transformation capability to dataset retrieval step #385

Closed saum7800 closed 8 months ago

saum7800 commented 10 months ago

Description

This PR adds a new component, DatasetTransformer. The code currently contains one implementation of it: PromptBasedDatasetTransformer. It is used in the Dataset Retrieval step to transform retrieved data into a format that is more directly relevant to the task given by the user. Here is the broad flow:

  1. The dataset retriever chooses a dataset.
  2. Automatic column selection chooses the relevant columns from that dataset.
  3. We remove the columns that automatic column selection did not choose.
  4. We create a new PromptBasedDatasetTransformer object and call its transform_data function with the PromptSpec and the loaded dataset.
  5. transform_data builds a prompt from the prompt_spec and example rows of the retrieved dataset, then calls the APIAgent, which returns a "plan" for carrying out the transformation.
  6. Next, it creates a list of "transform_prompts", where each transform prompt contains the "plan", one sample from the retrieved dataset, and the PromptSpec. It then requests a batch completion from the APIAgent for all the transform_prompts.
  7. Finally, we extract the "input" and "output" keys from each response and canonicalize the inputs and outputs into a dataset.
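
For readers who want the shape of steps 3 to 7 in code, here is a minimal sketch. It is not the code in this PR: every function name, the prompt wording, the choice of three example rows, and the assumption that each response is a JSON object are hypothetical, and the two callables stand in for the APIAgent's single and batch completion calls.

```python
import json
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]


def keep_selected_columns(rows: List[Row], selected: List[str]) -> List[Row]:
    """Step 3: drop every column that column selection did not choose."""
    return [{col: row[col] for col in selected if col in row} for row in rows]


def build_plan_prompt(instruction: str, example_rows: List[Row]) -> str:
    """Step 5: prompt asking for a transformation plan (wording is made up)."""
    examples = "\n".join(json.dumps(row) for row in example_rows)
    return (
        f"Task: {instruction}\nExample rows:\n{examples}\n"
        "Write a plan for adapting these rows to the task."
    )


def build_transform_prompt(instruction: str, plan: str, row: Row) -> str:
    """Step 6: one prompt per row, carrying the task, the plan, and the row."""
    return (
        f"Task: {instruction}\nPlan: {plan}\nRow: {json.dumps(row)}\n"
        'Reply with a JSON object with the keys "input" and "output".'
    )


def parse_response(response: str) -> Optional[Tuple[str, str]]:
    """Step 7: pull "input" and "output" out of one response, or return None.

    Assumes the response is a JSON object; anything else is treated as unusable.
    """
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    if "input" not in parsed or "output" not in parsed:
        return None
    return str(parsed["input"]), str(parsed["output"])


def transform_data(
    instruction: str,
    rows: List[Row],
    complete: Callable[[str], str],
    batch_complete: Callable[[List[str]], List[str]],
    num_plan_examples: int = 3,
) -> Dict[str, List[str]]:
    """Steps 5 to 7: get a plan, transform each row, collect the usable pairs."""
    plan = complete(build_plan_prompt(instruction, rows[:num_plan_examples]))
    prompts = [build_transform_prompt(instruction, plan, row) for row in rows]
    inputs: List[str] = []
    outputs: List[str] = []
    for response in batch_complete(prompts):
        pair = parse_response(response)
        if pair is None:
            continue  # skip responses that lack the expected keys
        inputs.append(pair[0])
        outputs.append(pair[1])
    return {"input": inputs, "output": outputs}
```

Passing the completion functions in as arguments keeps the sketch runnable without an API key, since a test can supply canned responses. Silently skipping unparseable responses is also an assumption of this sketch; the PR may handle them differently.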

neubig commented 9 months ago

Hi @saum7800, I can take another look when you've had a moment to make the above revisions!

saum7800 commented 9 months ago

Hey @neubig, I have resolved the comments that seemed like easy fixes and left replies on a couple of others that we can discuss. Please re-review whenever you get a chance. Thanks!