recipe-generation

Apache License 2.0

This repository contains the code for the paper From Sentence to Action: Splitting AMR Graphs for Recipe Instructions (DMR 2023).

It contains the scripts to separate sentence-level AMR graphs into AMR graphs for individual action events and to generate action-event level recipe instructions for the obtained AMRs.
The code for training the generation model by fine-tuning a T5-based AMR-to-text model for this task can be found in the recipe-generation-model repository.
The Wiki contains more details about the implemented algorithms as well as the AMR graph structures and file formats.

The implemented steps of the overall pipeline are:

  1. Parsing each recipe sentence by sentence into AMR graphs with recipe-level node-to-token alignments, see AMR Parsing
  2. Separating the AMR graphs into subgraphs in order to obtain one AMR per action event in the corresponding action graph for the recipe, see AMR Splitting
  3. Extracting approximated gold instructions for the split action-level AMR graphs, as well as extracting action-level instructions based on dependency information only, see Extraction
  4. Generating a recipe text based on an action graph, the AMR graphs corresponding to each action node, and a graph traversal, see Generating Recipe Texts

Requirements

Tested with Python 3.6 and newer versions.

Run pip install -e . in the main repository directory. This makes all modules and functions within the repository importable and already installs most of the dependencies, with the following two exceptions:

The pytorch library (e.g. version 1.10.1)

Transformers from Huggingface (e.g. version 4.11.3; version 3 will probably not work). Use the installation command appropriate for your OS and environment setup.

Note: Running the AMR parser requires additional dependencies beyond those listed in this section (see the amr_parsing Readme).

AMR Parsing

See the Readme in the amr_parsing folder for more details on creating the AMR representations of a recipe corpus and the requirements. If a dataset of recipe AMRs with node-to-token alignments is already available, the amr_parsing subfolder can be excluded to avoid the need to install the dependencies for the parser.

AMR Splitting

For the details on how the AMR splitting algorithm works see the Wiki.

Create a folder data in the main project folder. Add the folder with the ARA 1.1 corpus to the data folder and call it ara1.1.

Create a folder data_ara2 in the main project folder. Add the folder with the ARA2 corpus to the data_ara2 folder and call it ara2.0.

Additionally, add the folder with the parsed sentence-level AMRs (including node-token alignments matching the token IDs of the ARA corpus) and call it recipe_amrs_sentences.

Instead of naming the folders as explained above, you can simply adapt the ARA_DIR and SENT_AMR_DIR variables in utils/paths.py to match your folder structure.

The folder structure should be:

---data
  |---ara1.1
    |---dish1
       |---recipes
          |---dish1_0.conllu
          |---dish1_1.conllu
          ...
       |---alignments.tsv
    |---dish2
    ...
  |---recipe_amrs_sentences
     |---dish1
         |---dish1_0_sentences_amr.txt
         |---dish1_1_sentences_amr.txt
         ...
     |---dish2
     ...
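Given this layout, each ARA .conllu recipe file can be paired with its sentence-level AMR file via the shared dishname_N stem. A minimal sketch; the helper and its name are illustrative and not part of the repository:

```python
from pathlib import Path

def pair_recipe_files(data_dir="data"):
    """Pair each ARA .conllu recipe with its sentence-level AMR file
    via the shared 'dishname_N' stem (illustrative helper)."""
    data = Path(data_dir)
    pairs = []
    for conllu in sorted(data.glob("ara1.1/*/recipes/*.conllu")):
        stem = conllu.stem                      # e.g. "dish1_0"
        dish = conllu.parent.parent.name        # e.g. "dish1"
        amr = data / "recipe_amrs_sentences" / dish / f"{stem}_sentences_amr.txt"
        pairs.append((conllu, amr if amr.exists() else None))
    return pairs
```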

Then run the amr_splitting.py script. It will run the AMR splitting algorithm on all AMRs in the recipe_amrs_sentences folder. The separated version of the corpus will be stored in the (automatically created) folder data/recipe_amrs_actions with one subfolder per dish, directly containing the .txt files for each recipe.

Additionally, two logging files will be created in the (automatically created) logs folder.

Extracting Gold Instructions

The separated AMRs produced by the splitting algorithm still include the original sentence of the source AMR as their ::snt metadata. In order to extract instructions for the separated action-level AMRs, navigate to training/prepare_data_sets and run the following:

python generate_gold_action_instructions.py --sep_dir [sep_dir] --orig_dir [orig_dir] --ara_dir [ara_dir] --out_dir [out_dir] --text

For more details about the extraction itself see the wiki page.
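The ::snt metadata mentioned above follows the usual AMR file convention of a "# ::snt ..." comment line above each graph. A minimal sketch for pulling those sentences out of an AMR file's text, assuming that convention (the helper name is illustrative):

```python
def extract_snt_sentences(amr_file_text):
    """Collect the sentence stored in each '# ::snt ...' metadata line
    of an AMR file (one line per graph, standard AMR convention)."""
    sentences = []
    for line in amr_file_text.splitlines():
        line = line.strip()
        if line.startswith("# ::snt "):
            sentences.append(line[len("# ::snt "):])
    return sentences
```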

Generating Recipe Texts

In order to generate one action-event level recipe based on a (specific) action graph, run
python generate_recipe.py --file [action_graph_file] --cont [context_len] --order [ordering_version] --config [configuration_file] --out [output_file]

In order to generate all action-event level recipes of a dataset split run
python generate_data_set_split.py --split [split_file] --type [split_type] --cont [context_length] --order [ordering_version] --config [configuration_file] --out [output_directory]

ordering_version
Can be "top", "ids", "pf", "pf-lf" or "pf-lf-id" (see the wiki page for details on the different traversals), or can be set to "all" to generate one recipe text for each ordering.
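Resolving the --order argument can be sketched as follows; the helper name and the expansion of "all" into every traversal are assumptions based on the description above:

```python
TRAVERSALS = ["top", "ids", "pf", "pf-lf", "pf-lf-id"]

def resolve_orderings(order_arg):
    """Map the --order argument to the list of traversals to generate
    (illustrative; 'all' expands to every traversal)."""
    if order_arg == "all":
        return list(TRAVERSALS)
    if order_arg not in TRAVERSALS:
        raise ValueError(f"unknown ordering: {order_arg}")
    return [order_arg]
```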

configuration_file
For more information about the configuration files for recipe generation, see the recipe-generation-model readme. For generating from an action graph, the configuration file only needs to include the "generator_args" parameter dict.
The specified "model_name_or_path" / "tokenizer_name_or_path" need to point to a directory with a trained T5-based AMR-to-text generation model, which needs to include all the files saved by the Huggingface methods for saving a model and a tokenizer.
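A minimal configuration sketch based on the parameters named above; the paths are placeholders, and any further generator_args follow the recipe-generation-model readme:

```json
{
  "generator_args": {
    "model_name_or_path": "path/to/trained_t5_model",
    "tokenizer_name_or_path": "path/to/trained_t5_model"
  }
}
```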

split_file
The path to a .tsv file with the assignment of the recipes to different splits.

train    baked_ziti_7
train    garam_masala_8
val      waffles_9
test     cauliflower_mash_1
...

It is also possible to pass a file obtained by running create_recipe2split_assignment from the recipe-generation-model repository (see the Reproducible Split section there). The script takes care of removing the leading path and the "_gold.txt" suffix.

split_type
Should be "train", "val" or "test" if the split file has the format shown above, but can be set to any value that occurs in the first column of the split file. All recipes whose value in the first column equals split_type are then chosen for generation.
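Putting the split-file handling together: a sketch that reads the .tsv content, normalizes entries produced by create_recipe2split_assignment (stripping the leading path and the "_gold.txt" suffix, as the script does), and keeps the recipes whose first column equals split_type. The function name is illustrative:

```python
import os

def load_split_recipes(split_file_text, split_type):
    """Return recipe names whose first .tsv column equals split_type,
    stripping any leading path and a trailing '_gold.txt' (illustrative)."""
    recipes = []
    for line in split_file_text.splitlines():
        if not line.strip():
            continue
        split, name = line.split("\t")
        name = os.path.basename(name)           # drop any leading path
        if name.endswith("_gold.txt"):
            name = name[:-len("_gold.txt")]
        if split == split_type:
            recipes.append(name)
    return recipes
```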

Making use of Coreference Information

The repository also contains scripts that use coreference information to make implicit arguments explicit, switch between explicit NP mentions and pronouns, create a variation of the recipe corpus, and optionally include the information for the syntactic-dependency-based splitting. This code (located in the coref_processing folder) is in a preliminary state.

Creating Joined Coref Files

Information about coreference clusters, the corresponding AMR nodes and coreferences arising from the AMR splitting can be obtained by running the coref_processing/create_joined_coref.py script.

This requires an additional subfolder of the data folder described above, containing one subfolder per dish with .jsonlines coref files.
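The .jsonlines format stores one JSON object per line. A generic reader sketch; the actual keys inside each object depend on the coref files and are not assumed here:

```python
import json

def read_jsonlines(path):
    """Read a .jsonlines file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```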

The paths to the action-level AMR graphs and to the coreference files are specified in utils/paths.py (ACTION_AMR_DIR and RAW_COREF_DIR). The path to the output folder, which gets created automatically and will contain the generated files, is also specified in paths.py (JOINED_COREF_DIR).

Details about the output format and information included can be found at the top of the script itself.