sinantie / NeuralAmr

Sequence-to-sequence models for AMR parsing and generation
http://www.ikonstas.net/code

Neural AMR

Torch implementation of sequence-to-sequence models for AMR parsing and generation, based on the Harvard NLP framework. We provide code for pre-processing, anonymizing, de-anonymizing, training, and predicting from and to AMR. We also include pre-trained models trained on 20M sentences from Gigaword and fine-tuned on the AMR LDC2015E86: DEFT Phase 2 AMR Annotation R1 Corpus. You can find all the details in the following paper:

Requirements

The pre-trained models only run on GPUs, so you will need to have the following installed:

Installation

(Only for training models)

(Only for downloading the pretrained models)

Usage

AMR Generation

You can generate text from AMR graphs using our pre-trained model on 20M sentences from Gigaword, in two different ways:

You can optionally provide an argument that tells the system to accept either full AMR, as described in the annotation guidelines, or a stripped version, which removes variables, senses, and parentheses from leaves, and assumes a simpler markup for Named Entities, date mentions, and numbers. You can also provide the input in anonymized format, i.e., similar to stripped but with Named Entities, date mentions, and numbers anonymized.

An example using the full format:

(h / hold-04 :ARG0 (p2 / person :ARG0-of (h2 / have-org-role-91 :ARG1 (c2 / country :name (n3 / name :op1 "United" :op2 "States")) :ARG2 (o / official)))  :ARG1 (m / meet-03 :ARG0 (p / person  :ARG1-of (e / expert-01) :ARG2-of (g / group-01))) :time (d2 / date-entity :year 2002 :month 1) :location (c / city  :name (n / name :op1 "New" :op2 "York")))

The same example using the stripped format:

hold :ARG0 ( person :ARG0-of ( have-org-role :ARG1 (country :name "United States") :ARG2 official)) :ARG1 (meet :ARG0 (person  :ARG1-of expert :ARG2-of  group)) :time (date-entity :year 2002 :month 1) :location (city :name "New York" )

The same example using the anonymized format:

hold :ARG0 ( person :ARG0-of ( have-org-role :ARG1 location_name_0 :ARG2 official ) ) :ARG1 ( meet :ARG0 ( person :ARG1-of expert :ARG2-of group ) ) :time ( date-entity year_date-entity_0 month_date-entity_0 ) :location location_name_1
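As a rough sketch of how such an invocation might look (the exact arguments and their order here are assumptions, not the scripts' documented interface; see the Script Options section below), generation could be run either interactively or from a file:

# interactive: read one stripped AMR graph per line from stdin (e.g., paste the stripped example above)
./generate_amr_single.sh stripped
# batch: read graphs from a file, one per line (my_graphs.txt is a hypothetical file name)
./generate_amr.sh my_graphs.txt stripped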

For full details and more examples, see [here]().

AMR Parsing

You can also parse text to the corresponding AMR graph, using our pre-trained model on 20M sentences from Gigaword.

Similarly to AMR generation, you can parse text in two ways:

You can optionally provide an argument to the scripts that tells them either to accept raw text and perform NE recognition and anonymization on it, or to bypass this process entirely (textAnonymized).
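Analogously, a minimal sketch of a parsing invocation, assuming the scripts take this argument on the command line (the exact values, in particular text for the non-anonymized case, are assumptions; check the scripts themselves):

# parse raw text; the pipeline performs NE recognition and anonymization first
./parse_amr_single.sh text
# parse text that is already anonymized, bypassing that step
./parse_amr_single.sh textAnonymized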

Script Options (generate_amr.sh, generate_amr_single.sh, parse_amr.sh, parse_amr_single.sh)

(De-)Anonymization Process

The source code for the whole anonymization/deanonymization pipeline is provided under the java/AmrUtils folder. You can rebuild the code by running the script:

./rebuild_AmrUtils.sh

This should create the executable lib/AmrUtils.jar. The (de-)anonymization tools are controlled through the following shell script command (**note** that it is called automatically from the Lua code when parsing/generating, so you generally don't need to deal with it when running the scripts described above). The first argument denotes the specific (de-)anonymization operation to perform, and the second specifies whether the input is given directly as the third argument or read from a file with one input per line:

./anonDeAnon_java.sh anonymizeAmrStripped|anonymizeAmrFull|deAnonymizeAmr|anonymizeText|deAnonymizeText input_isFile[true|false] input

There are four main operations you can perform with the tools, namely anonymization of AMR graphs, anonymization of text sentences, de-anonymization of (predicted) sentences, and de-anonymization of (predicted) AMR graphs:
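For example, anonymizing the stripped graph from the Generation section directly from the command line (so input_isFile is false) looks like this, and the result should correspond to the anonymized example shown earlier:

./anonDeAnon_java.sh anonymizeAmrStripped false 'hold :ARG0 ( person :ARG0-of ( have-org-role :ARG1 (country :name "United States") :ARG2 official)) :ARG1 (meet :ARG0 (person :ARG1-of expert :ARG2-of group)) :time (date-entity :year 2002 :month 1) :location (city :name "New York" )'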

Finally, when running the tool with the input in a file (provide the path as the 3rd argument of the script and set the 2nd argument to true), you always need to provide the original file containing only the AMR graphs/sentences. The tool will then automatically create the corresponding anonymized file (*.anonymized), as well as the anonymization alignments file (*.alignments), during anonymization. Similarly, when de-anonymizing, it will automatically look for the *.anonymized and *.alignments files and create a new resulting file with the extension *.pred.
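As a sketch of this file-based workflow (graphs.txt is a hypothetical file with one stripped AMR graph per line; the generated file names follow the extension convention described above):

# anonymization: creates graphs.txt.anonymized and graphs.txt.alignments
./anonDeAnon_java.sh anonymizeAmrStripped true graphs.txt
# de-anonymization: looks for graphs.txt.anonymized and graphs.txt.alignments and writes graphs.txt.pred
./anonDeAnon_java.sh deAnonymizeAmr true graphs.txt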

(De)-Anonymizing Parallel Corpus (e.g., LDC versions)

If you have a parallel corpus, such as the LDC2015E86 that was used to train the models in this work, or Little Prince, which is included in this repository as well for convenience, then you need to follow a slightly different procedure.

The idea is to use alignments between the AMR graphs and the corresponding text in order to accurately identify the entities that will get anonymized. The alignments can be obtained using either the unsupervised aligner by Nima Pourdamghani or JAMR by Jeff Flanigan. If you are using the annotated LDC versions, they should already be aligned with the first aligner (use the files under the folder alignments/). The code in this repository supports either or both types of alignments.

In order to get alignments from JAMR on the provided Little Prince corpus, do the following:
  1. Open settings.properties and make sure amr.down.base and amr.jamr.alignments point to the right folders (they point to the Little Prince directory by default). You can also enable or disable several pre-processing options from here (a minimal example configuration is sketched after this list):

    • amr.down.useNeClusters which controls whether to use Named Entity clusters (person, organization, location, and other) instead of the fine-grained AMR categories; the default is true when preparing the corpus for Generation, and it should be set to false for Parsing. The clusters themselves can be altered through the property amr.concepts.ne.
    • amr.down.outputSense which controls whether to output senses on concepts (e.g., say-01 instead of just say); the default is false for Generation and true for Parsing.
    • amr.down.input which specifies which portion of the corpus to process; the default is training,dev,test, which are the folder names in the LDC corpora and Little Prince.

  2. Preprocess and anonymize the corpus by executing the script:

    ./anonParallel_java.sh

    This will take care of the bracket pre-processing, anonymization, and splitting of the corpus into training, dev, and test source and target files that can be used directly for training and evaluating your models. There are three options you can change there:

    • DATA_DIR which points to a directory that will hold the pre-processed files along with useful metadata, such as vocabularies, alignments, anonymization pairs, histograms, and so on.
    • OUT_DIR which refers to a directory containing only the essential anonymized training, dev, and test source and target files, as well as the anonymization alignments in separate files.
    • suffix which is a handy parameter for changing the name of the OUT_DIR directory.
  3. (Generation only) De-anonymize and automatically evaluate the output of a model using averaged BLEU, METEOR and multiBLEU by executing the script:

    ./recomputeMetrics.sh [INPUT_PATH REF_PATH]

    The script contains three important options:

    • DATASET which refers to the portion of the set to evaluate against; default is dev (the other option is test).
    • DATA_PATHNAME which points to the preprocessed corpus directory created from the previous script, which contains the reference data. Normally, it should be the same as OUT_DIR from above.
    • INPUT_PATH which is the folder containing the file(s) with anonymized predictions. If there are multiple files, for example from different epoch runs, the code automatically processes all of them and reports the one with the highest multiBLEU score.
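As referenced in step 1 above, here is a minimal sketch of the relevant settings.properties entries for preparing a corpus for Generation; the paths are hypothetical and the remaining values simply restate the defaults described above:

# hypothetical corpus and alignment locations
amr.down.base=data/little-prince
amr.jamr.alignments=data/little-prince/jamr.alignments
# portions of the corpus to process (folder names in the LDC corpora and Little Prince)
amr.down.input=training,dev,test
# use coarse NE clusters (default for Generation; set to false for Parsing)
amr.down.useNeClusters=true
# do not emit senses on concepts (set to true for Parsing)
amr.down.outputSense=false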